Closed athairus closed 8 years ago
This problem lends itself well to MapReduce:
Step 1: enumerate files/folders
input: QList of file and folder paths
output: QList of file paths
map(): pass individual file paths through unchanged, (recursively?) enumerate directory paths and turn into a QList of file paths
reduce(): build a QList by appending (start from empty)
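Step 1's map/reduce pair can be sketched in plain C++ (std::vector and std::filesystem standing in for QList and Qt's directory iteration; the function names are illustrative, not from the codebase):

```cpp
#include <filesystem>
#include <string>
#include <vector>

// Step 1 map(): expand one input path into a flat list of file paths.
// Individual file paths pass through unchanged; directories are
// recursively enumerated.
std::vector<std::string> mapEnumerate(const std::string &path) {
    namespace fs = std::filesystem;
    std::vector<std::string> out;
    if (fs::is_directory(path)) {
        for (const auto &entry : fs::recursive_directory_iterator(path))
            if (entry.is_regular_file())
                out.push_back(entry.path().string());
    } else {
        out.push_back(path); // pass through unchanged
    }
    return out;
}

// Step 1 reduce(): build the output list by appending, starting from empty.
void reduceAppend(std::vector<std::string> &acc,
                  const std::vector<std::string> &part) {
    acc.insert(acc.end(), part.begin(), part.end());
}
```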
Step 2: parse archive files
input: QList of file paths
output: QList of file paths,
QList of (file path, hash) tuples
map(): pass non .zip file paths through unchanged
introspect .zip files, enumerate and turn into a QList of file paths and a QList of (file path, hash) tuples
reduce(): build QList by appending (start from empty)
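A minimal sketch of step 2's map() dispatch, again with std:: types in place of QList. The archive reader is stubbed out; a real build would use an archive library (e.g. QuaZip or libzip) and pull each member's stored CRC32 from the central directory, which is why archive members never need hashing later:

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct ArchiveScanResult {
    std::vector<std::string> paths;                       // all file paths
    std::vector<std::pair<std::string, uint32_t>> hashes; // (path, hash)
};

bool hasExtension(const std::string &path, const std::string &ext) {
    return path.size() >= ext.size() &&
           path.compare(path.size() - ext.size(), ext.size(), ext) == 0;
}

// Stub: a real implementation would enumerate archive members with
// QuaZip/libzip and return (member name, stored CRC32) pairs.
std::vector<std::pair<std::string, uint32_t>>
readArchiveEntries(const std::string &) {
    return {};
}

// Step 2 map(): non-.zip paths pass through unchanged; .zip paths are
// expanded into one entry per archive member, hash included.
ArchiveScanResult mapParseArchive(const std::string &path) {
    ArchiveScanResult result;
    if (!hasExtension(path, ".zip")) {
        result.paths.push_back(path);
        return result;
    }
    for (const auto &entry : readArchiveEntries(path)) {
        std::string memberPath = path + "/" + entry.first;
        result.paths.push_back(memberPath);
        result.hashes.push_back({memberPath, entry.second});
    }
    return result;
}
```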
Step 3: check .cue files
input: QList of file paths,
QList of (file path, hash) tuples
output: QList of (game UUID, file path, system UUID) tuples,
QList of (file path, system UUID) tuples,
QList of (file path, system UUID QList) tuples,
QList of unknown file paths,
QList of .bin file paths
filter(): all file paths that end in .cue
map(): enumerate all .bin file paths, return .bin file path QList (assume within folder)
apply step 4's map()
reduce(): build QLists by appending (start from empty)
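The .bin enumeration in step 3's map() amounts to reading the FILE directives out of each .cue sheet and resolving them against the .cue's folder (the "assume within folder" rule). A sketch, with the .cue contents passed in as a string for illustration; the real code would read the file from disk:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Step 3 map(): extract the quoted file names from lines like
//   FILE "Game (Track 1).bin" BINARY
// and resolve them relative to the .cue file's own directory.
std::vector<std::string> mapCueToBinPaths(const std::string &cueDir,
                                          const std::string &cueText) {
    std::vector<std::string> binPaths;
    std::istringstream in(cueText);
    std::string line;
    while (std::getline(in, line)) {
        auto open = line.find('"');
        if (line.find("FILE") == std::string::npos || open == std::string::npos)
            continue;
        auto close = line.find('"', open + 1);
        if (close == std::string::npos)
            continue;
        binPaths.push_back(cueDir + "/" +
                           line.substr(open + 1, close - open - 1));
    }
    return binPaths;
}
```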
Step 4: check all other files
input: QList of file paths,
QList of (file path, hash) tuples,
QList of .bin file paths,
QList of game UUIDs,
QList of (file path, system UUID) tuples,
QList of (file path, system UUID QList) tuples,
QList of unknown file paths
output: QList of (game UUID, file path, system UUID) tuples,
QList of (file path, system UUID) tuples,
QList of (file path, system UUID QList) tuples,
QList of unknown file paths
filter(): from QList of file paths:
all non .cue file paths,
all .bin file paths that don't match given QList of .bin file paths
map(): hash file (check QList first), compare to db. If hit, return game UUID.
If miss, compare file name to database (how fuzzy should we match? definitely skip any match whose extensions differ). If hit, return game UUID.
If miss, check header/signatures. If hit, return (file path, system UUID) tuple.
If miss, check extension against system (libretro) db. If hit, return (file path, system UUID) iff one possible system.
If miss, and miss was due to there being multiple possible systems, return (file path, system UUID QList) tuple.
Otherwise, return as an unknown file path.
reduce(): build QLists by appending (start with input list values)
This is being worked on in the feature-importer-mt branch.
Currently works like this:
Step 1: enumerate files/folders
input: QList of file and folder paths
output: FileList (QList<FileEntry>)
map(): pass individual file paths through unchanged, recursively enumerate directory paths and turn into a QList of file paths
reduce(): append given list to the output list
Step 2: parse archive files
input: FileList (step 1's output)
output: FileList
map(): pass non .zip file paths through unchanged
introspect .zip files, enumerate and turn into a QList of FileEntry objects (one per file in the archive), caching each checksum along the way
reduce(): append given list to the output list
Pre step 3: make copy of step 2's output, save for post step 3
Step 3: parse .cue files
input: FileList (step 2's output)
output: FileList (only .bin files)
map(): enumerate all .bin file paths, return .bin file path QList (assume within folder)
reduce(): append given list to the output list
Post step 3: convert step 2's output and step 3's output into sets, subtract step 3's output from step 2's output
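The post step 3 subtraction is just a set difference keyed by path, so step 4 never touches a .bin track that a .cue sheet already accounts for. A sketch with std::vector in place of FileList:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Post step 3: remove every path claimed by a .cue sheet (step 3's
// output) from the full file list (step 2's output).
std::vector<std::string> subtractFileLists(
    const std::vector<std::string> &allFiles,
    const std::vector<std::string> &cueClaimedBins) {
    std::unordered_set<std::string> claimed(cueClaimedBins.begin(),
                                            cueClaimedBins.end());
    std::vector<std::string> remaining;
    for (const auto &path : allFiles)
        if (claimed.count(path) == 0)
            remaining.push_back(path);
    return remaining;
}
```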
Step 4: match against database
input: FileList (step 2's output - step 3's output)
output: FileList
map(): hash file (if hash not already cached -- in other words, anything not stored in an archive file), compare to db. If hit, write game & system UUIDs to FileEntry and mark result.
If miss, compare file name to database (how fuzzy should we match? definitely skip any match whose extensions differ). If hit, write game & system UUIDs to FileEntry and mark result.
If miss, check header/signatures. If hit, write system UUID(s) to FileEntry and mark result.
If miss, check extension against system (libretro) db. If hit, write system UUID(s) to FileEntry and mark result.
Otherwise, mark as an unknown file.
reduce(): append given list to the output list
Post step 4: Pass resulting FileList to user for review
Merged into master.
Is it possible? Can we get more performance from making a thread pool and assigning each thread an individual scanning task? Can we hit the DB from multiple threads at once? Will the HDD/SSD still be a bottleneck even with this going?
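For the thread-pool question, the shape would be something like the sketch below: a fixed pool of workers pulling scanning tasks off a shared counter, QtConcurrent-style. Whether it wins depends on whether hashing (CPU-bound) or reading (HDD/SSD-bound) dominates; note also that a QSqlDatabase connection may only be used from the thread that created it, so each worker hitting the DB would need its own connection. scanOne is a stand-in for the hash-and-lookup work:

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Run scanOne over every path using a fixed pool of worker threads.
void scanInParallel(const std::vector<std::string> &paths,
                    const std::function<void(const std::string &)> &scanOne,
                    unsigned threadCount) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threadCount; ++t)
        pool.emplace_back([&] {
            // Each worker claims the next unscanned path until none remain.
            for (std::size_t i = next++; i < paths.size(); i = next++)
                scanOne(paths[i]);
        });
    for (auto &worker : pool)
        worker.join();
}
```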