Closed athairus closed 8 years ago
This problem lends itself well to MapReduce:
Step 1: enumerate files/folders
input: QList of file and folder paths
output: QList of file paths
map(): pass individual file paths through unchanged, (recursively?) enumerate directory paths and turn into a QList of file paths
reduce(): build a QList by appending (start from empty)
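Step 1's map/reduce pair can be sketched in plain C++ (std::vector and std::filesystem standing in for QList and Qt's directory iteration; the function names are illustrative, not from the codebase):

```cpp
#include <filesystem>
#include <string>
#include <vector>

// Step 1 map(): expand one input path into a flat list of file paths.
// Individual file paths pass through unchanged; directories are
// recursively enumerated.
std::vector<std::string> mapEnumerate(const std::string &path) {
    namespace fs = std::filesystem;
    std::vector<std::string> out;
    if (fs::is_directory(path)) {
        for (const auto &entry : fs::recursive_directory_iterator(path))
            if (entry.is_regular_file())
                out.push_back(entry.path().string());
    } else {
        out.push_back(path); // pass through unchanged
    }
    return out;
}

// Step 1 reduce(): build the output list by appending, starting from empty.
void reduceAppend(std::vector<std::string> &acc,
                  const std::vector<std::string> &part) {
    acc.insert(acc.end(), part.begin(), part.end());
}
```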
Step 2: parse archive files
input: QList of file paths
output: QList of file paths,
QList of (file path, hash) tuples
map(): pass non .zip file paths through unchanged
introspect .zip files, enumerate and turn into a QList of file paths and a QList of (file path, hash) tuples
reduce(): build QList by appending (start from empty)
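A minimal sketch of step 2's map() dispatch, again with std:: types in place of QList. The archive reader is stubbed out; a real build would use an archive library (e.g. QuaZip or libzip) and pull each member's stored CRC32 from the central directory, which is why archive members never need hashing later:

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct ArchiveScanResult {
    std::vector<std::string> paths;                       // all file paths
    std::vector<std::pair<std::string, uint32_t>> hashes; // (path, hash)
};

bool hasExtension(const std::string &path, const std::string &ext) {
    return path.size() >= ext.size() &&
           path.compare(path.size() - ext.size(), ext.size(), ext) == 0;
}

// Stub: a real implementation would enumerate archive members with
// QuaZip/libzip and return (member name, stored CRC32) pairs.
std::vector<std::pair<std::string, uint32_t>>
readArchiveEntries(const std::string &) {
    return {};
}

// Step 2 map(): non-.zip paths pass through unchanged; .zip paths are
// expanded into one entry per archive member, hash included.
ArchiveScanResult mapParseArchive(const std::string &path) {
    ArchiveScanResult result;
    if (!hasExtension(path, ".zip")) {
        result.paths.push_back(path);
        return result;
    }
    for (const auto &entry : readArchiveEntries(path)) {
        std::string memberPath = path + "/" + entry.first;
        result.paths.push_back(memberPath);
        result.hashes.push_back({memberPath, entry.second});
    }
    return result;
}
```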
Step 3: check .cue files
input: QList of file paths,
QList of (file path, hash) tuples
output: QList of (game UUID, file path, system UUID) tuples,
QList of (file path, system UUID) tuples,
QList of (file path, system UUID QList) tuples,
QList of unknown file paths,
QList of .bin file paths
filter(): all file paths that end in .cue
map(): enumerate all .bin file paths, return .bin file path QList (assume within folder)
apply step 4's map()
reduce(): build QLists by appending (start from empty)
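The .bin enumeration in step 3's map() amounts to reading the FILE directives out of each .cue sheet and resolving them against the .cue's folder (the "assume within folder" rule). A sketch, with the .cue contents passed in as a string for illustration; the real code would read the file from disk:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Step 3 map(): extract the quoted file names from lines like
//   FILE "Game (Track 1).bin" BINARY
// and resolve them relative to the .cue file's own directory.
std::vector<std::string> mapCueToBinPaths(const std::string &cueDir,
                                          const std::string &cueText) {
    std::vector<std::string> binPaths;
    std::istringstream in(cueText);
    std::string line;
    while (std::getline(in, line)) {
        auto open = line.find('"');
        if (line.find("FILE") == std::string::npos || open == std::string::npos)
            continue;
        auto close = line.find('"', open + 1);
        if (close == std::string::npos)
            continue;
        binPaths.push_back(cueDir + "/" +
                           line.substr(open + 1, close - open - 1));
    }
    return binPaths;
}
```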
Step 4: check all other files
input: QList of file paths,
QList of (file path, hash) tuples,
QList of .bin file paths,
QList of game UUIDs,
QList of (file path, system UUID) tuples,
QList of (file path, system UUID QList) tuples,
QList of unknown file paths
output: QList of (game UUID, file path, system UUID) tuples,
QList of (file path, system UUID) tuples,
QList of (file path, system UUID QList) tuples,
QList of unknown file paths
filter(): from QList of file paths:
all non .cue file paths,
all .bin file paths that don't match given QList of .bin file paths
map(): hash file (check QList first), compare to db. If hit, return game UUID.
If miss, compare file name to database (how fuzzy should we match? definitely skip any match whose extensions differ). If hit, return game UUID.
If miss, check header/signatures. If hit, return (file path, system UUID) tuple.
If miss, check extension against system (libretro) db. If hit, return (file path, system UUID) iff one possible system.
If miss, and miss was due to there being multiple possible systems, return (file path, system UUID QList) tuple.
Otherwise, return as an unknown file path.
reduce(): build QLists by appending (start with input list values)
This is being worked on in the feature-importer-mt branch.
Currently works like this:
Step 1: enumerate files/folders
input: QList of file and folder paths
output: FileList (QList<FileEntry>)
map(): pass individual file paths through unchanged, recursively enumerate directory paths and turn into a QList of file paths
reduce(): append given list to the output list
Step 2: parse archive files
input: FileList (step 1's output)
output: FileList
map(): pass non .zip file paths through unchanged
introspect .zip files, enumerate and turn into a QList of FileEntry objects (one per file in the archive), caching each checksum along the way
reduce(): append given list to the output list
Pre step 3: make copy of step 2's output, save for post step 3
Step 3: parse .cue files
input: FileList (step 2's output)
output: FileList (only .bin files)
map(): enumerate all .bin file paths, return .bin file path QList (assume within folder)
reduce(): append given list to the output list
Post step 3: convert step 2's output and step 3's output into sets, subtract step 3's output from step 2's output
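The post step 3 subtraction is just a set difference keyed by path, so step 4 never touches a .bin track that a .cue sheet already accounts for. A sketch with std::vector in place of FileList:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Post step 3: remove every path claimed by a .cue sheet (step 3's
// output) from the full file list (step 2's output).
std::vector<std::string> subtractFileLists(
    const std::vector<std::string> &allFiles,
    const std::vector<std::string> &cueClaimedBins) {
    std::unordered_set<std::string> claimed(cueClaimedBins.begin(),
                                            cueClaimedBins.end());
    std::vector<std::string> remaining;
    for (const auto &path : allFiles)
        if (claimed.count(path) == 0)
            remaining.push_back(path);
    return remaining;
}
```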
Step 4: match against database
input: FileList (step 2's output - step 3's output)
output: FileList
map(): hash file (if hash not already cached -- in other words, anything not stored in an archive file), compare to db. If hit, write game & system UUIDs to FileEntry and mark result.
If miss, compare file name to database (how fuzzy should we match? definitely skip any match whose extensions differ). If hit, write game & system UUIDs to FileEntry and mark result.
If miss, check header/signatures. If hit, write system UUID(s) to FileEntry and mark result.
If miss, check extension against system (libretro) db. If hit, write system UUID(s) to FileEntry and mark result.
Otherwise, mark as an unknown file.
reduce(): append given list to the output list
Post step 4: Pass resulting FileList to user for review
Merged into master.
Is it possible? Can we get more performance from making a thread pool and assigning each thread an individual scanning task? Can we hit the DB from multiple threads at once? Will the HDD/SSD still be a bottleneck even with this going?
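For the thread-pool question, the shape would be something like the sketch below: a fixed pool of workers pulling scanning tasks off a shared counter, QtConcurrent-style. Whether it wins depends on whether hashing (CPU-bound) or reading (HDD/SSD-bound) dominates; note also that a QSqlDatabase connection may only be used from the thread that created it, so each worker hitting the DB would need its own connection. scanOne is a stand-in for the hash-and-lookup work:

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Run scanOne over every path using a fixed pool of worker threads.
void scanInParallel(const std::vector<std::string> &paths,
                    const std::function<void(const std::string &)> &scanOne,
                    unsigned threadCount) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threadCount; ++t)
        pool.emplace_back([&] {
            // Each worker claims the next unscanned path until none remain.
            for (std::size_t i = next++; i < paths.size(); i = next++)
                scanOne(paths[i]);
        });
    for (auto &worker : pool)
        worker.join();
}
```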