RFE: Reliable data-driven alternative to scanner heuristics 2.0

Description

This idea requires massive changes to the scanner, and to either the playlist GUI or the playlist file format, but I think it would be worth it. Recently there have been some requests for a 'playlist/console scanner' option in the GUI for scanning the filesystem, and this could be used to implement that too.

This is a restatement of https://github.com/libretro/RetroArch/issues/8672 into a single post.

Glossary:

RA: RetroArch.

LF (launcher file): the file that appears on a playlist and is given to the core. Often a ROM, but in this proposal it can be something else. There can be multiple LFs per game.

DF (detect file): a file whose format is a small language describing the expected name of an LF and, optionally, a mapping from each match to a libretro-database entry. The optional mapping is needed because LFs are most often editable and not part of a game distribution, and are also often not modified by hacks, so they are not suitable for identifying a game.

SF (signature file): a file that libretro-database uses to identify a game. There can be multiple SFs per game; for instance, DOS games may have an installer SF and a game SF.

Expected behavior

This idea is orthogonal to the scanner heuristics that map a game ROM/ISO to a playlist (i.e. console identification) and cannot replace them completely, at least not if you want users who didn't organize their games on the filesystem to be able to 'scan' a random mishmash of ROM types. It's just a way to disable that mess and take control by making the scanner only consider ROMs of one console in or below a directory. If you have any idea how to make this type of configuration replace the heuristics, please comment, because that is the messiest part of the scanner code from what I've seen.

As an aside to this proposal: if you have, for instance, a game with a cue referencing an ISO (for mednafen), and both are valid LFs, both appear on the playlist. 'Filtering' this is not handled in this idea, but it could be done once the whole collection is known, by understanding certain file types. For instance, after acquiring a cue or m3u file for a playlist, the files it references could be hidden from the playlist and their metadata added to the corresponding cue or m3u (if the cue or m3u has none and the referenced files do).

The idea

The main idea is that users organize their games per platform on the filesystem. Then, when the scanner is given a start directory, the recursive function that walks it gains a stack argument and does this:

On entering a dir:
  warn if more than one DF exists (a single DF should handle multiple valid formats, one per line)
  if a file named $PLAYLIST.df exists, where $PLAYLIST is a valid playlist name for a console:
    read $PLAYLIST.df and extract its lines
    push a struct with the playlist and the lines onto the stack
  if there is a struct on top of the stack:
    whitelist files for that fixed playlist using the lines, and possibly associate a libretro-database mapping
  else:
    the original code goes here, using the old heuristics to find the 'right' playlist
Prior to leaving a dir (returning from the recursion):
  if a $PLAYLIST.df exists in this dir and its struct is on top of the stack, pop it
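
The steps above could be sketched roughly like this in Python (RetroArch itself is C; this is only an illustration). The function name `scan`, the `(playlist, path)` result tuples and the `accept` hook are all made up for the sketch, playlist-name validation is skipped, and the old heuristics are reduced to a `None` placeholder.

```python
import os

def scan(directory, stack=None, found=None, accept=lambda lines, path: True):
    """Walk `directory`; a '<PLAYLIST>.df' file pins the playlist for its subtree.

    `accept` is a hypothetical hook standing in for the real line matching.
    """
    stack = [] if stack is None else stack
    found = [] if found is None else found
    entries = sorted(os.listdir(directory))
    dfs = [e for e in entries
           if e.lower().endswith(".df")
           and os.path.isfile(os.path.join(directory, e))]
    if len(dfs) > 1:
        print(f"warning: {len(dfs)} DF files in {directory}; using {dfs[0]}")

    pushed = False
    if dfs:
        with open(os.path.join(directory, dfs[0]), encoding="utf-8") as fh:
            lines = [ln.strip() for ln in fh if ln.strip()]
        stack.append((dfs[0][:-3], lines))  # strip the '.df' extension
        pushed = True

    for name in entries:
        path = os.path.join(directory, name)
        if os.path.isdir(path):
            scan(path, stack, found, accept)
        elif name not in dfs:
            if stack:
                playlist, lines = stack[-1]
                if accept(lines, path):
                    found.append((playlist, path))
            else:
                # Placeholder: the old heuristics would pick a playlist here.
                found.append((None, path))

    if pushed:
        stack.pop()  # leaving the dir that introduced this DF
    return found
```

A DF in a nested directory simply shadows the outer one until the recursion returns, which is the reason for using a stack rather than a single variable.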

Detect file format

Each line of text maps either a single file or a list of files, optionally, to a single libretro-database entry (ideally) or to multiple entries (if there is no other option).

If a line has no mapping (no => ) but just a single file name, globbed or not, that file serves as both LF and SF.

A matching line removes the leftmost file (the LF) in the mapping from the pool available to the following lines, both for performance and for correctness. (This concerns the files, not the database entries.) Again for performance, the scan should cache calculated checksums for the case where a file doesn't match a line, though the XATTR method proposed further down would be a longer-lived cache and I'd use it when possible.
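
A minimal sketch of the in-scan checksum cache, assuming a plain dict keyed by path is good enough for the lifetime of one scan. The names `_crc_cache` and `crc32_of` are invented for the example; only `zlib.crc32` is a real library call.

```python
import zlib

_crc_cache = {}  # path -> CRC32, kept for the duration of one scan

def crc32_of(path):
    """Return the CRC32 of a file, computing it at most once per scan."""
    if path not in _crc_cache:
        crc = 0
        with open(path, "rb") as fh:
            # Read in 1 MiB chunks so multi-GB images are not loaded whole.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                crc = zlib.crc32(chunk, crc)
        _crc_cache[path] = crc & 0xFFFFFFFF
    return _crc_cache[path]
```

With this, a file whose checksum was calculated for a line that ended up not matching costs nothing extra when a later line asks for the same checksum.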

+ is a globbing metacharacter that doesn't cross directories (it matches within a file name only). If it appears on both sides of a mapping, a playlist entry is only added when it matches the same text on both sides. This lets +-globbed libretro-database mappings map each LF to its own metadata entry rather than all to that of the first match; it is essentially a way for users to 'connect' an LF to an SF without creating a DF for every single game: they 'just' have to name or rename the LF correctly.

* is a metacharacter for directory-only globbing. It should only appear on the right-hand side of a line, it does cross directories, and it may match nothing. It lets the SF side of a line search a whole directory tree rather than a single directory, while still allowing the 'both sides' rule of + to take effect.

It's best if the directory separator here is a fixed choice, '/', so the files work on both Unix and Windows, and best if the lines are matched against files case-insensitively.
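
One possible way to implement the two metacharacters is to translate each side of a line into a regular expression. This is a sketch under assumptions that are mine, not part of the proposal: at most one + per side, paths already normalized to '/', and a made-up helper name `pattern_to_regex`.

```python
import re

def pattern_to_regex(pattern, plus=None):
    """Translate one side of a DF line into a case-insensitive regex.

    '+' matches within a single file name (never crosses '/').
    '*' matches zero or more whole directory components.
    If `plus` is given, '+' must equal that text (the 'both sides' rule).
    """
    parts = []
    for ch in pattern:
        if ch == "+":
            parts.append("([^/]+)" if plus is None else re.escape(plus))
        elif ch == "*":
            parts.append("(?:[^/]+/)*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

# Left side: capture what '+' matched in the LF name.
stem = pattern_to_regex("+.conf").fullmatch("Doom.conf").group(1)

# Right side: the same text must appear in the SF path.
sf = pattern_to_regex("game/*+.exe", plus=stem)
print(bool(sf.fullmatch("game/bin/DOOM.EXE")))   # True
print(bool(sf.fullmatch("game/bin/other.exe")))  # False
```

The scanner would use `fullmatch` against paths relative to the DF directory, so a bare `+.cue` can never match a file in a subdirectory by accident.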

In the libretro-database mapping it might be helpful to allow different lookup methods (CRC32, NAME, MD5) on the SF, given as a suffix, and it might be helpful to allow a fixed entry without a search. If no suffix is given, no attempt is made to fetch metadata.

Of special interest here is that these libretro-database matching methods can be extended. For instance, if the database gets support for the CHD internal SHA-1 checksum in the future, you could code a 'CHD' method, or a 'PS1SERIAL' one, or even a Unix-only 'XATTR' method to reuse a checksum recorded on the file, which supports softpatches etc.
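
To show how little parsing a line needs, here is a rough sketch that splits a DF line into LF pattern, SF pattern and method suffix. The `DFLine` type and `parse_df_line` name are hypothetical, and the sketch assumes the suffix is whatever follows the last ':' on the SF side.

```python
from typing import NamedTuple, Optional

class DFLine(NamedTuple):
    lf: str                 # launcher-file pattern (left of '=>')
    sf: str                 # signature-file pattern (right of '=>')
    method: Optional[str]   # 'CRC32', 'MD5', a fixed value, ... or None

def parse_df_line(line: str) -> DFLine:
    left, arrow, right = line.partition("=>")
    target = (right if arrow else left).strip()
    sf, colon, method = target.rpartition(":")
    if not colon:
        # No ':suffix' at all: a playlist entry without metadata.
        sf, method = target, None
    sf = sf.strip()
    # Without '=>' the same file serves as both LF and SF.
    lf = left.strip() if arrow else sf
    return DFLine(lf, sf, method.strip() if method else None)

print(parse_df_line("+.cue => + (Track 1).bin:CRC32"))
print(parse_df_line("+.m3u"))
```

A new method such as 'CHD' or 'XATTR' would then only need a new entry in whatever table dispatches on `method`; the line syntax stays the same.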

To make it clear, some examples:

Sony - Playstation.df with content

+.cue => +.bin:CRC32

In and under the DF dir, search for any cue file as LF and get the metadata from libretro-database by looking up the CRC32 of the '.bin' file in the same dir with the same name (minus extension).

Sony - Playstation.df with content

+.cue => + (Track 1).bin:CRC32 (note the space)
+.cue => +.bin:CRC32

As above, but you 'know' that you have redump files, so a game with digital audio can have separate tracks.

NEC - Turbografx-16.df with content

+.cue => + (Track 2).bin:CRC32
+.cue => +.bin:CRC32

TurboGrafx CDs need to use the second track for identification, because that is where the game actually is.

DOS.df with contents

dosbox.conf => game/+.exe:CRC32

For any dosbox.conf file (a fixed name) in or under the DF dir, use it as LF, get the metadata from the DOS database in libretro-database by looking up the CRC32 of the first executable found under the 'game' subdir, and place the entry on the 'DOS' playlist.

As you probably noticed, in this case the metadata is uncertain, because more than one executable might match an entry in the database. This ambiguity can be resolved by using a fixed DOS.df with contents dosbox.conf => game/game.exe:CRC32, but that is not scalable, hence the allowance to show 'a' metadata entry in this case.

DOS.df with contents

+.conf => game/*+.exe:CRC32

Speaking of scalable: here the 'both sides' rule of + applies, but the * allows the right-hand side to search the whole game/ directory tree for executables, which is much more reliable.

DOS.df with contents

+.conf => game/game.exe:CRC32

If there are two or more conf files in the dir and they match that exe, they're all entered with the same metadata but different LF. This also shows why only the LF is removed from the pool of possible files after a line; other LFs could use the same SF, but no line should match the same LF again.
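
The pool rule can be illustrated with a deliberately simplified sketch: rules are reduced to an LF extension plus an SF template in which + stands for the LF name without extension, and all files are plain relative path strings. `apply_rules` and this rule shape are assumptions made for the example, not the proposed DF syntax.

```python
def apply_rules(rules, names):
    """Apply rules in order; a matched LF leaves the pool, an SF never does.

    rules: list of (lf_extension, sf_template or None)
    names: relative paths found in the directory being scanned
    """
    present = {n.lower(): n for n in names}  # case-insensitive lookup
    pool = sorted(names)                     # LF candidates still unclaimed
    entries = []
    for lf_ext, sf_template in rules:
        remaining = []
        for name in pool:
            if not name.lower().endswith(lf_ext):
                remaining.append(name)
                continue
            if sf_template is None:
                # No mapping: the file is both LF and SF.
                entries.append((name, name))
                continue
            stem = name[:-len(lf_ext)]
            sf = present.get(sf_template.replace("+", stem).lower())
            if sf is None:
                remaining.append(name)       # let a later line try this LF
            else:
                entries.append((name, sf))
        pool = remaining
    return entries

files = ["a.conf", "b.conf", "game/game.exe"]
print(apply_rules([(".conf", "game/game.exe"), (".conf", None)], files))
```

Both conf files end up on the playlist pointing at the same SF, and the second rule finds nothing left to match because the first one already claimed both LFs.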

DOS.df with contents

dosbox.conf => game/game.exe:84358932

A fixed mapping, where the libretro database is only used to fetch metadata for an already calculated CRC. Not that useful an idea, but it may help if there is a file of several GB you don't want to rescan, or a hack that is not in libretro-database that you want to force to use a particular metadata entry.

Sony - Playstation.df with content

+.cue => + (Track 1).bin:CRC32
+.m3u

As mentioned before, but I want to emphasize it: a DF can have multiple lines for multiple types of allowed LF. m3us are a 'special case' of LF whose libretro-database mapping requires parsing and filtering, which would be inappropriate for this file format, so it's simply not done here.

After the scan is over (or when the playlist is shown), the m3us would be parsed, and the entries in them that match LFs on the playlist would have those LFs hidden and their metadata used as the metadata of the m3u file. This is the main part of this scheme that I think requires GUI code modification, specifically code to hide entries, though that should be done anyway if you want m3u support in the playlist.

I think it cannot be done in the playlist format, because the metadata of the individual files in the m3u is cached there, and removing them would necessitate a new scan, unless the playlist format gets special support for m3u and multiple 'keys' for metadata, one per item in it.
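
The post-scan m3u pass might look something like the following sketch. It assumes the playlist is available as a dict from '/'-separated LF path to metadata (or `None`), that m3u contents are passed in as text, and that the m3u simply adopts the metadata of the first member that has an entry; `fold_m3u` is a made-up name and none of this reflects the actual playlist code.

```python
import posixpath

def fold_m3u(entries, m3u_text):
    """Return the set of LFs to hide; fill in missing m3u metadata in place.

    entries:  dict of LF path -> metadata (None when nothing was fetched)
    m3u_text: dict of m3u path -> file contents
    """
    hidden = set()
    for m3u_path, text in m3u_text.items():
        base = posixpath.dirname(m3u_path)
        for raw in text.splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comment/directive lines
            member = posixpath.normpath(posixpath.join(base, line))
            if member in entries and member != m3u_path:
                hidden.add(member)
                if entries.get(m3u_path) is None:
                    entries[m3u_path] = entries[member]
    return hidden
```

Keeping the hidden set separate from the entries means the cached per-file metadata survives, so un-hiding a disc later would not need a rescan.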

Taking into account that a matching file mapping removes the leftmost file from the possibilities of the lines below, a complete redump + extras rule on Linux could be:

+.cue =>  + (Track 1).bin:XATTR
+.cue =>  +.bin:XATTR
+.cue =>  + (Track 1).iso:XATTR
+.cue =>  +.iso:XATTR
+.cue
+.m3u

In order, from more to less specific: two normal redump mappings; mappings for hacked games that the patch turned into ISOs (and that the user named .iso instead of .bin); remaining cues whose name has no relation to the bin/iso and which therefore have no libretro-database candidate key (without parsing); then m3us, which will filter out all of the others that match and consume their metadata after the scan. If possible, the metadata would be fetched by checking the extended attributes first, and potentially saving the checksum there after calculation if it was missing, so only the first scan is slow.
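
A hedged sketch of what an 'XATTR' method could do, using Python's `os.getxattr` and `os.setxattr` (available on Linux only). The attribute name `user.crc32` is an assumption of mine, and the sketch ignores the question of detecting a file that changed after the attribute was written; a real implementation would probably need to store size or mtime alongside the checksum.

```python
import os
import zlib

XATTR_KEY = "user.crc32"  # hypothetical attribute name

def crc32_file(path):
    """Compute the CRC32 of a file as 8 uppercase hex digits."""
    crc = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            crc = zlib.crc32(chunk, crc)
    return format(crc & 0xFFFFFFFF, "08X")

def cached_crc32(path):
    """Prefer a checksum stored in an extended attribute; else compute and try to store it."""
    try:
        return os.getxattr(path, XATTR_KEY).decode("ascii")
    except (AttributeError, OSError):
        pass  # platform without xattr support, or attribute not set yet
    value = crc32_file(path)
    try:
        os.setxattr(path, XATTR_KEY, value.encode("ascii"))
    except (AttributeError, OSError):
        pass  # filesystem refuses user attributes; fall back to recomputing
    return value
```

On filesystems or platforms without user extended attributes this quietly degrades to recomputing the checksum every scan, which matches the 'use it when possible' intent.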

I think that's all; any suggestions or ideas? If you have any idea how to simplify the RA heuristics code beyond this I encourage you to post it.

edits:

removed the 'cue/iso' filtering. It's only needed in case of a mistake in the example, so I think it's not a good idea now. Added the rule that matching file mappings are removed from consideration by the following lines, and an example.
added a + operator to replace the * and used * to serve as a (possibly empty, possibly crossing directories) eager path glob.
clarified that lines without a mapping work as both LF and SF, and that the :suffix methods are optional; if none is given, no metadata is available for the matched files, just a playlist entry.
added a note that files that don't match a line but calculated a checksum should obviously cache that information to prevent recalculation on the next line that tries to match.
