j68k / verifydump

A tool for verifying that .chd/.rvz disc images match Redump Datfiles
MIT License

Keep track of already-verified .chd files to avoid having to re-convert them every time #12

Open j68k opened 2 years ago

j68k commented 2 years ago

This idea was described by ZeroBANG on Reddit.

j68k commented 2 years ago

I think a nice way of implementing this would be to keep a cache file (perhaps in the folder with the .chd files or perhaps just in ~/.cache or something). It would record successful validations with the name of the .chd, the SHA-1 of the .chd, and a reference to the game in the Datfile.
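A minimal sketch of that cache file idea might look like the following. Everything here is illustrative — the `~/.cache/verifydump` location, the JSON layout, and the function name are assumptions, not verifydump's actual implementation:

```python
import hashlib
import json
from pathlib import Path

# Hypothetical cache location and entry layout, per the idea above:
# one entry per .chd recording its SHA-1 and the matched Datfile game.
CACHE_PATH = Path.home() / ".cache" / "verifydump" / "verified.json"

def record_verification(chd_path: Path, game_name: str,
                        cache_path: Path = CACHE_PATH) -> None:
    """Record a successful verification: .chd name, .chd SHA-1, Datfile game."""
    sha1 = hashlib.sha1(chd_path.read_bytes()).hexdigest()
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    cache[chd_path.name] = {"sha1": sha1, "game": game_name}
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(cache, indent=2))
```

On a later run, a cache hit on the name whose stored SHA-1 matches the file would let verifydump skip the expensive chd-to-cue conversion.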

j68k commented 2 years ago

The reference to the game in the Datfile would need to store enough information so that we could tell if the game has changed in a new version of the Datfile. A simple way of doing that would just be to copy the information about the individual files from the Datfile into the cache file and compare that with the current Datfile when running.

j68k commented 2 years ago

There could be an option to skip verifying the SHA-1 of previously verified .chd files, since that would just be checking for bitrot/accidental modification. Maaaybe that option should be on by default.

i30817 commented 2 years ago

On Linux, I use extended attributes for this. Unfortunately, this wouldn't work on Windows (NTFS does have a similar kind of metadata, but it requires administrative rights). The advantage is that you can move the files without losing track of the metadata, have same-named files with no problem, and it still doesn't change the files/checksum.

A disadvantage is that copying the files in the console nukes the information instead of copying it, unless you pass an uncommon switch (a normal copy-paste in Nautilus does preserve them). Another disadvantage for CHD in particular is that you'd have to set attributes for all of the 'tracks', if any.

Another disadvantage is that, very unfortunately, hardlinks share extended attributes instead of cloning them, so if you use hardlinks for different purposes, you cannot/should not store information that isn't completely agnostic to those differences, i.e.: hardlink/original checksum == ok; different softpatches + original/hardlink checksum == not ok.
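The extended-attribute approach described above could be sketched like this. It is Linux-only (`os.setxattr`/`os.getxattr`), the attribute name is invented for illustration, support depends on the filesystem, and — as noted — hardlinks share the attribute rather than cloning it:

```python
import hashlib
import os

# Hypothetical attribute name; "user." namespace attributes are the ones
# an unprivileged process can normally set on Linux.
XATTR_NAME = "user.verifydump.sha1"

def mark_verified(path: str) -> str:
    """Compute the file's SHA-1 and stash it in an extended attribute.
    Setting the attribute does not change the file's contents or checksum."""
    with open(path, "rb") as f:
        sha1 = hashlib.sha1(f.read()).hexdigest()
    os.setxattr(path, XATTR_NAME, sha1.encode())
    return sha1

def cached_sha1(path: str):
    """Return the stashed SHA-1, or None if the attribute isn't set."""
    try:
        return os.getxattr(path, XATTR_NAME).decode()
    except OSError:
        return None
```

Because the attribute travels with the inode, renames and moves within a filesystem keep it, which is exactly the property a name-keyed cache file lacks.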

j68k commented 2 years ago

I thought about writing a standard sha1sum file next to the verified files, with a comment that includes an additional hash of all the game details from the Datfile. So that way verifydump could recompute that hash of the game details and see that the verification is still valid for the current Datfile, and you could also use the standard sha1sum tool to verify against bitrot if you wanted.

But I dunno if people would appreciate having a hash file in the folder with their .chds and having to learn what that's for, so it might just be simpler to write a custom file to ~/.cache or similar 🤔
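The sha1sum-file idea above might look like this sketch. The comment key (`verifydump-game-details`) and the way the Datfile entries are serialized are invented for illustration; also, depending on the coreutils version, `sha1sum -c` may warn about the comment line even though it still checks the hash line:

```python
import hashlib
from pathlib import Path

def game_details_hash(rom_entries) -> str:
    """Hash the Datfile's per-file details, given (name, size, sha1) tuples.
    Sorting makes the hash independent of entry order in the Datfile."""
    blob = "\n".join(f"{name}\t{size}\t{sha1}"
                     for name, size, sha1 in sorted(rom_entries))
    return hashlib.sha1(blob.encode()).hexdigest()

def write_sha1sum(chd_path: Path, rom_entries) -> None:
    """Write a sha1sum-style file next to the .chd, with an extra comment
    line recording a hash of the game's Datfile details."""
    file_sha1 = hashlib.sha1(chd_path.read_bytes()).hexdigest()
    out = chd_path.with_suffix(".sha1")
    out.write_text(
        f"# verifydump-game-details: {game_details_hash(rom_entries)}\n"
        f"{file_sha1}  {chd_path.name}\n"
    )
```

On a re-run, recomputing `game_details_hash` from the current Datfile and comparing it with the comment line tells you whether the old verification still applies.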

i30817 commented 2 years ago

People will totally delete files they think are 'meaningless'. The only reason to have those files is the hope that, as people move and rename the originals, they also move and rename those files, and that's kind of a high expectation of users.

A .cache file is best if you want cross-platform. The problem is that these things are very fragile if the user modifies/moves/renames the file. Having to repeat a bit of work is not the worst possibility; the worst is the user replacing a file with a different file of the same name and the cache saying 'this is the checksum' when the file is now completely different.

I prefer to use the extended-attribute method so I don't need to track all that noise.

Of all the things Microsoft didn't steal from Unix, they chose not to steal two of the most useful (symbolic/hard links and extended attributes), so now all the 'cross-platform' people are condemned to rely on fallible or inefficient methods.

Personally, I just accept that the 'Windows version' will have no cache, because I'm a Linux-only guy and because I don't want to bother with complexity I don't use.

j68k commented 2 years ago

Yeah, really good points. I think it's best then if verifydump just does the caching of results behind the scenes, to keep things simple for users. If it can use that cache to speed things up, then great; and if it can't because things have moved around, it'll just be somewhat slower for that first verification after the move.

i30817 commented 2 years ago

Maybe you can store the file's 'date modified' to invalidate cache entries that have changed, even if the file still exists. More complexity, but from what I've seen of your program, you handle that well.

I also kind of recommend you use whole paths as keys rather than just filenames. It's less forgiving of moves, but users will often make duplicates in other paths and they're not always the 'same' file (they're extracting several versions of a game dump, for example).

j68k commented 2 years ago

Oh that's a nice idea. I think if the file path, size and modification time match what we verified already then that's a pretty strong signal that it's the same file. So it would be sensible to skip verifying the .chd SHA-1 in those cases, with an option to override and do the check anyway if the user wants. I'll give that a try. Thanks!
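The path-plus-size-plus-mtime fast path described above could be sketched as follows; the type and function names are invented for illustration:

```python
import os
from typing import NamedTuple

class CachedStat(NamedTuple):
    """What we remember about a file at verification time."""
    path: str
    size: int
    mtime_ns: int

def snapshot(path: str) -> CachedStat:
    """Capture the identifying stat details of a file."""
    st = os.stat(path)
    return CachedStat(os.path.abspath(path), st.st_size, st.st_mtime_ns)

def probably_unchanged(path: str, cached: CachedStat) -> bool:
    """True when path, size, and modification time all match the cached
    entry, so re-hashing the .chd can be skipped (unless the user passes
    an override flag to force the full SHA-1 check anyway)."""
    try:
        return snapshot(path) == cached
    except FileNotFoundError:
        return False
```

This only guards against normal file churn, not deliberate tampering — hence the value of keeping the full-check override.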

i30817 commented 2 years ago

Speaking of CHD, I'm tempted to do a zerg rush of bug reports against the emulators that claim to support the format, because they aren't supporting 'delta CHDs', which require loading the parent image to work.

MAME does this with software lists, but other emulators, which aren't so obsessive-compulsive about verifying every last dump they load, should at least have the option to make loading a delta CHD trigger a scan of (at least) the directory it's in to find the parent. Without this, delta CHDs are rather pointless in emulators, because you'd need users to specify the parent manually (no one would use delta CHDs that way).

Although I think this is not a problem just for verifying delta CHDs — I believe their data SHA-1 is still correct even when they don't contain the parent's data — you have the problem that you're not using the data SHA-1 but calculating the track hashes manually... so you should probably handle this case when you have the possibility.

Maybe it's best to wait for the fabled Python API for CHD, though (or make sure that when you write this, you consider this case).

j68k commented 2 years ago

Sorry for the slow reply. I have to admit I haven't looked into delta CHDs yet. Are they useful in the context of managing Redump collections? Can you do something fancy like have different regions of the same title use smaller delta files?

i30817 commented 2 years ago

They're more useful for older games, because modern virtual filesystems compress their data, and slight changes earlier in the stream affect the compressed bytes later in the stream.

In effect, delta CHDs function as a form of softpatch for CDs, and yes, they can be used to reduce CHD size if the problem in the first paragraph doesn't occur. For instance, imagine a game that spans two CDs only because the devs tried but couldn't manage to fit it all on one.

Most of the data is similar, so making the second CD image a delta CHD of the first could be effective.

It would also be useful for user deltas for romhacks, if there were a way to directly apply a romhack to a CHD and produce a delta — but there isn't; that'll have to be us peons doing it when a libchdr-like library for writing happens. It's a shame, because this is a hard process, as you can probably tell from the effort you made in your utility to divide the CHD output image. Such a utility would probably have to take a track number (or none) to know which track, if any, to apply the xdelta to.

Also note that libchdr (the library) has an extension for a theoretical 'more than one level' of delta CHDs, for cases where, for example, you want to do that CD1/CD2 trick but also want a romhack: the romhack of CD2 could have to fetch bytes from romhack->cd2->cd1. MAME didn't adopt this AFAIK, so it's dead in the water, because MAME is the only CHD writer for now. I'm really not sure about this; it's possible it just works out if you can create a delta of a delta with chdman.

This is also the only CD delta/softpatch format I know of. RetroArch absolutely refuses to load CDs/DVDs into memory and apply UPS/BPS/IPS patches to them, probably rightly, because those formats are not ready for the size of a CD/DVD image (or because RetroArch would simply crash if it tried to load 9 GB into memory on most devices).

Since delta CHD is part of the file-format spec and actually uses the filesystem, rather than just dumping DVDs into memory like all the other 'softpatching', it's likely they'll eventually be supported just from using the library.

But it requires actual effort in the emulators, because there is a 'scan' to be done and a decision of how far to scan to find the 'parent'. MAME uses ROM lists of its own making, which means it 'knows' which file is the parent, because when loading the delta it is also given the parent filename. After scanning for CHDs (in whatever way you want), you can identify the parent because the delta CHD stores the SHA-1 of its parent in the header (IIRC).

This is not done by the library as far as I know, because it requires interacting with the filesystem, and we all know how filesystems vary per platform (Android would get permission denied if it tried to use a delta CHD with permission for just that one file). It would be useful for users if this were consistent. Note that delta CHDs are slightly different from IPS/BPS/UPS files, because the 'softpatch' is itself a game, so it's better they don't share the same name with a different extension — they're different hacks/games — which means the normal way emulators recognize softpatches is not appropriate.
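The parent lookup described above could be sketched by reading the parent SHA-1 straight out of a CHD header. The field offsets below follow my reading of the CHD v5 on-disk layout (8-byte "MComprHD" magic, big-endian 32-bit length and version, then fixed-size fields ending in the raw, data, and parent SHA-1s); treat them as an assumption, not gospel:

```python
import struct

HEADER_V5_LEN = 124  # assumed size of a v5 header

def parent_sha1(chd_path: str):
    """Return the 20-byte parent SHA-1 from a CHD v5 header,
    or None if the parent field is all zeros (no parent)."""
    with open(chd_path, "rb") as f:
        hdr = f.read(HEADER_V5_LEN)
    if len(hdr) < HEADER_V5_LEN or hdr[:8] != b"MComprHD":
        raise ValueError("not a CHD file")
    (version,) = struct.unpack_from(">I", hdr, 12)
    if version != 5:
        raise ValueError(f"only v5 headers handled in this sketch (got v{version})")
    sha1 = hdr[104:124]  # assumed offset of the parent SHA-1 field
    return None if sha1 == bytes(20) else sha1
```

An emulator (or verifier) could then scan sibling files and match this value against each candidate's own SHA-1 field to find the parent, which is the directory scan the comment above is asking for.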

Z-95 commented 1 year ago

I've implemented this locally since I really wanted to speed up re-scans and don't mind sharing what I wrote.

What I did:

Added a new optional arg cache_file, a path to a file that holds cached verified-file details in a SQLite DB (lets the user put it wherever they want — I have a .meta dir per system that holds the dat, cue files if available, a script for running verifydump, and now the cache DB). If the arg is not supplied, the program runs as it does now; if supplied, see below.

For CHDs: if a CHD is successfully verified, store the file name, file size, and file modification time along with the cue_verification_result, the game name from the dat, and, for each ROM in that game in the dat, the ROM name, size, and hash. For re-scans, check whether the CHD file name exists in the cache and the file size and modification time match. If so, check whether the newly parsed dat has a game name matching the cached one, and check that game's ROM list for the correct number of ROMs, ROM names, sizes, and SHA-1s. If they all match, use the cached cue_verification_result and the game from the parsed dat, and skip the actual file analysis. If anything does not match, fully analyze the file as before.

For RVZs: if an RVZ is successfully verified, store the file name, file size, and file modification time along with the matching SHA-1 of the uncompressed file. For re-scans, check whether the RVZ file name exists in the cache and the file size and modification time match. If so, use the cached SHA-1 of the uncompressed file and skip the file analysis. If anything does not match, fully analyze the file as before.
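A SQLite-backed cache along these lines might be sketched as follows. The schema and column names are illustrative — this is not the code from the PR:

```python
import sqlite3

# Hypothetical schema: one row per verified file, keyed by name, with the
# stat details used to decide whether the cached result is still valid.
SCHEMA = """
CREATE TABLE IF NOT EXISTS verified (
    file_name TEXT PRIMARY KEY,
    file_size INTEGER NOT NULL,
    file_mtime_ns INTEGER NOT NULL,
    game_name TEXT NOT NULL,
    result TEXT NOT NULL
)
"""

def open_cache(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def cached_result(conn, file_name, file_size, file_mtime_ns):
    """Return the cached (game_name, result), but only if the file's size
    and modification time still match what was stored; else None."""
    return conn.execute(
        "SELECT game_name, result FROM verified "
        "WHERE file_name = ? AND file_size = ? AND file_mtime_ns = ?",
        (file_name, file_size, file_mtime_ns),
    ).fetchone()

def store_result(conn, file_name, file_size, file_mtime_ns, game_name, result):
    """Record a successful verification, replacing any stale entry."""
    conn.execute(
        "INSERT OR REPLACE INTO verified VALUES (?, ?, ?, ?, ?)",
        (file_name, file_size, file_mtime_ns, game_name, result),
    )
    conn.commit()
```

The real implementation additionally stores the per-ROM details from the dat so a changed Datfile invalidates the entry; that would just be one more table keyed on file_name.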

I can put this in a PR if you want — not saying you have to implement it exactly the same way; I just hit the use cases I could think of that I needed, but it could give you some ideas. For instance, I didn't add a mode to do a full verification with caching on, since you can just delete the cache DB and get the same results.

j68k commented 1 year ago

Thank you for doing that work @Z-95. I'm not sure if I'll implement it the exact same way (mainly I think I'd have the cache files in ~/.cache or similar so that users won't have to be aware of the cache files to benefit from them), but it would be great to be able to refer to your work and use parts of it so if you'd like to fork the project on GitHub and publish your changes there then that would be very welcome. Thanks again!

Z-95 commented 1 year ago

Sounds good! I'll actually submit a PR from my fork as well so you or others can easily reference. I'll also set it so you have modification access so you can change it if you want or you can delete the PR if it rots or is no longer needed.

j68k commented 1 year ago

Perfect, thank you!