mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

sqlite3 by default uses insufficiently unique entries that can lead to skipped file downloads #5746

orion486 opened 1 week ago

orion486 commented 1 week ago

The Issue

I am not sure if this can be called a bug, but it is behavior that might not produce the intended results. If extractor.*.skip is true, some files with multiple revisions, such as those from kemonoparty and coomerparty, will not be downloaded while extractor.*.archive-format is set to its default of "{service}_{user}_{id}_{num}" (the currently active value can be checked with the -E option).

How To Reproduce

For the following URL, we extract the post's metadata as JSON using:

```sh
gallery-dl -s -j https://coomer.su/fansly/user/307507152082186240/post/577611859612409857
```

Under the conditions described above, the object entries with "filename": "577611769514565632_preview" and "filename": "577608964548603905" are both assigned "num": 1. As a result, only the first of these files in the download order is downloaded; the second is skipped, because both files produce the identical sqlite3 archive entry coomerpartyfansly_307507152082186240_577611859612409857_1 despite having different filenames.
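The collision can be confirmed in the archive database itself. A minimal check, assuming the archive file is named archive.sqlite3 and the single archive(entry) table layout that gallery-dl's download archive uses:

```sh
sqlite3 archive.sqlite3 \
    "SELECT entry FROM archive WHERE entry LIKE '%577611859612409857%';"
# -> coomerpartyfansly_307507152082186240_577611859612409857_1
```

Only one row exists for num 1 of this post even though two distinct files were assigned that value; whichever file is processed second finds the entry already present and is skipped.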

Workarounds

1) Change the default setting of extractor.*.archive-format to something more unique, like "{service}_{user}_{id}_{filename}_{extension}_{num}" (see the config sketch below).
2) Set extractor.*.skip to false (which should have the same(?) effect as using the --no-skip option). This will download everything again, so it is not the best solution.

The first option breaks compatibility with entries already recorded in the sqlite3 archive: existing entries no longer match the new format, so previously downloaded files would be fetched again once. Still, if this behavior is indeed unintended, the first option is probably the best solution.
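For reference, a minimal config sketch for the first workaround, scoped to the two affected extractors rather than to extractor.* so that archive keys for every other site stay untouched:

```json
{
    "extractor": {
        "kemonoparty": {
            "archive-format": "{service}_{user}_{id}_{filename}_{extension}_{num}"
        },
        "coomerparty": {
            "archive-format": "{service}_{user}_{id}_{filename}_{extension}_{num}"
        }
    }
}
```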

Other URLs Also Affected

komoreshi commented 1 week ago

This doesn't really address the default-config issue, since it depends on the extractor, but with kemono/coomer the API returns file hashes (IIRC SHA-256 is used), which can serve as a more specific archive key to ensure duplicates aren't downloaded, like so: "archive-format": "{subcategory}_{user}_{id}_{num}_{hash}"
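Before relying on {hash}, it is worth confirming that the extractor actually exposes that field. gallery-dl's -K/--list-keywords option prints the available keywords and example values for a URL; for these extractors the output should include a hash entry if one is provided:

```sh
gallery-dl -K https://coomer.su/fansly/user/307507152082186240/post/577611859612409857
```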

orion486 commented 1 week ago

Yes, I originally thought it might affect more sites, but given that these two websites seem to be fairly unique in how they provide multiple revisions of a download target, this issue is probably better addressed on a per-website/extractor basis. I am not sure whether other websites use a similar revision system, but if they do, a similar solution could be applied to their extractors, depending on the metadata that can be extracted.

And I agree, the file hash would be a much better solution for this extractor, ensuring no shared entries in the sqlite3 archive. I'll make a new PR.

a84r7a3rga76fg commented 1 week ago

```json
"filename": "{hash}.{extension}",
"archive-format": "{subcategory}_{user}_{id}_{hash}"
```

With these you'll only download unique files. Use Kemono's API to sort the files afterwards. There is literally no point in trying to sort files while downloading from Kemono because of how they handle revisions.
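In a config file, those two options would sit under the relevant extractor key, e.g. (a sketch, shown for coomerparty; a kemonoparty block would be analogous):

```json
{
    "extractor": {
        "coomerparty": {
            "filename": "{hash}.{extension}",
            "archive-format": "{subcategory}_{user}_{id}_{hash}"
        }
    }
}
```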