mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

Guide: Options and recommendations for refreshing existing downloads #3944

Open gaiking-uk opened 1 year ago

gaiking-uk commented 1 year ago

Hi @mikf,

Following up on a few earlier comments, this post covers the topic of using gallery-dl to refresh previously downloaded folders and filtering which files are downloaded, with the intention of clarifying your recommended approach and covering some additional considerations and implementation notes -- hopefully creating a 'mini guide' for the benefit of anyone else looking to implement this functionality in the future...

Context

I have been using gallery-dl for the past 2 years or so, initially mostly to do a "one-off download" of a gallery but more recently, to do a "refresh" of a previously-downloaded gallery (i.e. download any new files uploaded since my last download). To date, I've used gallery-dl to download 10,000+ files, across 50+ galleries.

In terms of previously-downloaded galleries, I would broadly categorise them into 2 groups:

  1. COMPLETE -- All previously downloaded files still exist with their original names and formats, so re-running gallery-dl without any additional parameters should generally work fine, as the built-in skip rules handle this scenario by default
  2. INCOMPLETE / MODIFIED -- After the original download, some files were removed or renamed (e.g. duplicate files were deleted, or .webp / .png images were converted to .jpg), so the original .png / .webp files no longer exist and won't be detected as existing files

OPTION 1: Use parameter: --filter "date >= datetime(2023, 4, 15) or abort()"

As discussed in my addendum to https://github.com/mikf/gallery-dl/pull/3284, one idea I had was to add an enhanced --filter parameter to the gallery-dl command, allowing it to filter files by date and also tell gallery-dl to stop running when reaching this point of time in the feed.

✅ This option is great for sites like instagram.com where you want to minimise the number of HTTP requests as much as possible
❌ However, readers should note that implementing this would require some additional work, such as:
  [1] setting up and maintaining a list of 'last downloaded dates' for each folder
  [2] dynamically generating and including a --filter "date >= datetime(Y, m, D, H, M, S) or abort()" parameter
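Point [2] above can be scripted. The sketch below builds the --filter argument from a per-gallery "last downloaded" timestamp; only the `date >= datetime(...) or abort()` expression syntax comes from gallery-dl, while the idea of storing the timestamp as a `datetime` value is a hypothetical choice for illustration:

```python
from datetime import datetime

def build_filter(last_download: datetime) -> str:
    """Build a gallery-dl --filter expression that downloads files dated
    on or after last_download and aborts once older files are reached."""
    d = last_download
    return (f"date >= datetime({d.year}, {d.month}, {d.day}, "
            f"{d.hour}, {d.minute}, {d.second}) or abort()")

# Example: the last refresh of this gallery was on 2023-04-15
arg = build_filter(datetime(2023, 4, 15))
print(arg)
# Then invoke e.g.:  gallery-dl --filter "<arg>" <gallery URL>
```

A wrapper script could keep one timestamp per gallery folder and regenerate this argument on each run.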


OPTION 2: Use config.json: extractor.archive

From earlier discussions, I understand that this is your recommended option (of course, please correct me if not) but assuming so, the rest of this post will focus on implementing the extractor.archive feature...

Step 1: config.json modification

The 'extractor.archive' node needs to be populated in config.json. As I plan to use the archive feature for all galleries, I'll populate the root extractor node and so add the following lines to my config.json file (please let me know if I've made any mistakes)...

// config.json
// -----------

{
    "extractor": {
        "#":"If set, a SQLite3 DB is created and used to store a list of all files downloaded, any file listed in the archive will be ignored and gallery-dl will not try to download it again. This process is automatic and does not require any additional options in the included in the gallery-dl command.",
        "#":"Assumption: paths are relative to the 'extractor.base-directory'",
        "archive": "./gallery-dl master download archive.sqlite3"
    }
}

Step 2: Populating the archive

NOTE: This step is not required and gallery-dl will work fine without it (especially if you have not downloaded anything with gallery-dl previously). It is an additional quality-improvement step to try and minimise potential file duplication and/or unnecessary downloads.

As mentioned above, before I implemented the archive feature, I had already downloaded several thousand images across a few dozen galleries/folders. Ideally, I want to inform gallery-dl about these files (populate the archive rather than have it build a blank one) to avoid gallery-dl doing things like re-downloading duplicate images that I have already removed from my local copies.

Note: Archive files that do not already exist get generated automatically. Source: configuration.rst#extractorarchive

Great, so referring back to the two categories I mentioned earlier ('complete' and 'incomplete/modified'), my understanding is as follows...

COMPLETE galleries

These will be handled fine: gallery-dl will do a 'full scan', detect and skip any existing files, and (presumably) also record all the files in the archive, so that even if some files are deleted locally in the future, gallery-dl won't try to download them again.

INCOMPLETE/MODIFIED galleries

These ones may require some extra consideration. Imagine the scenario: a gallery of 1,000 images was downloaded in full, 500 of them were then manually deleted locally as duplicates, and 100 new images have since been uploaded to the site.

I'm conscious this post is already LONG, so to summarise: re-running gallery-dl right now would not be good! It would in effect download not only the 100 new images, but also the 'missing' 500 duplicates that I manually removed.

To try to avoid this, I considered how to populate the archive with the existing images, and would appreciate your advice...

  1. Add the --no-download parameter -- I could try this in combination with a reverse date filter (i.e. only looking at images from the original download date or earlier); would this let gallery-dl build the archive of the 1,000 original images without downloading anything?
  2. Move aside, run, move back -- Alternatively (assuming the destination folder path matters), I considered moving the whole folder to a temp location, running gallery-dl and letting it download all 1,100 images (importantly, also building the archive), manually moving the 100 new images to the temp folder, then deleting the fresh download and moving the gallery back to its original location
  3. Manually adding DB entries -- I'm guessing this would not normally be recommended, but assuming the entries in the SQLite DB are human-readable (i.e. not hashes or base64, etc), could the entries be added manually, avoiding the need to re-run gallery-dl on dozens of folders, or re-download thousands of files?
  4. Any other/better options? -- If there's another/better way of doing this that I've missed, please let me know!
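For anyone attempting option 3, the archive database has a very simple layout: a single table named archive with one TEXT primary-key column named entry (this matches gallery-dl's source at the time of writing, but verify against your installed version). The entry strings below are hypothetical examples; the real format depends on each extractor's default archive_fmt. A sketch of populating it by hand:

```python
import sqlite3

# Open (or create) the archive file named in extractor.archive
con = sqlite3.connect("gallery-dl master download archive.sqlite3")

# Same schema gallery-dl itself creates (verify for your version)
con.execute(
    "CREATE TABLE IF NOT EXISTS archive (entry PRIMARY KEY) WITHOUT ROWID")

# Hypothetical entries: typically the extractor name followed by the
# file's site-assigned ID (e.g. a tweet ID plus image index)
entries = [("twitter1234567890123456789_1",), ("pixiv98765432_p0",)]
con.executemany("INSERT OR IGNORE INTO archive (entry) VALUES (?)", entries)
con.commit()

# Print what was stored, to double-check
for (entry,) in con.execute("SELECT entry FROM archive ORDER BY entry"):
    print(entry)
con.close()
```

INSERT OR IGNORE makes the script safe to re-run, since duplicate entries are simply skipped.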

CONSIDERATIONS

Lastly, I wanted to briefly include any other considerations that future readers should bear in mind.

The main one I can think of is how the file references are stored in the SQLite3 DB, i.e. whether it records the source URL, the path to the downloaded file, etc. The reason for asking is to understand whether something like renaming the gallery folder would invalidate the entries in the archive, causing gallery-dl to try to download the files again.

I recognise that you are the master in this area, so if there is anything else you want to add, or any comments/corrections to the above you want to make, please feel free!

Once again, thanks in advance for your help! 👍🏼

sweetbbak commented 1 month ago

did you ever figure this out? I'm in the exact same position right now. I have like 15,000 pictures archived from twitter and pixiv, none of them are properly in a database, and I want to update/refresh with the newly posted images lol

Hrxn commented 1 month ago

Since the archive database entries are basically just {extractorname} {id} by default, you are fine as long as you still have access to the uniquely identifying IDs, e.g. as part of the filename of each downloaded file.

Hrxn commented 1 month ago

Manually adding DB entries -- I'm guessing this would not normally be recommended but assuming the entries in the SQLite DB are human-readable (i.e. aren't hashes or converted to base64, etc), could the entries be added manually, and so avoiding the need to re-run gallery-dl on dozens of folders, or re-download thousands of files?

Yes, the entries are basically human-readable (since they're just the IDs as used by the site), so they can be added manually, or modified manually, and so on.
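Before adding or editing rows by hand, it's worth dumping a few existing entries to confirm the exact format your archive uses. A minimal sketch (the filename is an assumption here; use whatever your extractor.archive setting points at, and note the table/column names match current gallery-dl source but should be verified for your version):

```python
import sqlite3

# Open the configured download archive (filename assumed for illustration)
con = sqlite3.connect("gallery-dl master download archive.sqlite3")

# Create the table if the file is brand new, so the query below
# works even against an empty archive
con.execute(
    "CREATE TABLE IF NOT EXISTS archive (entry PRIMARY KEY) WITHOUT ROWID")

# Show a handful of entries to inspect the ID format in use
sample = [row[0] for row in
          con.execute("SELECT entry FROM archive LIMIT 10")]
print(sample)
con.close()
```

Matching the format shown here exactly (extractor prefix, separators, index suffixes) is what makes hand-added entries effective.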

[..] I considered [assuming the destination folder path was important] moving the whole folder to a temp location, running gallery-dl and letting it download all 1,100 images (and also importantly building the archive), manually move the new 100 images to the temp folder, then deleting the new gallery and moving the gallery back to it's original folder location

Not sure if I can really follow with the moving back-and-forth you seem to imply here, but the destination folder path etc. is not important when using the archive feature, that is the entire point.

sweetbbak commented 1 month ago

Thanks for the reply. It seems to be in the default format. Lastly, is the only way to update an archive to save the original URL and re-run gallery-dl with it? Or is there perhaps a way to construct that URL from the path ./gallery-dl/twitter/vinneart, or have it saved in the sqlite3 database, so that gallery-dl can automatically pull in the new content?

I guess I could write a little script that uses the path to reconstruct the URL (the above example would become https://www.twitter.com/vinneart/media) but I fear that it isn't reliable.