gaiking-uk opened this issue 1 year ago
Did you ever figure this out? I'm in the exact same position right now. I have like 15,000 pictures archived from Twitter and Pixiv, none of them are properly in a database, and I want to update/refresh with the newly posted images lol
Since the archive database entries are basically just `{extractorname} {id}` by default, you are fine as long as you still have access to the uniquely identifying IDs, e.g. as part of the filename of each downloaded file.
Manually adding DB entries -- I'm guessing this would not normally be recommended, but assuming the entries in the SQLite DB are human-readable (i.e. aren't hashes, base64-encoded, etc.), could the entries be added manually, avoiding the need to re-run gallery-dl on dozens of folders or re-download thousands of files?
Yes, the entries are basically human-readable (since they're just the IDs as used by the site), so they can be added manually, modified manually, and so on.
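To make that concrete, manual additions could be scripted along the following lines. This is a sketch, not an official procedure: it assumes the archive is a plain SQLite file with a single `archive` table holding one `entry` column, and the example keys assume an `{extractorname}{id}`-style format; verify your own archive's actual schema and key format (e.g. with the `sqlite3` CLI) before trusting either assumption.

```python
import sqlite3

def add_entries(db_path, entries):
    """Insert archive keys by hand. Assumes a single "archive" table with
    one TEXT "entry" column -- check your archive's real schema first."""
    con = sqlite3.connect(db_path)
    with con:  # commits on success
        con.execute("CREATE TABLE IF NOT EXISTS archive (entry TEXT PRIMARY KEY)")
        con.executemany(
            "INSERT OR IGNORE INTO archive (entry) VALUES (?)",
            [(e,) for e in entries],
        )
    con.close()

# Example keys in an "{extractorname}{id}"-style format; the exact format
# depends on the extractor and your archive-format setting.
add_entries("archive.sqlite3", ["twitter1234567890", "pixiv987654"])
```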
[..] I considered [assuming the destination folder path was important] moving the whole folder to a temp location, running gallery-dl and letting it download all 1,100 images (and, importantly, building the archive), manually moving the 100 new images to the temp folder, then deleting the new gallery and moving the gallery back to its original folder location
Not sure if I can really follow the moving back-and-forth you seem to imply here, but the destination folder path etc. is not important when using the archive feature; that is the entire point.
Thanks for the reply. It seems to be in the default format. Lastly, is the only way to update an archive to save the original URL and re-run gallery-dl with it? Or is there perhaps a way to construct that URL from the path ./gallery-dl/twitter/vinneart, or have it saved in the sqlite3 database so that gallery-dl can automatically pull in the new content?

I guess I could write a little script that uses the path to reconstruct the URL (the above example would become https://www.twitter.com/vinneart/media), but I fear that it isn't reliable.
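For what it's worth, that "little script" could look roughly like this. The site-to-URL patterns are invented examples, and they are exactly where the unreliability lives: any site whose gallery URLs don't follow a simple per-user pattern would need its own special-casing.

```python
from pathlib import Path

# Sketch: rebuild a likely source URL from a download path such as
# ./gallery-dl/twitter/vinneart. The patterns below are illustrative only.
URL_PATTERNS = {
    "twitter": "https://www.twitter.com/{name}/media",
    # pixiv is a known awkward case: its user URLs want a numeric ID,
    # which the folder name may not contain
    "pixiv": "https://www.pixiv.net/users/{name}",
}

def reconstruct_url(folder):
    site, name = Path(folder).parts[-2:]
    pattern = URL_PATTERNS.get(site)
    return pattern.format(name=name) if pattern else None

print(reconstruct_url("./gallery-dl/twitter/vinneart"))
# -> https://www.twitter.com/vinneart/media
```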
Hi @mikf,
Following up on a few earlier comments, this post covers the topic of "using gallery-dl to refresh previously downloaded folders / filtering downloaded files", with the intention of clarifying your recommended approach and covering some additional considerations and implementation notes -- to hopefully create a 'mini guide' for the benefit of anyone else looking to implement this functionality in the future...
Context
I have been using `gallery-dl` for the past 2 years or so, initially mostly to do a "one-off download" of a gallery but more recently to do a "refresh" of a previously-downloaded gallery (i.e. download any new files uploaded since my last download). To date, I've used `gallery-dl` to download 10,000+ files across 50+ galleries.

In terms of previously-downloaded galleries, I would broadly categorise them into 2 groups:

[1] COMPLETE galleries -- re-running `gallery-dl` without any additional parameters should generally work fine, as the in-built skip rules should handle this scenario by default.
[2] INCOMPLETE/MODIFIED galleries -- e.g. where `.webp` or `.png` images were converted to `.jpg` etc. (and so the original `.png`/`.webp` files don't exist and won't be detected as existing files).

OPTION 1: Use parameter: `--filter "date >= datetime(2023, 4, 15) or abort()"`
As discussed in my addendum to https://github.com/mikf/gallery-dl/pull/3284, one idea I had was to add an enhanced `--filter` parameter to the `gallery-dl` command, allowing it to filter files by date and also tell gallery-dl to stop running when reaching this point in time in the feed.

✅ This option is great for sites like instagram.com where you want to minimise the number of HTTP requests as much as possible
❌ However, readers should note that implementing this would require some additional work, such as:
[1] setting up and maintaining a list of 'last downloaded dates' for each folder
[2] dynamically generating and including a `--filter "date >= datetime(Y, m, D, H, M, S) or abort()"` parameter

OPTION 2: Use config.json: `extractor.archive`
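Before getting into OPTION 2, here is a rough sketch of what the OPTION 1 bookkeeping could look like: a small JSON state file (the file name and layout are invented for illustration) mapping each gallery folder to its last download time, plus a helper that turns that into the corresponding `--filter` argument.

```python
import json
from datetime import datetime
from pathlib import Path

# Hypothetical state file recording each folder's last successful run.
STATE_FILE = Path("last-downloaded.json")

def build_filter(folder):
    """Build the --filter argument for a folder from its recorded date
    (falling back to the epoch if the folder has never been recorded)."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    d = datetime.fromisoformat(state.get(folder, "1970-01-01T00:00:00"))
    return ('--filter "date >= datetime(%d, %d, %d, %d, %d, %d) or abort()"'
            % (d.year, d.month, d.day, d.hour, d.minute, d.second))

def record_run(folder):
    """Store 'now' as the folder's last download time."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state[folder] = datetime.now().isoformat(timespec="seconds")
    STATE_FILE.write_text(json.dumps(state, indent=2))

print(build_filter("twitter/vinneart"))
```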
From earlier discussions, I understand that this is your recommended option (of course, please correct me if not), but assuming so, the rest of this post will be on implementing the `extractor.archive` feature...

Step 1: `config.json` modification

The `extractor.archive` node needs to be populated in `config.json`. As I plan to use the archive feature for all galleries, I'll populate the root extractor node and so add the following lines to my `config.json` file (please let me know if I've made any mistakes)...

Step 2: Populating the archive
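For reference, the Step 1 lines could be as minimal as the following sketch; the archive path here is just an example, and any writable location should do, since (as noted above) the archive is independent of the download folders:

```json
{
    "extractor": {
        "archive": "~/gallery-dl/archive.sqlite3"
    }
}
```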
NOTE: This step is not required and gallery-dl will work fine without it (especially if you have not downloaded anything with gallery-dl previously). It is an additional quality-improvement step to try and minimise potential file duplication and/or unnecessary downloads.
As mentioned above, before I implemented the archive feature, I had already downloaded several thousand images across a few dozen galleries/folders. Ideally, I want to inform gallery-dl about these files (i.e. populate the archive rather than have it start from a blank one) to avoid gallery-dl doing things like re-downloading duplicate images that have already been removed from the gallery.
Great, so referring to the two categories I mentioned earlier ('complete' and 'incomplete/modified'), my understanding is as follows...
COMPLETE galleries

These will be handled fine: `gallery-dl` will do a 'full scan', detect and skip any existing files, and (presumably) also record all the files in the archive, so that even if some files are deleted in the future, `gallery-dl` won't try to re-download them.

INCOMPLETE/MODIFIED galleries
These ones may require some extra consideration. Imagine the scenario: `gallery-dl` was run and 1,000 images were downloaded (without an archive); I then manually removed 500 duplicates, and 100 new images have since been posted to the gallery.

I'm conscious this post is already LONG, so to try and summarise: re-running `gallery-dl` right now would not be good, and would in effect download not only the 100 new images but also the 'missing' 500 duplicates that I manually removed.

To try and avoid this, I considered how to populate the archive with the existing images and would appreciate your advice...
[1] `--no-download` parameter -- I could try this (in combination with a reverse date filter, i.e. only looking at images from the download date or earlier); would this let `gallery-dl` build the archive of the 1,000 original images?
[2] Moving folders -- moving the whole folder to a temp location, running `gallery-dl` and letting it download all 1,100 images (and, importantly, building the archive), manually moving the 100 new images to the temp folder, then deleting the new gallery and moving the gallery back to its original folder location.
[3] Manually adding DB entries -- could the entries be added manually, avoiding the need to re-run `gallery-dl` on dozens of folders, or re-download thousands of files?

CONSIDERATIONS
Lastly, I wanted to briefly include any other considerations that future readers should bear in mind.
The main one I can think of is how the file references are stored in the SQLite3 DB, i.e. whether it records links to, say, the source URL or the path to the downloaded file, etc. The reason for asking is to understand whether something like renaming the gallery folder would invalidate the links in the archive, such that `gallery-dl` would try to download the files again.

I recognise that you are the master in this area, so if there is anything else you want to add, or any comments/corrections to the above, please feel free!
Once again, thanks in advance for your help! 👍🏼