Make incremental downloading of new images easier (scraping)

trip54654 commented 6 months ago

I'm running gallery-dl periodically on a list of twitter accounts (and other sites) to download new images. Doing this without wasting a lot of bandwidth (and getting throttled or blocked earlier) is pretty tricky, because gallery-dl doesn't seem to have a native mechanism to support this.

I'm facing the following problems with the following hacks to work them around:

It wants to download all files every time. To avoid having to keep the files in gallery-dl's download directory forever, I'm replacing the downloaded files with 0-sized dummy files. gallery-dl thankfully simply skips them (it thinks they're already downloaded).
To limit how far back the it goes, I use something like --filter (date >= datetime.fromtimestamp(x)) (x replaced with the current UTC time minus the time range, about a week).
One some sites, date is the wrong field. Kemono has an added field in a different format, which I need to special-case and parse. On Kemono, date is the original publication date on the source site, while added is the Kemono upload date which is the correct field for this purpose.
On sites which (probably) give you the images sorted by date, I add or abort() to stop network access completely.
Retweets break the sorted by date assumption so I add or (user["name"] != author["name"]) to the filter.
Pinned tweets break break the sorted by date assumption so I add -o extractor.twitter.pinned=false to the command line.
I add --write-metadata to the command line to recover some information, like actual author for retweets.
After running gallery-dl like this, my script iterates the gallery-dl directory, looks for new images, restores the native filename (filename field from the metadata files), adds the author name to it, moves them to my output directory, and creates a dummy file to prevent gallery-dl from downloading it again.

Shouldn't this be easier? At least making gallery-dl fetch only new images could probably be a builtin feature, instead of having to mess with site-specific filter hacks.

mikf commented 6 months ago

  -A, --abort N               Stop current extractor run after N consecutive
                              file downloads were skipped

trip54654 commented 6 months ago

  -A, --abort N               Stop current extractor run after N consecutive
                              file downloads were skipped

I saw this but I think it's not adequate for this job. It could fail under weird circumstances, like when not downloading retweets and the user retweeted a large number of images.

taskhawk commented 6 months ago

It wants to download all files every time. To avoid having to keep the files in gallery-dl's download directory forever, I'm replacing the downloaded files with 0-sized dummy files. gallery-dl thankfully simply skips them (it thinks they're already downloaded).

Use a download archive instead, --download-archive.

One some sites, date is the wrong field. Kemono has an added field in a different format, which I need to special-case and parse. On Kemono, date is the original publication date on the source site, while added is the Kemono upload date which is the correct field for this purpose.

Use a metadata postprocessor that parses the string and overwrites the original date field instead.

{
    "fix-kemono-date":
    {
      "name": "metadata",
      "mode": "modify",
      "event": "prepare",  
      "fields": {
        "date": "{added[:20]:RT/ /}"
      }
    }
}

This should leave it in the same format date is, although I haven't tested it.

On sites which (probably) give you the images sorted by date, I add or abort() to stop network access completely.

I didn't get this one.

I add --write-metadata to the command line to recover some information, like actual author for retweets.

You can probably use a metadata postprocessor here too to make available most of that metadata as fields so you don't have to go extracting it later.

After running gallery-dl like this, my script iterates the gallery-dl directory, looks for new images, restores the native filename (filename field from the metadata files), adds the author name to it, moves them to my output directory, and creates a dummy file to prevent gallery-dl from downloading it again.

You can use an exec postprocessor for the after event so the script only fires after a media file has been downloaded. Although if you used another postprocessor for the metadata you could have the original filename and author ready to use for the filename config option and be placed directly in your output directory. If you used the archive file the dummy file is no longer needed.

I haven't messed with retweets so I have no clue with those.

Shouldn't this be easier? At least making gallery-dl fetch only new images could probably be a builtin feature, instead of having to mess with site-specific filter hacks.

The skip option already covers that.

trip54654 commented 6 months ago

To summarize what my script does compared to native gallery-dl features:

Dummy files instead of --download-archive. The difference is that I can delete dummy files that are too old, while the --download-file archive will grow forever. That's a small problem, I just didn't know this feature before.
An upload date based --abort. It doesn't stop after a number of skipped (presumably already downloaded) files. Instead it stops when a post is outside of a time window (like 1 week).

I'm not convinced that --abort is the correct thing. But maybe I'm overthinking this? Many of these things in my weird script only exist because some sites and extractors broke assumptions or acted weird.

In theory, --abort 1 will make it stop downloading as soon as it gets to a post that was already downloaded and seen before, right? So in theory, it should download all new files. But this doesn't work if posts are somehow not sorted by upload time, if the target website somehow hides posts and makes them re-appear again later, or when --filter hacks need to skip posts for whatever reasons.

What is the purpose of --abort arguments higher than 1? How does it make sense to use such a value and how do you choose it?

A date based approach also makes sense because you could stop fetching at text-only tweets. Unfortunately this doesn't work with --filter based date checks, but it also doesn't work with --abort. Imagine a user who makes a lot of text tweets, which you would have to fetch every time when checking for new images, because you can stop fetching only at images.

In the rest of the post I'm replying with details, you can skip it if it's too much.

Use a download archive instead, --download-archive.

This is good feature I wasn't aware of yet. But it lacks one thing my script does: it deletes files older than the time window I'm setting with the date filter. The --download-archive file only stores the filenames, no date, so old entries can't be removed. If it saved a date, you could prune old entries with a SQL statement. Maybe a download date column could be added to it?

(I've started my script 4 years ago. I have almost 10000 files saved with it, and these are only the files I manually selected to keep. Yes, the list of files is going to build up.)

Use a metadata postprocessor that parses the string and overwrites the original date field instead.

Interesting, but still a bit clumsy. The hardest part is checking that the date format is correctly parsed.

What I actually want is the following:

It should check all posts within a time window (like all posts up to 1 week ago)
Exit if when it's after the time window to avoid further network accesses (but only if the posts are sorted by date)

In this case, I don't care about the actual post date. But in general it would be awesome if all extractors had an "posted or uploaded at" metadata field that worked the same.

I didn't get this one.

Calling abort() in a --filter is how you make gallery-dl stop downloading and exit. So the idea is that you make it stop downloading as soon as the date is before the time window. Within the time window, it's made to check all posts for robustness.

Some sites don't seem to sort posts by date in all cases. Twitter normally does, except for retweets or pinned posts, so I added exceptions for these cases. (I want the retweets for some users.) For other sites, I gave up trying to use abort(), I don't remember what makes gallery-dl actually stop going through the entire post history.

You can use an exec postprocessor for the after event so the script only fires after a media file has been downloaded.

If I don't need dummy files, I could just copy the entire gallery-dl output directory. The --write-metadata feature is really great and I'm fine with using it. (The only problem is that the metadata sometimes is inconsistent across extractors or too incomplete.)

I haven't messed with retweets so I have no clue with those.

Retweets are like linking an older post made by another user. There are two problems with retweets:

The author can be different and you want to know who the author is (solved by using the metadata file).
Posts are not sorted by the date metadata field anymore. This breaks trying to use the date to stop network accesses.

The skip option already covers that.

I assume you mean --abort?

taskhawk commented 6 months ago

In theory, --abort 1 will make it stop downloading as soon as it gets to a post that was already downloaded and seen before, right? So in theory, it should download all new files. But this doesn't work if posts are somehow not sorted by upload time, if the target website somehow hides posts and makes them re-appear again later, or when --filter hacks need to skip posts for whatever reasons.

Not much you can do about that, as that is a limitation of the site itself. This happens to me with booru and booru-like sites, where files can be tagged with the tags I'm interested in at any moment after I made a full run, and having just abort in future runs would miss those files so I need to make runs with -o skip=true to make sure I get everything because the file could be really old, nothing I can do about it. If the site offers enough data in their results pages most requests are just pagination so it isn't as bad.

What is the purpose of --abort arguments higher than 1? How does it make sense to use such a value and how do you choose it?

Is a middle ground option between aborting right away and continuing until the end, by giving it some headroom to grab new files that may have become available since the last run. If you adjust the abort number considering the number of results for each "page", for example, you're basically saying "look for new files in the last n pages and if nothing new abort, if you find something new also keep looking for n pages more".

I used to use it with Pixiv, but now I just do a normal abort run most days to get new stuff right away and only once in a while I do a complete run to check everything with -o skip=true.

A date based approach also makes sense because you could stop fetching at text-only tweets. Unfortunately this doesn't work with --filter based date checks, but it also doesn't work with --abort. Imagine a user who makes a lot of text tweets, which you would have to fetch every time when checking for new images, because you can stop fetching only at images.

It's as I said above, not much you can do about it if that's what the site offers. With Twitter however you can use the https://twitter.com/user/media URL and/or -o strategy="media" (not sure if they're redundant) to only download the media uploaded by the user without going through text tweets and retweets. For the users that you want their retweeted media too you would have to deal with the normal timeline.

Because I don't deal with text tweets I'm not sure if it's possible but I feel like you should be able to record an entry for them in the archive so gallery-dl knows to stop at the last processed tweet.

At least for Twitter I think you should move away from using dates to stop processing.

This is good feature I wasn't aware of yet. But it lacks one thing my script does: it deletes files older than the time window I'm setting with the date filter. The --download-archive file only stores the filenames, no date, so old entries can't be removed. If it saved a date, you could prune old entries with a SQL statement. Maybe a download date column could be added to it?

That's true, but it's not as bad. My largest archive has 2,247,980 entries and has a filesize of only 163.2 MiB. That's peanuts compared to the 2.1 TiB of corresponding downloaded media. My Twitter archive has 1,589,556 entries and it's only 108.3 MiB against ~844 GiB of media. I don't notice enough of a performance hit either.

You can modify the format of the entry in the archive file with extractor.*.archive-format to include a date or timestamp if you want so you can process it later to delete older entries.

https://gdl-org.github.io/docs/configuration.html

In this case, I don't care about the actual post date. But in general it would be awesome if all extractors had an "posted or uploaded at" metadata field that worked the same.

Yeah, a bit more uniformity on the dates among extractors would be better. That could extend to other shared data fields among extractors too.

I have dealt with that by normalizing the metadata myself with postprocessors (preprocessors? most are triggered before downloading the file).

I assume you mean --abort?

No, the extractor.*.skip option. Can be placed in the configuration file or included in the command with -o skip=value. Check:

https://gdl-org.github.io/docs/configuration.html

You should check it out for more Twitter options, if you didn't know about the archive files, you probably don't know about other Twitter options that could be helpful to you.

trip54654 commented 6 months ago

Very interesting. Unfortunately things seem to be getting more complicated than I thought, instead of easier.

It looks like Writing a script which just periodically fetches new images from an account remains a complicated tasks, at least if you want retweets and try to reduce bandwidth.

trip54654 commented 6 months ago

I just remembered one important case that sucks to use -A with: when the download gets interrupted. You can have downloaded a lot of new files, then crash while but still missing some new files. With -A these files will be missed. (Unless the value passed to it is unreasonably large.) Another argument for making it date based.

reciema commented 4 months ago

Hey OP, hit me up if you ever finished writing that script, I'm trying to do the same thing here and also got stuck at only scraping new images! Lol

trip54654 commented 4 months ago

My script is way too weird and special-cased. It probably doesn't even solve the problem with scraping only new images. It would be better to come up with a way to make this easier to do with gallery-dl directly.

reciema commented 4 months ago

It's crazy that neither gallery-dl itself by default or any of its forks can do such a supposedly simple task... Have you found any alternative ?

trip54654 commented 4 months ago

No. gallery-dl supports many sites, and I'm using some of them (not just twitter). Also coming up with good method to do this isn't as simple as it seemed at first.

Though when nitter still worked, I had my own code which processed its html.

Twi-Hard commented 4 months ago

I made a wrapper script that handles each site I download form differently. For twitter, it checks what the largest twitter id (the most recent one) from my local copy of the user using the username from the url to find the folder then uses that id to add this argument to the command before running it: --filter "tweet_id >= $latest_tweet_id or abort()". That makes it stop at the most recently downloaded twitter id. I have it set to "greater than or equal to" tweet_id but you can just make it just "greater than" to avoid downloading that one file.

Edit: It's better to use "author['name'].lower() == '$handle' and tweet_id >= $latest_tweet_id or abort()" so retweets and stuff don't cause a premature stop. (I set author['name'] to lowercase because I already made the handle lowercase in the script) Edit 2: I've had it commented out because I want to make sure I get all the tweets which it doesn't Edit 3: I really should have said anything :/

trip54654 commented 4 months ago

That makes it stop at the most recently downloaded twitter id.

I suspect this won't work too well. gallery-dl downloads from newest to oldest post. So if it gets interrupted and only some of the new posts got downloaded, you'll miss some posts. (Depending on how you call gallery-dl again.) It's even worse if you want retweets and sticky posts.

This could be improved by

Communicating to the caller whether the latest_tweet_id was actually reached again. Only if that was the case, the caller updates latest_tweet_id. But how to do that? Maybe call sys.exit with a special exit code?
Making the filter always return true if it's a retweet (like your modifications attempts to do, shouldn't it be slightly different?), and if it's a sticky post.

Just brainstorming after not thinking about it a while and only reading your post.

mikf commented 4 months ago

I've just had an idea on how to potentially prevent the issue with --abort when encountering an error midway (https://github.com/mikf/gallery-dl/issues/5255#issuecomment-1975012560): Instead of updating the archive after every downloaded file, store all archive IDs in memory and only write them to disk when all new files where downloaded without error.

trip54654 commented 4 months ago

Instead of updating the archive after every downloaded file, store all archive IDs in memory and only write them to disk when all new files where downloaded without error.

This sounds like a very good idea. Still might break easily with retweets and sticky posts.

mattdawolf commented 4 months ago

hm. This is just what i need. Guessing there's no easy way to tell it last X posts or last X days with no need to keep updating the date range values in the script. :3

mattdawolf commented 4 months ago

Instead of updating the archive after every downloaded file, store all archive IDs in memory and only write them to disk when all new files where downloaded without error.

This sounds like a very good idea. Still might break easily with retweets and sticky posts.

I agree!

mikf / gallery-dl

Make incremental downloading of new images easier (scraping) #5255