mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0
10.66k stars 880 forks

Twitter how to continue aborted download from where it stopped #5753

Open rexon113 opened 1 week ago

rexon113 commented 1 week ago

Currently I'm trying to scrape various Twitter accounts. There is an input file consisting of Twitter account URLs; gallery-dl takes these and downloads them. An archive file and a cookies.txt file are also being used.

"archive": "F:/{category}",

"twitter":
{
    "cards": false,
    "conversations": false,
    "pinned": false,
    "quoted": false,
    "replies": true,
    "retweets": false,
    "strategy": null,
    "text-tweets": false,
    "twitpic": false,
    "unique": true,
    "users": "user",
    "videos": true,
    "filename": "{date:%Y%m%d}-{tweet_id}-{num}.{extension}",
    "directory": ["{category}", "{author[name]}"],
    "sleep-request": 2,
    "sleep": 2
},
"logfile": {
    "mode": "w",
    "format": {
        "debug"  : "[{asctime}][{levelname}][{name}] {message}",
        "info"   : "[{asctime}][{levelname}][{name}]  {message}",
        "warning": "[{asctime}][{levelname}][{name}]  {message} [Source URL: {extractor.url}]",
        "error"  : "[{asctime}][{levelname}][{name}]  {message} [Source URL: {extractor.url}]"
    },
    "format-date": "%Y-%m-%d-%H-%M-%S"
},

Command: gallery-dl -i profileurls.txt --write-log twitterlog.txt -v --cookies cookies.txt --user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"

The problem is that gallery-dl sometimes gets logged out, and I see no way to continue the download from where it stopped without rerunning the whole thing. This seems to happen especially when trying to grab very old tweets.

So far I haven't found a way to resume where it left off. Because of the archive file, restarting gallery-dl will skip the already downloaded images, but it still makes the GET requests to search for tweets. So what often happens is: gallery-dl gets logged out, I restart it, it skips the already downloaded files thanks to the archive file (denoted with # in the log), and then it gets logged out again. So I'm constantly stuck on the same account.

What I've done so far to try to solve this is to shuffle the URLs randomly each time I feed in a new cookies.txt file. This approach is not really intelligent, but it works. However, gallery-dl obviously still gets stuck sometimes on a specific account. I need a way to continue downloading a Twitter account after gallery-dl got logged out. Instagram has a cursor, but for Twitter I cannot find one (even though there is one in the code?). Another option is using the date range option to limit the time: say gallery-dl got stuck at a tweet from 2016, then continue downloading only from 2016 and earlier. That works, but it is a bit tiresome.
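The date-range workaround can be partly automated. This sketch assumes the archive file is the SQLite database gallery-dl normally writes (a table named `archive` with one `entry` string per file) and that the twitter entries begin with the numeric tweet ID; both are assumptions, so check your own file with the sqlite3 CLI first. Tweet IDs are snowflakes, so the smallest archived ID belongs to the oldest downloaded tweet, and its creation time can be decoded from the upper bits:

```python
import sqlite3

TWITTER_EPOCH_MS = 1288834974657  # epoch used by Twitter snowflake IDs

def oldest_archived_tweet(archive_path):
    """Return (tweet_id, unix_seconds) for the oldest tweet entry in a
    gallery-dl archive DB, or None if no numeric entries are found.
    Assumed schema: table `archive`, single text column `entry`."""
    con = sqlite3.connect(archive_path)
    try:
        rows = con.execute("SELECT entry FROM archive").fetchall()
    finally:
        con.close()
    ids = []
    for (entry,) in rows:
        head = entry.split("_", 1)[0]  # assumption: entry starts with the tweet ID
        if head.isdigit():
            ids.append(int(head))
    if not ids:
        return None
    oldest = min(ids)
    # snowflake IDs store a millisecond timestamp in the bits above bit 22
    seconds = ((oldest >> 22) + TWITTER_EPOCH_MS) // 1000
    return oldest, seconds
```

The returned timestamp could then be used as the date cutoff for the next run instead of eyeballing it from the log.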

Is there an easier approach to continuing a Twitter download for a specific user if gallery-dl got aborted at some point, especially when feeding a list of URLs? If not, I'd be content with an approach for a single URL.

mikf commented 6 days ago

For input files, you can use -I/--input-file-comment or -x/--input-file-delete to mark URLs as completed. Maybe make a copy of the original file first.
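To make the marking behavior concrete (this is an illustration of the mechanism, not gallery-dl's actual code): with -I, a URL that finishes successfully is commented out in the input file in place, which is why working on a copy is advisable. Roughly:

```python
def mark_done(path, url):
    """Comment out the first uncommented line equal to `url`, rewriting the
    file in place -- an emulation of what -I/--input-file-comment amounts to."""
    with open(path) as f:
        lines = f.read().splitlines()
    for i, line in enumerate(lines):
        if line.strip() == url and not line.lstrip().startswith("#"):
            lines[i] = "# " + line
            break
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

On the next run, commented lines are skipped, so the download effectively resumes at the first unmarked URL.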

For a specific user, there's nothing else except what you already mentioned. Cursor support would be nice, but it would need a bit more work than for IG.

Regarding the getting-logged-out issue, there is relogin, but you'd need to use username & password to log in, which might prompt for extra input (2FA, email code, etc.).
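For completeness, credential-based login would be configured roughly like this (placeholder values; as far as I can tell the relevant options are `username` and `password` under the twitter extractor, but verify against the configuration docs):

```json
"twitter": {
    "username": "your-handle",
    "password": "your-password"
}
```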

julianw1010 commented 6 days ago

So far, I have tried -I and -x but found them to be inconsistent, especially regarding errors and warnings. Sometimes lines would not get commented out/removed when they should have, e.g. because of a single warning/error somewhere in the process. This happens, for example, when a rate limit on Instagram occurs (does gallery-dl return a different exit code when a rate limit occurs, even though it didn't hinder the process because gallery-dl just waits?).

Or sometimes all subsequent lines would be commented out/removed if, for example, Twitter got logged out and every URL then throws an error.

I am aware that editing input files goes beyond the purpose of gallery-dl, or a file scraper in general, as discussed in #4732. I've also read the warning that -I and -x might be buggy. A copy of the input file is always a good idea when modifying the input. The Python script mentioned in #4732 works quite well when considering exit codes, although I haven't quite figured the exit codes out yet, and it seems too sensitive to warnings/errors like rate limits, showing the same behaviour as described above with -I/-x.
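A wrapper along the lines of the #4732 script can be made less fragile by checking the exit code of each URL individually and recording a URL as done only on a clean exit 0. The done-file scheme and helper names below are my own, not part of gallery-dl; gallery-dl returns various nonzero codes for different error kinds, and treating any nonzero value as "not done" sidesteps having to interpret them:

```python
import subprocess

def gallery_dl(url):
    """Run gallery-dl on a single URL; extra flags (cookies, config) go here."""
    return subprocess.run(["gallery-dl", url]).returncode

def resume_list(input_path, done_path, run=gallery_dl):
    """Process URLs one at a time. A URL is appended to the done file only on
    exit code 0, so a rerun resumes at the first unfinished URL."""
    with open(input_path) as f:
        urls = [l.strip() for l in f if l.strip() and not l.startswith("#")]
    try:
        with open(done_path) as f:
            done = set(f.read().split())
    except FileNotFoundError:
        done = set()
    for url in urls:
        if url in done:
            continue
        if run(url) == 0:
            done.add(url)
            with open(done_path, "a") as f:
                f.write(url + "\n")
        else:
            break  # probably logged out: stop so this URL is retried next run
```

Stopping at the first failure (rather than skipping ahead) avoids the cascade where a logout marks every remaining URL as failed.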

Regarding relogin/login, cookies seem to work much better than login fields, so that's what I'm using almost exclusively. Most of the time with login fields, I get 2FA issues, or rate limits/logouts occur much quicker.