mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

Twitter giving frequent rate limits #3557

Open Twi-Hard opened 1 year ago

Twi-Hard commented 1 year ago

Ever since the twitter extractor was fixed after it broke, I've been getting frequent rate limits. This doesn't happen with snscrape. I haven't tested other scrapers. I used to run 10+ instances of gallery-dl at a time, very fast and without rate limiting. snscrape scrapes as fast as usual (probably because it isn't creating a ton of files like gallery-dl) and doesn't get rate limited. Is there something that can be done to fix this? I've tried 2 different accounts with username and password, but that didn't fix the issue. Thanks :)

ClosedPort22 commented 1 year ago

Please do a test run using --verbose --ignore-config and post the log file.

mikf commented 1 year ago

I think this is because of cached guest tokens and Twitter reducing the rate limit for searches to 350 per 15m.

Twitter rate limits are bound to a guest token or account, and gallery-dl reuses the same guest token for up to one hour, even across multiple gallery-dl instances. snscrape on the other hand requests a new token each time it is run.
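A rough way to see why the cached token hurts parallel runs, using the numbers above (the helper function is purely illustrative, not gallery-dl code):

```python
# Back-of-the-envelope sketch (not gallery-dl code): Twitter allows
# roughly 350 search requests per 15-minute window per guest token or
# account.  With a cached token shared by N parallel gallery-dl
# instances, the window's budget is split N ways; with a fresh token
# per run (snscrape's behavior), each instance gets its own budget.
WINDOW_LIMIT = 350  # requests per 15 minutes, per token/account

def per_instance_budget(instances: int, shared_token: bool = True) -> int:
    """Requests each instance can make per 15-minute window."""
    return WINDOW_LIMIT // instances if shared_token else WINDOW_LIMIT

print(per_instance_budget(10, shared_token=True))   # 35: shared cached token
print(per_instance_budget(10, shared_token=False))  # 350: fresh token per run
```

So ten instances on one cached token leave each of them about 35 requests per window, which matches the sudden onset of rate limiting.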

You can prevent guest token reuse by disabling gallery-dl's cache: -o cache.file=
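If you'd rather put it in the config file, the equivalent setting should look something like this (a sketch; `cache.file` is a top-level option, and `null` disables the cache):

```json
{
    "cache": {
        "file": null
    }
}
```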

Twi-Hard commented 1 year ago

Disabling the cache fixes it if I also disable my username and password, but I still get rate limited when I'm logged in. I need to be logged in because a huge amount of the content I'm trying to get is NSFW (I'm not focused on NSFW, but it's still really common for many accounts). Is there anything I can do about this? I hope the many logins aren't a concern (I tried it with a concurrency of 10 to test it).

mikf commented 1 year ago

I need to be logged in

Then there is nothing that can be done, I'm afraid, or at least nothing that I'm aware of.

When you are logged in, you have a rate limit separate from any guest tokens, also 350 requests every 15 minutes, and it applies to all requests that your account sends.

Sending a guest token together with your login cookies does not help either (gallery-dl currently sends either a guest token when logged out or cookies when logged in, never both). In that case Twitter still uses your account's rate limit and ignores the token.

You might be able to use the syndication API while not logged in, if that's an option for you.


snscrape doesn't support login/cookies for Twitter, does it?

Twi-Hard commented 1 year ago

The snscrape dev has made it very clear he'll never add support for authentication (source). The reason I switched to gallery-dl for Twitter was that I was missing too many tweets because of the lack of authentication (and that led me to find many other good reasons to use gallery-dl too).

Perhaps there's a way to search only age-restricted tweets while logged in, after the rest of the download?

How well would the syndication api work for me? Would I still get every tweet I would have if I was logged in and is the metadata much different? Metadata is really important to me.

rautamiekka commented 1 year ago

I've used

        "twitter": {
            "sleep": 0.5,
            "sleep-request": 0.5
        },

together with a dummy account for a few weeks now: nowhere near as much rate limiting since they lowered the request limit, whereas

        # SFW.
        gallery-dl -v 'https://twitter.com/MidPrem' 'https://twitter.com/MidPrem/media'

alone always got rate limited a couple of times (I think), even with the archive file.

I chose 0.5 totally arbitrarily and it's most likely overkill, but we'll see when I can be bothered to start testing and crunching numbers.

Twi-Hard commented 1 year ago

I have way too many accounts to download for only 1 instance of the downloader to ever get through them. I usually have 10 running at once. Adding a 0.5 second delay wouldn't fix it for me.

rautamiekka commented 1 year ago

Yeah, your use case is too extreme for simple delays. Only now realized you were the OP, to boot.

ClosedPort22 commented 1 year ago

Would I still get every tweet I would have if I was logged in

Probably. As long as Twitter returns the IDs of age-restricted tweets there would be no difference.

is the metadata much different? Metadata is really important to me.

The only difference I'd noticed was the metadata for users. I implemented the syndication=extended option (#3483) specifically to solve this problem.
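In config-file form, enabling that option should look roughly like this (a sketch based on the option name above):

```json
{
    "extractor": {
        "twitter": {
            "syndication": "extended"
        }
    }
}
```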

The caveat with the syndication API is that it needs to be called for each age-restricted tweet, so you're probably going to run into rate limits as well.

I have way too many accounts to download for only 1 instance of the downloader to ever get through them.

There's always the option of investing in a Raspberry Pi and letting your download jobs run 24/7. I don't have a lot of accounts to download, so I don't mind setting a 10-second delay and letting it run for several days.

KonoVitoDa commented 11 months ago

I think this is because of cached guest tokens and Twitter reducing the rate limit for searches to 350 per 15m.

Was the rate limit reduced even more? I'm only able to download 50 posts every 15 minutes. I'm using an input file with a bunch of links.

Kavolc commented 10 months ago

Was the rate limit reduced even more? I'm only able to download 50 posts every 15 minutes. I'm using an input file with a bunch of links.

Same here. I tried with an old and a new account, and I can only download/see 50 posts every 15 minutes.