[FR] Twitter API support

mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites

GNU General Public License v2.0


[FR] Twitter API support #980

Open God-damnit-all opened 4 years ago

God-damnit-all commented 4 years ago

Twitter has made it increasingly difficult to scrape tweets through its web API and has been adding more and more checks to verify that the user is running a browser. On top of that, their web design changes frequently, usually invisibly but sometimes in full-on revamps, and certainly more often than most services' designs do.

Right now, gallery-dl has no way of downloading media from retweets, regardless of whether or not the option for it is enabled. Additionally, the /media URL will retrieve files the standard user URL misses - and there might still be files it isn't grabbing.

Because you can retrieve an entire user's timeline with just a few API calls, all it takes is parsing the JSON return for all media URLs and downloading them directly, and this method would not break any time soon.
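To illustrate the "parse the JSON for all media URLs" idea, here is a minimal sketch. It assumes tweet objects shaped like the v1.1 API's JSON, where media sits under `extended_entities` and videos list several `variants`. The function names and the sample data (including the `example.invalid` URLs) are made up for this example and are not gallery-dl code.

```python
from typing import Any, Dict, Iterable, List


def media_urls(tweet: Dict[str, Any]) -> List[str]:
    """Collect direct media URLs from one v1.1-style tweet object."""
    urls = []
    # Prefer "extended_entities"; fall back to "entities" if it is absent.
    entities = tweet.get("extended_entities") or tweet.get("entities") or {}
    for item in entities.get("media", []):
        variants = item.get("video_info", {}).get("variants", [])
        # Videos and GIFs list several variants; take the MP4 with the
        # highest bitrate. Photos have no variants at all.
        mp4 = [v for v in variants if v.get("content_type") == "video/mp4"]
        if mp4:
            best = max(mp4, key=lambda v: v.get("bitrate", 0))
            urls.append(best["url"])
        else:
            urls.append(item["media_url_https"])
    return urls


def timeline_media(tweets: Iterable[Dict[str, Any]]) -> List[str]:
    """Flatten the media URLs of every tweet in a timeline response."""
    return [url for tweet in tweets for url in media_urls(tweet)]


# Made-up sample data standing in for a parsed timeline response.
sample = [
    {"id_str": "1", "extended_entities": {"media": [
        {"type": "photo", "media_url_https": "https://example.invalid/a.jpg"},
    ]}},
    {"id_str": "2", "extended_entities": {"media": [
        {"type": "video",
         "media_url_https": "https://example.invalid/thumb.jpg",
         "video_info": {"variants": [
             {"content_type": "application/x-mpegURL",
              "url": "https://example.invalid/v.m3u8"},
             {"bitrate": 832000, "content_type": "video/mp4",
              "url": "https://example.invalid/lo.mp4"},
             {"bitrate": 2176000, "content_type": "video/mp4",
              "url": "https://example.invalid/hi.mp4"},
         ]}},
    ]}},
    {"id_str": "3", "full_text": "no media here"},
]

print(timeline_media(sample))
```

Tweets without media simply contribute nothing, so the same loop can run over a whole timeline page.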

Additionally, it would be able to retrieve tweets that can only be seen by 'approved followers', assuming the user has approved the account associated with the API key.

Lastly, and perhaps most importantly, this would allow the retrieval of users by ID rather than their current screen name, meaning that even if the user you want to download media from changes their screen name in the future, so long as you're retrieving their newest stuff by ID, gallery-dl would find it.

mikf commented 4 years ago

> Right now, gallery-dl has no way of downloading media from retweets, regardless of whether or not the option for it is enabled

It does, but only from the regular timeline (twitter.com/USER) because the media timeline doesn't include any retweets.

> Because you can retrieve an entire user's timeline with just a few API calls

The official API also has limits (~3000 tweets per timeline). You can/have to use Twitter's (advanced) search to get more than the web/official API allows.

> and this method would not break any time soon.

That would be the only reason for using the official API, but they've only just changed their web API after it had been "stable" for several years, and I doubt they are going to do that again in the near future.

> Additionally, it would be able to retrieve tweets that can only be seen by 'approved followers', assuming the user has approved the account associated with the API key.

You can do that with the current implementation. You just have to log in or use exported cookies.

> Lastly, and perhaps most importantly, this would allow the retrieval of users by ID rather than their current screen name

That can theoretically be done with the web API as well. Currently each screen name gets mapped to its ID with a /UserByScreenName API endpoint before the ID gets used in other API calls, and it would be quite easy to support URLs like https://twitter.com/intent/user?user_id=… as well.
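A small sketch of what accepting both URL styles could look like. It only covers the URL-parsing side, using the standard library; the function name and the returned tuple shape are hypothetical, and the actual lookup through /UserByScreenName is left out entirely.

```python
from typing import Tuple
from urllib.parse import parse_qs, urlparse


def parse_user_url(url: str) -> Tuple[str, str]:
    """Return ("id", value) or ("screen_name", value) for a user URL."""
    parts = urlparse(url)
    path = parts.path.strip("/")

    # https://twitter.com/intent/user?user_id=... refers to a numeric ID,
    # which stays valid even if the screen name changes later.
    if path == "intent/user":
        query = parse_qs(parts.query)
        if "user_id" in query:
            return ("id", query["user_id"][0])
        raise ValueError(f"no user_id in {url!r}")

    # Otherwise treat the first path segment as the screen name,
    # e.g. https://twitter.com/USER or https://twitter.com/USER/media
    screen_name = path.split("/")[0]
    if not screen_name:
        raise ValueError(f"no user reference in {url!r}")
    return ("screen_name", screen_name)


print(parse_user_url("https://twitter.com/intent/user?user_id=12345"))
print(parse_user_url("https://twitter.com/USER/media"))
```

A caller would only need the screen-name-to-ID mapping step for the second kind of result; an ID could be passed straight to the later API calls.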

God-damnit-all commented 4 years ago

> The official API also has limits (~3000 tweets per timeline). You can/have to use Twitter's (advanced) search to get more than the web/official API allows.

Are you saying the official API is limited to ~3000 most recent tweets, or that it will only return ~3000 tweets per call?

Because you can do API calls to retrieve tweets before a certain ID to essentially chain them together, unless it literally won't retrieve any tweet past the most recent 3000 even with that parameter.
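The chaining described here can be sketched as follows. This is an illustration only: `fetch_page` is a hypothetical callable standing in for one timeline request, and `fake_fetch_page` simulates a server holding ten tweets. In the v1.1 API `max_id` is inclusive, which is why the sketch subtracts 1 before the next call. Whatever cap the server enforces still applies: once it returns an empty page, the loop has nothing left to chain.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Tweet = Dict[str, Any]
FetchPage = Callable[[Optional[int]], List[Tweet]]


def iter_timeline(fetch_page: FetchPage) -> Iterator[Tweet]:
    """Walk a timeline backwards by chaining calls with max_id."""
    max_id: Optional[int] = None
    while True:
        page = fetch_page(max_id)
        if not page:
            # Either the timeline is exhausted or the server-side
            # limit has been reached; there is no way to tell which.
            return
        yield from page
        # max_id is inclusive, so step just below the oldest tweet seen.
        max_id = min(int(tweet["id_str"]) for tweet in page) - 1


def fake_fetch_page(max_id: Optional[int], count: int = 4) -> List[Tweet]:
    """Stand-in for an HTTP request: newest first, at most `count` tweets."""
    all_ids = range(110, 100, -1)  # pretend the server holds IDs 110..101
    ids = [i for i in all_ids if max_id is None or i <= max_id]
    return [{"id_str": str(i)} for i in ids[:count]]


print([tweet["id_str"] for tweet in iter_timeline(fake_fetch_page)])
```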

God-damnit-all commented 4 years ago

> Currently each screen name gets mapped to its ID with a /UserByScreenName API endpoint before the ID gets used in other API calls, and it would be quite easy to support URLs like https://twitter.com/intent/user?user_id=… as well.

This would definitely help.

brachna commented 4 years ago

> Right now, gallery-dl has no way of downloading media from retweets, regardless of whether or not the option for it is enabled. Additionally, the /media URL will retrieve files the standard user URL misses - and there might still be files it isn't grabbing.

I noticed that some retweets indeed don't hold "media" in them. Turns out that this code in pagination was the culprit:

if "retweeted_status_id_str" in tweet:
    retweet = tweets.get(tweet["retweeted_status_id_str"])
    if retweet:
        tweet["author"] = users[retweet["user_id_str"]]
    yield tweet

The retweet object is the one that holds both the media and the full text, so it should be yielded instead of tweet. Besides, retweet also holds the original posting date.
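One way the suggestion could look, as a self-contained sketch rather than the change that was actually made in gallery-dl. The `tweets` and `users` lookup tables mirror the names in the snippet above; the function name, the `retweeted_by_id_str` key, and the sample data are hypothetical.

```python
from typing import Any, Dict

Obj = Dict[str, Any]


def resolve_retweet(tweet: Obj, tweets: Dict[str, Obj],
                    users: Dict[str, Obj]) -> Obj:
    """Return the original tweet for a retweet, else the tweet itself."""
    retweet_id = tweet.get("retweeted_status_id_str")
    retweet = tweets.get(retweet_id) if retweet_id else None
    if not retweet:
        # Not a retweet, or the original is missing from the response.
        return tweet

    # Work on a shallow copy so the lookup table stays untouched.
    resolved = dict(retweet)
    resolved["author"] = users[retweet["user_id_str"]]
    # Hypothetical key: remember which timeline entry led here.
    resolved["retweeted_by_id_str"] = tweet["id_str"]
    return resolved


# Made-up sample data: tweet 20 is a retweet of tweet 10.
users = {"1": {"screen_name": "artist"}, "2": {"screen_name": "fan"}}
tweets = {
    "10": {"id_str": "10", "user_id_str": "1",
           "created_at": "original date",
           "entities": {"media": [{"media_url_https": "https://example.invalid/a.jpg"}]}},
    "20": {"id_str": "20", "user_id_str": "2",
           "created_at": "retweet date",
           "retweeted_status_id_str": "10", "entities": {}},
}

resolved = resolve_retweet(tweets["20"], tweets, users)
print(resolved["id_str"], resolved["author"]["screen_name"], resolved["created_at"])
```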

mikf commented 4 years ago

From https://developer.twitter.com/en/docs/twitter-api/v1/tweets/timelines/api-reference/get-statuses-user_timeline:

> This method can only return up to 3,200 of a user's most recent Tweets. Native retweets of other statuses by the user is included in this total, regardless of whether include_rts is set to false when requesting this resource.

> since_id | optional | Returns results with an ID greater than (that is, more recent than) the specified ID. There are limits to the number of Tweets that can be accessed through the API. If the limit of Tweets has occurred since the since_id, the since_id will be forced to the oldest ID available.

Pretty sure the official API doesn't allow you to retrieve any tweets older than the newest 3200 for each timeline.

There have been other issues asking about or discussing how to get all tweets from a user's timeline, but nobody found a definitive answer: #186 #544

If getting all tweets were as easy as using the official API, it would have already been used a long time ago.

God-damnit-all commented 4 years ago

@mikf I found a use case for the API (unfortunately). For some reason, even with include:nativeretweets, the search function misses a lot of retweets.

Now, while you can just scrape the account normally, you lose out on the ability to start from a specific tweet (i.e. since_id). When you use the since_id via the API search, it retrieves those retweets.

Right now my strategy is to build up a user's archive via search queries until present day, and then simply start searching from the last tweet id my script recorded, but doing it via the web search misses lots of recent retweets.
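For the "last tweet id my script recorded" part, a minimal sketch of recovering that ID from files already on disk. It assumes filenames start with the numeric tweet ID followed by an underscore (for example `<tweet_id>_<num>.<ext>`); that naming scheme and the function name are assumptions for this example, not something stated in the thread.

```python
import re
from typing import Iterable, Optional

# Assumed filename layout: "<tweet_id>_<anything>", e.g. "1300000000000000005_2.png"
_ID_PREFIX = re.compile(r"^(\d+)_")


def last_recorded_id(filenames: Iterable[str]) -> Optional[int]:
    """Return the highest tweet ID found in the filenames, or None."""
    ids = [
        int(match.group(1))
        for name in filenames
        if (match := _ID_PREFIX.match(name))
    ]
    # Tweet IDs grow over time, so the largest one is the newest tweet.
    return max(ids, default=None)


names = ["1300000000000000001_1.jpg", "1300000000000000005_2.png", "notes.txt"]
print(last_recorded_id(names))
```

In a real script the names could come from `os.listdir()` on the user's directory, and the result would be passed as the starting point of the next search.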

The next best thing is to set it to abort once a certain number of files have been skipped, but if a user happens to have a lot of retweets from an account I already have a lot of images from (since I sort them into directories by ID), it'll reach the limit erroneously... right now I'm just setting the limit to a particularly high number, like 100.
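The trade-off with a skip limit can be shown with a toy model. This sketch assumes the limit counts consecutive skips and resets after each new file; whether gallery-dl counts skips in exactly this way is not stated in the thread, and the function name and data are made up.

```python
from typing import Callable, Iterable, List


def collect_new(urls: Iterable[str],
                already_have: Callable[[str], bool],
                limit: int) -> List[str]:
    """Return new URLs, aborting after `limit` consecutive skips."""
    new: List[str] = []
    skipped = 0
    for url in urls:
        if already_have(url):
            skipped += 1
            if skipped >= limit:
                break  # assume everything older is already archived
        else:
            skipped = 0  # a new file resets the counter
            new.append(url)
    return new


# "b" and "c" stand for retweets that were already downloaded while
# scraping another account; "d" is new content hidden behind them.
archive = {"b", "c", "e", "f", "g"}
timeline = ["a", "b", "c", "d", "e", "f", "g", "h"]

print(collect_new(timeline, archive.__contains__, limit=3))
print(collect_new(timeline, archive.__contains__, limit=2))
```

With a limit of 3 the run gets past the two known retweets and still finds "d"; with a limit of 2 it aborts before reaching it, which is the premature abort described above.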

Twitter is such a wonderful platform.

God-damnit-all commented 4 years ago

I realize I'm not explaining this in the most coherent way. Let me describe my process.

Step 1 is accumulating a user's tweets via searches in 2-week intervals, starting from their join date. This works fine, but it doesn't retrieve all retweets.

Step 2 is running gallery-dl on their user page to grab all the missed retweets from their most recent ~3200 tweets. (I could probably switch Step 1 and Step 2 in some manner, so that it starts searching backward from the last tweet it could grab, but this just seems more foolproof.)

Step 3 is, from that point forward for that user, using a separate config set to abort once it hits a limit of 40 'file already exists' skips. I can't afford to set it too low, because if it grabs retweets I already have from scraping a different user's account, it'll abort prematurely, before I've grabbed all the new content.

What I would like is for Step 3 to switch to the API and have my script simply start grabbing from the last recorded tweet ID in the directory. That way it would not miss the retweets that the web search misses when using since_id.
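Step 1 of the process above can be sketched as a generator of search queries, one per 2-week window. The `from:`, `since:`, `until:` and `include:nativeretweets` operators are the usual Twitter search operators; the function name, the dates, and the assumption that `until:` is exclusive (so adjacent windows do not overlap) belong to this example only.

```python
from datetime import date, timedelta
from typing import Iterator


def search_windows(user: str, start: date, end: date,
                   days: int = 14) -> Iterator[str]:
    """Yield one search query per `days`-long window from start to end."""
    step = timedelta(days=days)
    current = start
    while current < end:
        stop = min(current + step, end)  # clamp the last window
        yield (
            f"from:{user} since:{current.isoformat()} "
            f"until:{stop.isoformat()} include:nativeretweets"
        )
        current = stop


# Hypothetical join date and end date.
for query in search_windows("USER", date(2020, 1, 1), date(2020, 2, 1)):
    print(query)
```

Each query string would then be fed to the search, oldest window first, until the present day is reached.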