stockbsd / twitter-media-dl

download twitter media (photos and videos)
MIT License

Not getting all images #2

Open Honowski opened 4 years ago

Honowski commented 4 years ago

I've noticed this only scrapes about half of the total images on a person's Twitter account. I believe it grabs all the videos.

MonoS commented 4 years ago

I second that. I tried a couple of users, and this library downloads only the most recent images (I use JDownloader (JD) to count the media, by the way). I've also tried raising --limit to, for example, 20000, but to no avail.

After adding some logging to the code, I think the issue is on Twitter's side, as the program reaches line 126 and exits.

stockbsd commented 4 years ago

Twitter imposes limits on its APIs; see, for example, https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline

If we try to get up to 20000 tweets at once, maybe we will reach the limit.
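
To make the limit concrete: if I read the linked reference correctly, a single user_timeline request returns at most 200 tweets (`count`), and the endpoint only reaches back about 3,200 of a user's most recent tweets, so a large --limit has to be served by paging with `max_id`. Below is a minimal sketch of that paging loop. It is not the code in twitter-media-dl; `fetch_page` and `make_fake_api` are hypothetical stand-ins so the loop can run without network access.

```python
from typing import Callable, Dict, List, Optional

Tweet = Dict[str, int]
# Hypothetical fetcher: (count, max_id) -> tweets, newest first.
FetchPage = Callable[[int, Optional[int]], List[Tweet]]

PAGE_SIZE = 200  # per-request maximum for `count`, per the linked reference


def fetch_timeline(fetch_page: FetchPage, limit: int) -> List[Tweet]:
    """Collect up to `limit` tweets by walking backwards with max_id."""
    tweets: List[Tweet] = []
    max_id: Optional[int] = None
    while len(tweets) < limit:
        page = fetch_page(min(PAGE_SIZE, limit - len(tweets)), max_id)
        if not page:
            break  # the API has nothing older to give us
        tweets.extend(page)
        # max_id is inclusive, so step just below the oldest tweet seen.
        max_id = min(t["id"] for t in page) - 1
    return tweets


def make_fake_api(total: int) -> FetchPage:
    """Assumption for illustration: a fake timeline with IDs total..1."""
    ids = list(range(total, 0, -1))

    def fetch_page(count: int, max_id: Optional[int]) -> List[Tweet]:
        eligible = [i for i in ids if max_id is None or i <= max_id]
        return [{"id": i} for i in eligible[:count]]

    return fetch_page


if __name__ == "__main__":
    got = fetch_timeline(make_fake_api(450), limit=20000)
    print(len(got))
```

With a loop like this, asking for 20000 simply stops when the API runs out of older tweets, so the requested limit by itself should not explain missing media.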

MonoS commented 4 years ago

The problem is a bit different, as I'm not hammering the API. For example, I want to download media for this user: I launch the program and it downloads some of the media (55 files, to be precise). I expected those to be the most recent ones, but when I download the same user's data with JDownloader, starting from the same ID, it downloads 114 files, so some are missing. Am I right in assuming that IDs increase monotonically across all Twitter users? If so, and if older media can't be downloaded because of some API limit, shouldn't both programs download the same number of files?
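
On the ID question: as far as I know, tweet IDs created since late 2010 are "Snowflake" IDs whose upper bits encode a millisecond timestamp, so they are roughly time-ordered across all users (not strictly sequential, but close enough to compare two lists). A minimal sketch of decoding that timestamp follows; the function name `tweet_time` is mine, and the two sample IDs are ones quoted in this thread.

```python
from datetime import datetime, timezone

# Snowflake epoch used by Twitter, in milliseconds since the Unix epoch.
TWITTER_EPOCH_MS = 1288834974657


def tweet_time(tweet_id: int) -> datetime:
    """Approximate creation time encoded in a Snowflake tweet ID."""
    ms = (tweet_id >> 22) + TWITTER_EPOCH_MS  # drop the 22 low bits
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


if __name__ == "__main__":
    for tid in (1246422948057153536, 1250701846479544320):
        print(tid, tweet_time(tid).isoformat())
```

A larger ID decodes to a later time, which is why sorting by ID and truncating both lists at the same ID is a reasonable way to line them up.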

stockbsd commented 4 years ago

The exclude_replies/include_rts API parameters may cause the difference.
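
To show how those two parameters change the count, here is a rough sketch that applies the same filtering client-side. It assumes v1.1-style tweet objects, where a reply carries a non-null `in_reply_to_status_id` and a retweet carries a `retweeted_status` object; the function name and the sample data are made up for illustration.

```python
from typing import Any, Dict, List

Tweet = Dict[str, Any]


def filter_timeline(tweets: List[Tweet],
                    exclude_replies: bool = False,
                    include_rts: bool = True) -> List[Tweet]:
    """Mimic the effect of exclude_replies / include_rts on a tweet list."""
    kept = []
    for tweet in tweets:
        if exclude_replies and tweet.get("in_reply_to_status_id") is not None:
            continue  # drop replies
        if not include_rts and "retweeted_status" in tweet:
            continue  # drop plain retweets
        kept.append(tweet)
    return kept


if __name__ == "__main__":
    sample = [
        {"id": 3},                                  # original tweet
        {"id": 2, "in_reply_to_status_id": 1},      # reply
        {"id": 1, "retweeted_status": {"id": 99}},  # retweet
    ]
    print(len(filter_timeline(sample)))
    print(len(filter_timeline(sample, exclude_replies=True, include_rts=False)))
```

Two tools that differ in either setting will report different media counts for the same timeline, even when both stop at the same oldest ID.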

MonoS commented 4 years ago

The "Force grab media" option in JDownloader is enabled (so it won't crawl "retweets and other content from users' timelines"), and I don't use the --rts flag. I've also re-checked some of the missing media: some of them are indeed replies, but others aren't.

stockbsd commented 4 years ago

A "reply tweet" or a "retweet with comment" can cause a difference if there is media in both the reply and the original tweet. Can you give the user ID, or the IDs of the tweets with missing media?

MonoS commented 4 years ago

Sure. The user ID is AngelOfGears. As for the tweet IDs: twitter-dl.txt JDownloader.txt. Those are all the IDs that JD and twitter-dl returned; I sorted them and then truncated JD's list so that both lists start at the same ID.
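
For anyone who wants to repeat the comparison, here is a small sketch of how two ID lists could be diffed. It assumes each file holds one numeric tweet ID per line, which may not match the attachments exactly; `read_ids` and `compare_id_lists` are names I made up.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def read_ids(path: str) -> List[int]:
    """Assumed format: one numeric tweet ID per line, blank lines ignored."""
    with open(path, encoding="utf-8") as handle:
        return [int(line) for line in handle if line.strip()]


def compare_id_lists(ours: Iterable[int],
                     theirs: Iterable[int]) -> Tuple[Set[int], Set[int], List[int]]:
    """Return (IDs only in theirs, IDs only in ours, IDs repeated in ours)."""
    ours = list(ours)
    ours_set, theirs_set = set(ours), set(theirs)
    missing = theirs_set - ours_set
    extra = ours_set - theirs_set
    duplicates = sorted(i for i, n in Counter(ours).items() if n > 1)
    return missing, extra, duplicates


if __name__ == "__main__":
    missing, extra, duplicates = compare_id_lists([3, 3, 5, 7], [3, 4, 5])
    print(sorted(missing), sorted(extra), duplicates)
```

Typical use would be `compare_id_lists(read_ids("twitter-dl.txt"), read_ids("JDownloader.txt"))`. Reporting duplicates separately matters because repeated IDs inflate the raw line count of a list.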

Let me know if you need more information and thank you for your time :)

stockbsd commented 4 years ago
  1. twitter-dl.txt has several duplicate IDs. Looping over it with twitter-dl --rts --video downloads 55 files and skips 60 times.
  2. twitter-dl --video --rts -v -l 200 AngelOfGears . downloads 157 files.

Can you check whether JD produces different file names for the same content?
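
One way to check this is to group the downloaded files by a hash of their bytes. The sketch below is only an illustration: the function names are mine, and it only catches byte-identical files, so a re-encoded or resized copy of the same picture would not be grouped.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large videos are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_content(directory: str) -> Dict[str, List[str]]:
    """Map each content hash to the file names sharing it (2+ files only)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            groups[sha256_of(path)].append(path.name)
    return {h: names for h, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    for digest, names in find_duplicate_content(".").items():
        print(digest[:12], names)
```

Running it on the JD download folder would list every set of differently named files that have the same content.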

MonoS commented 4 years ago

JD did download files with the same content but from different tweet IDs, for example 1250701846479544320 and 1246422948057153536, while twitter-dl downloaded only the latter. Also, I don't understand why you added the --rts flag; JD does not download retweets (unless you uncheck an option in the settings).