taspinar / twitterscraper

Scrape Twitter for Tweets
MIT License

Fix video url scraping #285

Open makamys opened 4 years ago

makamys commented 4 years ago

The HTML element that the video URL was scraped from no longer exists, so video_div.find('a') returned None, which caused scraping to fail for tweets containing videos. I changed it to use a regex to extract the video id and construct the video URL from it.
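A minimal sketch of the regex approach described above, not the PR's actual code: `extract_video_id` and the sample markup are hypothetical, and the pattern assumes the thumbnail URL has the form tweet_video_thumb/<VIDEO ID>.jpg. It returns None when no thumbnail is found, so a missing match does not raise.

```python
import re

# Assumed thumbnail format: https://pbs.twimg.com/tweet_video_thumb/<VIDEO ID>.jpg
THUMB_RE = re.compile(r"https://pbs\.twimg\.com/tweet_video_thumb/([\w-]+)\.jpg")


def extract_video_id(video_div_html):
    """Return the video id found in the div's markup, or None if absent."""
    match = THUMB_RE.search(video_div_html)
    return match.group(1) if match else None


# Hypothetical markup, for illustration only.
sample = (
    '<div style="background-image:'
    "url('https://pbs.twimg.com/tweet_video_thumb/AbC123xyz.jpg')\"></div>"
)
video_id = extract_video_id(sample)
```

Callers would pass `str(video_div)` and skip URL construction when the result is None.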

someguy-2020 commented 4 years ago

I had to change line 83 to: video_id = re.search(r"https://pbs.twimg.com/ext_tw_video_thumb/(.*)\.jpg", str(video_div)).group(1) [tweet_video_thumb --> ext_tw_video_thumb] to get the proper video image URL. Unfortunately, this doesn't provide the proper video_url. Any idea how to derive the video_url from the video image URL?

makamys commented 4 years ago

Oh dang, it looks like it wasn't as simple as I was hoping. It turns out short videos have the thumbnail image in a format like tweet_video_thumb/<VIDEO ID>.jpg, and for those, my code works.

But longer videos use the format ext_tw_video_thumb/<TWEET ID>/pu/img/<THUMBNAIL ID>.jpg, like you posted. Those videos are streamed via HLS, and the web app makes an API call (https://api.twitter.com/1.1/videos/tweet/config/<TWEET ID>.json) to find the m3u8 playlist that lists the segments (which has the form https://video.twimg.com/ext_tw_video/<TWEET ID>/pu/pl/<VIDEO ID>.m3u8).
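To make the two thumbnail formats concrete, here is a rough sketch that tells them apart and builds the config API URL for the long-video case. The function names and the returned dict layout are hypothetical, the patterns assume tweet ids are numeric, and no request is actually sent.

```python
import re

# Short videos: tweet_video_thumb/<VIDEO ID>.jpg
SHORT_RE = re.compile(r"tweet_video_thumb/([\w-]+)\.jpg")
# Long videos: ext_tw_video_thumb/<TWEET ID>/pu/img/<THUMBNAIL ID>.jpg
LONG_RE = re.compile(r"ext_tw_video_thumb/(\d+)/pu/img/([\w-]+)\.jpg")


def classify_thumbnail(url):
    """Return a dict describing the thumbnail, or None if it matches neither format."""
    match = LONG_RE.search(url)
    if match:
        return {
            "kind": "long",
            "tweet_id": match.group(1),
            "thumbnail_id": match.group(2),
        }
    match = SHORT_RE.search(url)
    if match:
        return {"kind": "short", "video_id": match.group(1)}
    return None


def config_url(tweet_id):
    """Build the config endpoint URL mentioned above for a given tweet id."""
    return "https://api.twitter.com/1.1/videos/tweet/config/{}.json".format(tweet_id)


info = classify_thumbnail(
    "https://pbs.twimg.com/ext_tw_video_thumb/1234567890/pu/img/AbC-dEf_1.jpg"
)
```

Only the "short" case yields a usable video id; for "long", the thumbnail id is kept purely for reference.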

Using <THUMBNAIL ID> as the <VIDEO ID> doesn't work though, and there's no reference to the <VIDEO ID> in the html served. So there may not be a way to get the video url without making an API call.

By the way, youtube-dl uses the API with a guest token to get the video url (see twitter.py, relevant discussion here).


As a workaround, the video url could be set to the tweet's url so at least tweets with videos don't get skipped. My use case for twitterscraper didn't include scraping tweets with long videos though, so I won't be fixing this myself, but hopefully these notes will be useful to someone else.
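The workaround could look something like this sketch; `resolve_video_url` is a hypothetical helper, not an existing twitterscraper function, and the URLs are placeholders.

```python
def resolve_video_url(video_url, tweet_url):
    """Use the extracted video URL when there is one, else fall back to the tweet's URL."""
    return video_url if video_url else tweet_url


# Extraction failed (e.g. a long video), so the tweet URL is used instead.
fallback = resolve_video_url(None, "https://twitter.com/user/status/1234567890")
```

This way a tweet with a long video is still scraped, just without a direct link to the video.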