Open makamys opened 4 years ago
I had to change line 83 to: video_id = re.search(r"https://pbs.twimg.com/ext_tw_video_thumb/(.*)\.jpg", str(video_div)).group(1) [tweet_video_thumb --> ext_tw_video_thumb] to get the proper video image URL. Unfortunately, this doesn't provide the proper video_url. Any idea how to derive the video_url from the video image URL?
Oh dang, it looks like it wasn't as simple as I was hoping. It turns out short videos have the thumbnail image in a format like tweet_video_thumb/<VIDEO ID>.jpg, and for those, my code works.
But longer videos are in the format of ext_tw_video_thumb/<TWEET ID>/pu/img/<THUMBNAIL ID>.jpg like you posted. Those videos are streamed via HLS, and the web app makes an API call (https://api.twitter.com/1.1/videos/tweet/config/<TWEET ID>.json) to find the m3u8 that contains the segments (which is in the form of https://video.twimg.com/ext_tw_video/<TWEET_ID>/pu/pl/<VIDEO ID>.m3u8).
Using <THUMBNAIL ID> as the <VIDEO ID> doesn't work though, and there's no reference to the <VIDEO ID> in the html served. So there may not be a way to get the video url without making an API call.
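For anyone picking this up, here is a rough sketch of the first step for long videos: pulling the tweet ID out of the ext_tw_video_thumb thumbnail URL and building the config endpoint URL mentioned above. The function name and the character class used for the IDs are my own assumptions, not part of twitterscraper, and the sketch does not make the API call itself (which, as noted, likely needs a token).

```python
import re
from typing import Optional

# Thumbnail format for long videos, as described above:
#   ext_tw_video_thumb/<TWEET ID>/pu/img/<THUMBNAIL ID>.jpg
# Assumption: both IDs consist only of word characters and hyphens.
LONG_THUMB = re.compile(
    r"https://pbs\.twimg\.com/ext_tw_video_thumb/([\w-]+)/pu/img/([\w-]+)\.jpg"
)

# Config endpoint format taken from the comment above.
CONFIG_URL = "https://api.twitter.com/1.1/videos/tweet/config/{tweet_id}.json"


def config_url_from_thumbnail(html: str) -> Optional[str]:
    """Return the config API URL for a long video, or None if the HTML
    contains no ext_tw_video_thumb thumbnail (hypothetical helper)."""
    match = LONG_THUMB.search(html)
    if match is None:
        return None
    tweet_id = match.group(1)  # group(2) is the thumbnail ID, not the video ID
    return CONFIG_URL.format(tweet_id=tweet_id)
```

The thumbnail ID is deliberately ignored, since it cannot stand in for the video ID in the m3u8 URL.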
By the way, youtube-dl uses the API with a guest token to get the video url (see twitter.py, relevant discussion here).
As a workaround, the video url could be set to the tweet's url so at least tweets with videos don't get skipped. My use case for twitterscraper didn't include scraping tweets with long videos though, so I won't be fixing this myself, but hopefully these notes will be useful to someone else.
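The workaround could look roughly like this. The helper name and its arguments are hypothetical and only illustrate the fallback idea; they are not existing twitterscraper code.

```python
from typing import Optional


def video_url_or_fallback(video_url: Optional[str], tweet_url: str) -> str:
    """Use the scraped video url when there is one; otherwise fall back
    to the tweet's own url so the tweet is still kept (hypothetical helper)."""
    return video_url if video_url else tweet_url
```

This way a tweet whose video url can't be resolved still gets scraped, and the caller can tell the two cases apart by comparing the result against the tweet url.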
The HTML element that the video url was scraped from no longer exists, so video_div.find('a') returned None, which caused tweets containing videos to fail to be scraped. I changed it to use a regex to extract the video id and construct the video url from it.
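A minimal sketch of the regex approach for the short-video thumbnail format, with a guard so that a missing match yields None instead of raising. The function name and the ID character class are assumptions for illustration; the actual change in the code may differ, and the video url construction is left out because its format isn't spelled out here.

```python
import re
from typing import Optional

# Short-video thumbnails look like tweet_video_thumb/<VIDEO ID>.jpg.
# Assumption: the video ID contains only word characters and hyphens.
SHORT_THUMB = re.compile(
    r"https://pbs\.twimg\.com/tweet_video_thumb/([\w-]+)\.jpg"
)


def extract_video_id(video_div_html: str) -> Optional[str]:
    """Return the video ID found in the thumbnail URL, or None when the
    markup has no short-video thumbnail (hypothetical helper)."""
    match = SHORT_THUMB.search(video_div_html)
    # Checking the match first avoids an AttributeError on .group(1).
    return match.group(1) if match else None
```

Called as extract_video_id(str(video_div)), this returns None for long videos rather than failing, so the caller can decide how to handle them.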