internetarchive / brozzler

brozzler - distributed browser-based web crawler
Apache License 2.0

Videos on Twitter captures #198

Open laurelin88 opened 4 years ago

laurelin88 commented 4 years ago

Hi,

Before I describe the issue, I will preface by saying I am very new to brozzler and similar tools in general, so perhaps my question is a bit simplistic. Anyway, I was wondering if you have any pointers as to why videos on some hashtag feeds I captured do not seem to play when I view them in pywb. Is there any configuration change I could make to solve this, for Twitter or for other social media platforms/websites? Thank you!

galgeek commented 4 years ago

brozzler depends on youtube-dl for much video capture, so make sure that your youtube-dl install is up to date (it's updated pretty frequently). You can update your brozzler virtualenv with pip install -U youtube-dl.

Twitter has recently updated its video and hashtag code, and we're actively working on improving capture.

laurelin88 commented 4 years ago

Thank you for your response, and apologies for my late reply - I did indeed update youtube-dl, but the problem persists. In fact, after checking the WARC file with the ArchiveTools warc-extractor, it turns out that the dump extracted from the WARC does contain TS video files that are accessible, but they are not replayable from within the WARC itself, e.g. with pywb.

galgeek commented 4 years ago

What's the twitter url you're trying to capture?

anjackson commented 4 years ago

I'll gladly be corrected on this, but AFAIK right now there is no openly-available web archive playback system that can play the videos captured in this way.

The pywb stack massages the messages between the client and server so that playback works without additional metadata records. The approach used here (capturing the videos with youtube-dl and storing a JSON metadata record that links the source page to the videos) requires an additional step which is only just now being finalised in pywb, and will require a little more work at the indexing stage to make it work (mapping metadata:... records to urn:embeds:...). /cc @ikreymer

galgeek commented 4 years ago

brozzler with youtube-dl currently captures mp4s for at least some twitter video. (Whether and when youtube-dl captures video from a site can depend on the format of the upload, as well as the site's video hosting pipeline.) Here's one that I worked on recently: https://wayback.qa-archive-it.org/12058/20200601214942/https://video.twimg.com/ext_tw_video/1056575394453839872/pu/vid/1280x720/dhWsaVXAvomMyBG-.mp4?tag=5

brozzler often directly captures the initial segments of media that's delivered in segments, which may be how @laurelin88's TS video files were captured. It's true that these are a challenge to replay.

ikreymer commented 4 years ago

Yes, as @anjackson mentions, with the youtube-dl approach it would be possible to read the youtube-dl JSON to determine the URLs of videos downloaded via youtube-dl. pywb now supports urn:embeds: for a more generic JSON embeds format, while brozzler saves the youtube-dl records as youtube-dl:<id>:<url>.

It would be possible to support youtube-dl:... lookup in pywb as well, but replay would still need to determine where each video should go on the page; if there is more than one video, that may not be possible to guess, so video playback may not always work.
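For illustration, here is a minimal sketch of reading the video URLs out of a youtube-dl info JSON blob. It assumes the standard youtube-dl info-dict layout (a top-level "url" for the selected format plus per-format "url" entries); the function name and structure are mine, not brozzler's or pywb's actual code:

```python
import json

def video_urls_from_info_json(info_json: str) -> list[str]:
    """Extract candidate video URLs from a youtube-dl info JSON blob.

    Sketch only: assumes the standard youtube-dl info dict, which has a
    top-level "url" for single-format downloads and a "formats" list
    whose entries each carry their own "url".
    """
    info = json.loads(info_json)
    urls = []
    if info.get("url"):
        urls.append(info["url"])
    for fmt in info.get("formats", []):
        if fmt.get("url"):
            urls.append(fmt["url"])
    # de-duplicate while preserving order
    seen = set()
    return [u for u in urls if not (u in seen or seen.add(u))]
```

Even with the URLs in hand, the harder problem described above remains: knowing which URL belongs to which player element on the replayed page.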

Fortunately, there is also an alternative solution, which pywb has supported for a while: it works with HTML5 video and does not involve youtube-dl at all.

When encountering an HLS or DASH manifest, it is possible to rewrite it at capture time so that only one resolution is available. For example, given an HLS manifest (.m3u8) file that looks like this:

#EXTM3U
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="WebVTT",NAME="English",DEFAULT=YES,AUTOSELECT=YES,FORCED=NO,URI="https://example.com/subtitles/"
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=610000,RESOLUTION=640x360,CODECS="avc1.66.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_1.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=416000,RESOLUTION=400x224,CODECS="avc1.66.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_2.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=797000,RESOLUTION=640x360,CODECS="avc1.66.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_3.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1002000,RESOLUTION=640x360,CODECS="avc1.77.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_4.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=2505000,RESOLUTION=1280x720,CODECS="avc1.77.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_5.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=4495000,RESOLUTION=1920x1080,CODECS="avc1.640028, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_6.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=38000,CODECS="mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/audio_0.m3u8

The rewriting simply removes all resolutions except the desired one. The desired resolution can be the highest up to a configured maximum; if 1920x1080 is too much, the second-highest (1280x720) is chosen. The file served to the browser then looks like this (while the original is still written to the WARC):

#EXTM3U
#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="WebVTT",NAME="English",DEFAULT=YES,AUTOSELECT=YES,FORCED=NO,URI="https://example.com/subtitles/"
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=2505000,RESOLUTION=1280x720,CODECS="avc1.77.30, mp4a.40.2",SUBTITLES="WebVTT"
http://example.com/video_5.m3u8

Then, on replay, if the same rewriting is applied, all of the video's chunks will play back, since only one resolution is available. This makes videos on Twitter (and many other sites that use HLS) work. DASH is a similar XML-based format that allows for the same type of filtering.

imo this little bit of rewriting/filtering at capture time is a useful tradeoff: it avoids the complexity of youtube-dl, the extra index, and replay-time video index mapping, and results in working video replay. The main downside is that the video is archived in chunks rather than as one record/stream.