openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format
https://pypi.org/project/warc2zim/
GNU General Public License v3.0

Youtube video not displaying placeholder image and not playing #262

Closed kelson42 closed 4 months ago

kelson42 commented 4 months ago

At https://dev.library.kiwix.org/viewer#mes-quartiers-chinois_fr_all_2024-05/mesquartierschinois.wordpress.com/

The video poster is not displayed (black video) and playing the video does not work.

benoit74 commented 4 months ago

The impact is wider than just Firefox on iOS 17.5; it is simply broken.

I ran a new crawl and conversion, and this time the video plays in Firefox on macOS 12.7.4. We are only missing the placeholder image on a few YouTube videos (due to what look like obscure YouTube player decisions about which image resolution to use).

benoit74 commented 4 months ago

I did a small test run of zimit/warc2zim on only the homepage of mesquartierschinois.

First, https://replayweb.page/ has the same problem displaying the video thumbnail / placeholder, so this is not a bug "by design" in warc2zim but rather a consequence of the YouTube player's dynamic behavior.

Analyzing the YouTube player's behavior a bit, it looks like all video thumbnails/placeholders are served from the i.ytimg.com domain.

WARC records from this domain present in the test WARC are:

https://i.ytimg.com/vi/3CAmD8BAqG0/hqdefault.jpg?sqp=-oaymwEmCOADEOgC8quKqQMa8AEB-AHUBoAC4AOKAgwIABABGHIgQyg5MA8=&rs=AOn4CLBPP2EMoCMYfmxlew67Rb4WEd4cLA
https://i.ytimg.com/vi/fCAhUkOePf4/hqdefault.jpg?sqp=-oaymwEmCOADEOgC8quKqQMa8AEB-AG2BIACwAKKAgwIABABGH8gMCgTMA8=&rs=AOn4CLDqOaFbpTDgBFW8xQ9-eKTHkabhjA
https://i.ytimg.com/vi/1T3u9jjXcGM/maxresdefault.jpg?sqp=-oaymwEmCIAKENAF8quKqQMa8AEB-AG-B4AC0AWKAgwIABABGHIgVig-MA8=&rs=AOn4CLAZc_9BWYzjlZe4SAEnSYoMa6Jfgg
https://i.ytimg.com/vi/-KpLmsAR23I/maxresdefault.jpg?sqp=-oaymwEmCIAKENAF8quKqQMa8AEB-AH-CYAC0AWKAgwIABABGHIgTyg-MA8=&rs=AOn4CLDr-FmDmP3aCsD84l48ygBmkwHg-g
https://i.ytimg.com/vi/fCAhUkOePf4/mqdefault.jpg?sqp=-oaymwEmCMACELQB8quKqQMa8AEB-AG2BIACwAKKAgwIABABGH8gMCgTMA8=&rs=AOn4CLC-m8tFGt-eBQK-6dknusK6ah0U2w
https://i.ytimg.com/vi/Q6R71SRnGwY/maxresdefault.jpg?sqp=-oaymwEmCIAKENAF8quKqQMa8AEB-AH-CYAC0AWKAgwIABABGHIgYChAMA8=&rs=AOn4CLBSWbfLzJgeRNI6KWqgMieqd3wumw
https://i.ytimg.com/vi/1T3u9jjXcGM/mqdefault.jpg?sqp=-oaymwEmCMACELQB8quKqQMa8AEB-AG-B4AC0AWKAgwIABABGHIgVig-MA8=&rs=AOn4CLCSrEzSzni7p_UdF9uieKu1XDl_fQ

Thumbnail/placeholder images present inside the ZIM for this domain are in accordance with WARC content:

i.ytimg.com/vi/-KpLmsAR23I/maxresdefault.jpg?sqp=-oaymwEmCIAKENAF8quKqQMa8AEB-AH-CYAC0AWKAgwIABABGHIgTyg-MA8=&rs=AOn4CLDr-FmDmP3aCsD84l48ygBmkwHg-g
i.ytimg.com/vi/1T3u9jjXcGM/maxresdefault.jpg?sqp=-oaymwEmCIAKENAF8quKqQMa8AEB-AG-B4AC0AWKAgwIABABGHIgVig-MA8=&rs=AOn4CLAZc_9BWYzjlZe4SAEnSYoMa6Jfgg
i.ytimg.com/vi/1T3u9jjXcGM/mqdefault.jpg?sqp=-oaymwEmCMACELQB8quKqQMa8AEB-AG-B4AC0AWKAgwIABABGHIgVig-MA8=&rs=AOn4CLCSrEzSzni7p_UdF9uieKu1XDl_fQ
i.ytimg.com/vi/3CAmD8BAqG0/hqdefault.jpg?sqp=-oaymwEmCOADEOgC8quKqQMa8AEB-AHUBoAC4AOKAgwIABABGHIgQyg5MA8=&rs=AOn4CLBPP2EMoCMYfmxlew67Rb4WEd4cLA
i.ytimg.com/vi/Q6R71SRnGwY/maxresdefault.jpg?sqp=-oaymwEmCIAKENAF8quKqQMa8AEB-AH-CYAC0AWKAgwIABABGHIgYChAMA8=&rs=AOn4CLBSWbfLzJgeRNI6KWqgMieqd3wumw
i.ytimg.com/vi/fCAhUkOePf4/hqdefault.jpg?sqp=-oaymwEmCOADEOgC8quKqQMa8AEB-AG2BIACwAKKAgwIABABGH8gMCgTMA8=&rs=AOn4CLDqOaFbpTDgBFW8xQ9-eKTHkabhjA
i.ytimg.com/vi/fCAhUkOePf4/mqdefault.jpg?sqp=-oaymwEmCMACELQB8quKqQMa8AEB-AG2BIACwAKKAgwIABABGH8gMCgTMA8=&rs=AOn4CLC-m8tFGt-eBQK-6dknusK6ah0U2w

The important parts seem to be the path segment after /vi/ (-KpLmsAR23I, 1T3u9jjXcGM, 3CAmD8BAqG0, ...), which is the video ID, and the file name (mqdefault.jpg, hqdefault.jpg, maxresdefault.jpg). The rest (the query string) is mostly noise / trackers / I don't know.
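To make the URL structure described above concrete, here is a minimal sketch that pulls the video ID and file name out of one of these URLs using only the Python standard library. The function name `parse_thumbnail_url` is hypothetical and is not part of warc2zim; the path shape `vi/<video_id>/<file_name>` is an assumption based solely on the URLs listed in this comment.

```python
from urllib.parse import urlsplit


def parse_thumbnail_url(url: str) -> tuple[str, str]:
    """Return (video_id, file_name) for an i.ytimg.com thumbnail URL.

    Hypothetical helper for illustration only, not warc2zim code.
    """
    # Accept both full URLs (WARC records) and scheme-less paths (ZIM entries).
    if "://" not in url:
        url = "https://" + url
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Assumed path shape, taken from the URLs above: vi/<video_id>/<file_name>
    if parts.netloc != "i.ytimg.com" or len(segments) != 3 or segments[0] != "vi":
        raise ValueError(f"not a recognised thumbnail URL: {url}")
    # The query string (sqp=..., rs=...) is deliberately ignored.
    return segments[1], segments[2]


if __name__ == "__main__":
    print(parse_thumbnail_url(
        "i.ytimg.com/vi/-KpLmsAR23I/maxresdefault.jpg?sqp=abc&rs=def"
    ))
```

The query string is dropped entirely, which matches the observation that only the video ID and the file name appear to matter.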

The problem arises because, for some videos, the YouTube player decides to load a different resolution than the one captured in the WARC.

Adding a fuzzy rule is going to be the solution, but I think it will be limited by current scraper capabilities, i.e. it won't be possible to say "prefer maxresdefault.jpg if present in the WARC, else use hqdefault.jpg, else use the ...".

The only solution seems to be to use whichever image comes first in the WARC for a given video:

The first image to arrive (which is the one selected for inclusion in the ZIM; later ones are ignored) might be the highest-resolution image or a lower-resolution one; we don't know. But at least we will have an image!
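The "first image wins" idea can be sketched as follows. This is only an illustration under stated assumptions, not the rule actually added to warc2zim: the regular expression, the `thumbnail.<ext>` canonical name, and the function names `fuzzy_key` and `select_first_thumbnails` are all hypothetical.

```python
import re

# Hypothetical pattern: captures the video ID and the file extension,
# and discards the resolution-specific file name and the query string.
THUMBNAIL_RE = re.compile(
    r"^(?:https?://)?i\.ytimg\.com/vi/([^/]+)/[^/?]+\.(\w+)(?:\?.*)?$"
)


def fuzzy_key(url: str) -> str:
    """Map every thumbnail URL of a given video to one canonical key."""
    match = THUMBNAIL_RE.match(url)
    if match is None:
        return url  # not a thumbnail URL: leave untouched
    video_id, extension = match.groups()
    return f"i.ytimg.com/vi/{video_id}/thumbnail.{extension}"


def select_first_thumbnails(warc_urls: list[str]) -> dict[str, str]:
    """Keep, for each canonical key, the first URL seen in WARC order."""
    selected: dict[str, str] = {}
    for url in warc_urls:
        # setdefault only stores the URL if the key is not present yet,
        # so later resolutions of the same video are ignored.
        selected.setdefault(fuzzy_key(url), url)
    return selected
```

At replay time, whatever resolution the player requests would be reduced to the same key with `fuzzy_key`, so it resolves to the single stored image. The cost of this design is the one noted above: there is no control over which resolution ends up being stored.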

benoit74 commented 4 months ago

Fixed by https://github.com/openzim/warc2zim/pull/274