kelson42 closed this issue 4 months ago
The impact is wider than only FF on iOS 17.5: it is just broken.
I ran a new crawl and conversion again, and this time it is playing on FF on macOS 12.7.4; we just miss the placeholder image on a few YouTube videos (due to what looks like obscure YouTube player decisions regarding which image resolution should be used).
I did a small test run of zimit/warc2zim on only the homepage of mesquartierschinois.
First thing, https://replayweb.page/ suffers from the same issue when displaying the video thumbnail / placeholder, so it is not a "by-design" bug in warc2zim, but rather a problem of the YouTube player's dynamic behavior.
Analyzing the YouTube player's behavior a bit, it looks like all thumbnails/placeholders for videos are served from the i.ytimg.com domain.
WARC records from this domain present in the test WARC are:
https://i.ytimg.com/vi/3CAmD8BAqG0/hqdefault.jpg?sqp=-oaymwEmCOADEOgC8quKqQMa8AEB-AHUBoAC4AOKAgwIABABGHIgQyg5MA8=&rs=AOn4CLBPP2EMoCMYfmxlew67Rb4WEd4cLA
https://i.ytimg.com/vi/fCAhUkOePf4/hqdefault.jpg?sqp=-oaymwEmCOADEOgC8quKqQMa8AEB-AG2BIACwAKKAgwIABABGH8gMCgTMA8=&rs=AOn4CLDqOaFbpTDgBFW8xQ9-eKTHkabhjA
https://i.ytimg.com/vi/1T3u9jjXcGM/maxresdefault.jpg?sqp=-oaymwEmCIAKENAF8quKqQMa8AEB-AG-B4AC0AWKAgwIABABGHIgVig-MA8=&rs=AOn4CLAZc_9BWYzjlZe4SAEnSYoMa6Jfgg
https://i.ytimg.com/vi/-KpLmsAR23I/maxresdefault.jpg?sqp=-oaymwEmCIAKENAF8quKqQMa8AEB-AH-CYAC0AWKAgwIABABGHIgTyg-MA8=&rs=AOn4CLDr-FmDmP3aCsD84l48ygBmkwHg-g
https://i.ytimg.com/vi/fCAhUkOePf4/mqdefault.jpg?sqp=-oaymwEmCMACELQB8quKqQMa8AEB-AG2BIACwAKKAgwIABABGH8gMCgTMA8=&rs=AOn4CLC-m8tFGt-eBQK-6dknusK6ah0U2w
https://i.ytimg.com/vi/Q6R71SRnGwY/maxresdefault.jpg?sqp=-oaymwEmCIAKENAF8quKqQMa8AEB-AH-CYAC0AWKAgwIABABGHIgYChAMA8=&rs=AOn4CLBSWbfLzJgeRNI6KWqgMieqd3wumw
https://i.ytimg.com/vi/1T3u9jjXcGM/mqdefault.jpg?sqp=-oaymwEmCMACELQB8quKqQMa8AEB-AG-B4AC0AWKAgwIABABGHIgVig-MA8=&rs=AOn4CLCSrEzSzni7p_UdF9uieKu1XDl_fQ
Thumbnail/placeholder images present inside the ZIM for this domain are in accordance with WARC content:
i.ytimg.com/vi/-KpLmsAR23I/maxresdefault.jpg?sqp=-oaymwEmCIAKENAF8quKqQMa8AEB-AH-CYAC0AWKAgwIABABGHIgTyg-MA8=&rs=AOn4CLDr-FmDmP3aCsD84l48ygBmkwHg-g
i.ytimg.com/vi/1T3u9jjXcGM/maxresdefault.jpg?sqp=-oaymwEmCIAKENAF8quKqQMa8AEB-AG-B4AC0AWKAgwIABABGHIgVig-MA8=&rs=AOn4CLAZc_9BWYzjlZe4SAEnSYoMa6Jfgg
i.ytimg.com/vi/1T3u9jjXcGM/mqdefault.jpg?sqp=-oaymwEmCMACELQB8quKqQMa8AEB-AG-B4AC0AWKAgwIABABGHIgVig-MA8=&rs=AOn4CLCSrEzSzni7p_UdF9uieKu1XDl_fQ
i.ytimg.com/vi/3CAmD8BAqG0/hqdefault.jpg?sqp=-oaymwEmCOADEOgC8quKqQMa8AEB-AHUBoAC4AOKAgwIABABGHIgQyg5MA8=&rs=AOn4CLBPP2EMoCMYfmxlew67Rb4WEd4cLA
i.ytimg.com/vi/Q6R71SRnGwY/maxresdefault.jpg?sqp=-oaymwEmCIAKENAF8quKqQMa8AEB-AH-CYAC0AWKAgwIABABGHIgYChAMA8=&rs=AOn4CLBSWbfLzJgeRNI6KWqgMieqd3wumw
i.ytimg.com/vi/fCAhUkOePf4/hqdefault.jpg?sqp=-oaymwEmCOADEOgC8quKqQMa8AEB-AG2BIACwAKKAgwIABABGH8gMCgTMA8=&rs=AOn4CLDqOaFbpTDgBFW8xQ9-eKTHkabhjA
i.ytimg.com/vi/fCAhUkOePf4/mqdefault.jpg?sqp=-oaymwEmCMACELQB8quKqQMa8AEB-AG2BIACwAKKAgwIABABGH8gMCgTMA8=&rs=AOn4CLC-m8tFGt-eBQK-6dknusK6ah0U2w
The important parts seem to be the end of the path: the video ID (-KpLmsAR23I, 1T3u9jjXcGM, 3CAmD8BAqG0, ...) and the file name (mqdefault.jpg, hqdefault.jpg, maxresdefault.jpg), the rest being mostly only noise / trackers / I don't know.
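To make the split between video ID and file name concrete, here is a minimal sketch in Python that pulls both out of one of the URLs listed above and drops the query string. The helper name `parse_thumbnail_url` is hypothetical and only for illustration; it is not part of warc2zim.

```python
from urllib.parse import urlsplit


def parse_thumbnail_url(url: str) -> tuple[str, str]:
    """Hypothetical helper: return (video_id, file_name) for an i.ytimg.com URL."""
    # urlsplit separates the path from the query string (sqp=..., rs=...),
    # so the "noise" part is simply not looked at.
    path = urlsplit(url).path
    parts = path.strip("/").split("/")
    # Expected shape of the path: vi/<video_id>/<file_name>
    if len(parts) != 3 or parts[0] != "vi":
        raise ValueError(f"unexpected thumbnail path: {path}")
    return parts[1], parts[2]


# URL copied from the WARC record list above
url = (
    "https://i.ytimg.com/vi/3CAmD8BAqG0/hqdefault.jpg"
    "?sqp=-oaymwEmCOADEOgC8quKqQMa8AEB-AHUBoAC4AOKAgwIABABGHIgQyg5MA8="
    "&rs=AOn4CLBPP2EMoCMYfmxlew67Rb4WEd4cLA"
)
video_id, file_name = parse_thumbnail_url(url)
print(video_id, file_name)
```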
The problem arises because, for some videos, the YouTube player decides to load a different resolution:
1T3u9jjXcGM video: it tries to load sddefault.jpg => missing in ZIM
fCAhUkOePf4 video: it tries to load hqdefault.jpg => present in ZIM
3CAmD8BAqG0 video: it tries to load hqdefault.jpg => present in ZIM
Q6R71SRnGwY video: it tries to load sddefault.jpg => missing in ZIM
-KpLmsAR23I video: it tries to load sddefault.jpg => missing in ZIM
Adding a fuzzy rule is going to be the solution, but I think it will be limited to current scraper capabilities, i.e. it won't be possible to say "prefer to use the maxresdefault.jpg if present in WARC, else use the hqdefault.jpg, else use the ...".
The only solution seems to be to use whatever image comes first in the WARC for a given video:
i\.ytimg\.com\/vi\/(.*?)\/(.*?)\.jpg.*
i.ytimg.com.fuzzy.replayweb.page/vi/\1/thumbnail.jpg
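As a rough illustration of what this rule does, the sketch below applies the pattern and replacement with Python's `re` module. Using `re.sub` here is an assumption made for the example; how the scraper applies the rule internally is not shown in this thread. The helper name `fuzzy_key` is hypothetical.

```python
import re

# Pattern and replacement copied from the rule above.
PATTERN = re.compile(r"i\.ytimg\.com\/vi\/(.*?)\/(.*?)\.jpg.*")
REPLACEMENT = r"i.ytimg.com.fuzzy.replayweb.page/vi/\1/thumbnail.jpg"


def fuzzy_key(path: str) -> str:
    """Hypothetical helper: rewrite a thumbnail path to its fuzzy key."""
    return PATTERN.sub(REPLACEMENT, path)


# Entry present in the ZIM (copied from the listing above)
archived = (
    "i.ytimg.com/vi/1T3u9jjXcGM/maxresdefault.jpg"
    "?sqp=-oaymwEmCIAKENAF8quKqQMa8AEB-AG-B4AC0AWKAgwIABABGHIgVig-MA8="
    "&rs=AOn4CLAZc_9BWYzjlZe4SAEnSYoMa6Jfgg"
)
# What the player asks for at replay time for the same video
requested = "i.ytimg.com/vi/1T3u9jjXcGM/sddefault.jpg"

# Both collapse to one key per video ID; file name and query string are dropped.
print(fuzzy_key(archived))
print(fuzzy_key(requested))
```

With both paths reduced to the same key, a request for sddefault.jpg can be answered with whichever image was archived for that video.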
The first image to come (and hence the one selected for inclusion in the ZIM; later ones will be ignored) might be the highest-resolution image or a lower-resolution one; we don't know. But at least we will have an image!
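The "first one wins" behavior can be sketched as follows. This is a simplified model, not the scraper's actual code: the function name `select_first` is hypothetical, the query strings are omitted for brevity, and the input order is assumed to be the order in which the records appear in the list above.

```python
import re

# Same rule as above, repeated so that this sketch is self-contained.
PATTERN = re.compile(r"i\.ytimg\.com\/vi\/(.*?)\/(.*?)\.jpg.*")
REPLACEMENT = r"i.ytimg.com.fuzzy.replayweb.page/vi/\1/thumbnail.jpg"


def select_first(paths: list[str]) -> dict[str, str]:
    """Hypothetical helper: keep only the first path seen for each fuzzy key."""
    kept: dict[str, str] = {}
    for path in paths:
        key = PATTERN.sub(REPLACEMENT, path)
        # setdefault stores the path only if the key is new,
        # so later images for the same video are ignored.
        kept.setdefault(key, path)
    return kept


# Assumed record order, query strings omitted for brevity
records = [
    "i.ytimg.com/vi/fCAhUkOePf4/hqdefault.jpg",
    "i.ytimg.com/vi/1T3u9jjXcGM/maxresdefault.jpg",
    "i.ytimg.com/vi/fCAhUkOePf4/mqdefault.jpg",
    "i.ytimg.com/vi/1T3u9jjXcGM/mqdefault.jpg",
]
for key, path in select_first(records).items():
    print(key, "<-", path)
```

Which resolution ends up in the ZIM depends entirely on record order, which matches the caveat above.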
At https://dev.library.kiwix.org/viewer#mes-quartiers-chinois_fr_all_2024-05/mesquartierschinois.wordpress.com/ the video poster is not displayed (black video) and playing the video does not work.