benoit74 opened 3 months ago
Seems like a good idea, but are subtitles served properly using etags?
Currently we're using `yt-dlp` to download subtitles, and etags are not provided for them. The response is in the following format:
"requested_subtitles": {
"en": {
"ext": "vtt",
"url": "https://www.youtube.com/api/timedtext?v=DYvYGQHYScc&ei=rzKqZouKCqfWz7sPiu_E2Qw&caps=asr&opi=112496729&xoaf=5&hl=en&ip=0.0.0.0&ipbits=0&expire=1722455327&sparams=ip%2Cipbits%2Cexpire%2Cv%2Cei%2Ccaps%2Copi%2Cxoaf&signature=D55586A99B8028F2565AFE1F76F3F55D8BE2ECA6.E032AF517474302C806EE8A02C6CDC914CD903B9&key=yt8&lang=en&fmt=vtt",
"name": "English"
}
},
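For context, that dict comes from `yt-dlp`'s Python API when subtitles are requested. A minimal sketch (video URL taken from the example above; options are standard `yt-dlp` ones) showing that no etag-like field is present:

```python
import yt_dlp

# Request English subtitles; extract metadata only, no download.
opts = {
    "writesubtitles": True,
    "subtitleslangs": ["en"],
}
with yt_dlp.YoutubeDL(opts) as ydl:
    info = ydl.extract_info(
        "https://www.youtube.com/watch?v=DYvYGQHYScc", download=False
    )

# requested_subtitles maps language -> {ext, url, name}; no etag field.
for lang, sub in (info.get("requested_subtitles") or {}).items():
    print(lang, sub["ext"], sub["url"])
```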
However, the YouTube Data API (https://developers.google.com/youtube/v3/docs/captions#resource-representation) does provide etags for captions.
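For comparison, a rough sketch of querying the Data API's `captions.list` endpoint with plain `requests`; note that, as far as I can tell, this endpoint requires OAuth authorization rather than a simple API key, so the token below is a placeholder:

```python
import requests

# Hypothetical OAuth 2.0 bearer token; captions.list requires OAuth authorization.
ACCESS_TOKEN = "ya29.EXAMPLE"

resp = requests.get(
    "https://www.googleapis.com/youtube/v3/captions",
    params={"part": "snippet", "videoId": "DYvYGQHYScc"},
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

# Each caption resource carries an etag we could compare against a cached copy.
for item in resp.json()["items"]:
    print(item["etag"], item["snippet"]["language"])
```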
@benoit74 and I discussed the possibility of hashing the URL of each subtitle provided by `yt-dlp` and using it as an etag. However, it seems that this URL changes every time it is fetched by `yt-dlp`.
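For illustration, the idea was roughly the following (a hypothetical helper, not code that exists in the scraper):

```python
import hashlib

def subtitle_pseudo_etag(url: str) -> str:
    """Derive a pseudo-etag by hashing the subtitle URL yt-dlp returns.

    This does not work in practice: query parameters such as expire,
    signature and ip change on every extraction, so the hash changes too.
    """
    return hashlib.sha256(url.encode("utf-8")).hexdigest()
```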
I tried manually editing the subtitles of this video on the openZIM_testing YouTube channel to observe how the URL is affected. However, it appears that YouTube fetches the latest subtitles internally, and the query parameters in the URL don't seem to have an impact on the content served.
Currently, only video thumbnails and the videos themselves are cached on S3. This has the drawback that when an IP has been blacklisted from `yt-dlp` usage, the recipe fails to produce the ZIM even if all API calls have succeeded, because we use `yt-dlp` to download the subtitles. Caching the subtitles on S3 would allow the ZIM to be created.
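As a rough sketch of what that could look like, using boto3 directly with a hypothetical bucket name and key layout (not the scraper's actual storage code):

```python
import boto3
import requests
from botocore.exceptions import ClientError

BUCKET = "youtube-cache"  # hypothetical bucket name

s3 = boto3.client("s3")

def get_subtitle(video_id: str, lang: str, fresh_url: str) -> bytes:
    """Return subtitle content, preferring the S3 cache over YouTube."""
    key = f"{video_id}/subtitle_{lang}.vtt"  # hypothetical key layout
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=key)
        return obj["Body"].read()
    except ClientError as exc:
        if exc.response["Error"]["Code"] not in ("NoSuchKey", "404"):
            raise
    # Cache miss: fetch from the volatile URL yt-dlp extracted, then cache it.
    resp = requests.get(fresh_url, timeout=30)
    resp.raise_for_status()
    s3.put_object(Bucket=BUCKET, Key=key, Body=resp.content)
    return resp.content
```

With something along these lines, a blacklisted IP would only hurt on a cache miss; videos already cached would still produce a complete ZIM.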