Cache subtitles on S3 as well

benoit74 commented 3 months ago

Currently, only video thumbnails and the videos themselves are cached on S3.

This has the drawback that when an IP has been blacklisted from yt-dlp usage, the recipe fails to produce the ZIM even if all API calls have succeeded, because we use yt-dlp to download the subtitles.

Caching the subtitles on S3 would make it possible to create the ZIM in that situation.

kelson42 commented 3 months ago

Seems like a good idea, but are subtitles served properly with etags?

dan-niles commented 3 months ago

Currently we're using yt-dlp to download subtitles, and it does not provide etags for them. The relevant part of its response has the following format:

"requested_subtitles": {
  "en": {
      "ext": "vtt",
      "url": "https://www.youtube.com/api/timedtext?v=DYvYGQHYScc&ei=rzKqZouKCqfWz7sPiu_E2Qw&caps=asr&opi=112496729&xoaf=5&hl=en&ip=0.0.0.0&ipbits=0&expire=1722455327&sparams=ip%2Cipbits%2Cexpire%2Cv%2Cei%2Ccaps%2Copi%2Cxoaf&signature=D55586A99B8028F2565AFE1F76F3F55D8BE2ECA6.E032AF517474302C806EE8A02C6CDC914CD903B9&key=yt8&lang=en&fmt=vtt",
      "name": "English"
  }
},

However, the YouTube Data API (https://developers.google.com/youtube/v3/docs/captions#resource-representation) does provide etags for captions.
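For illustration, here is a minimal sketch of how those etags could be read from an already-parsed `captions.list` response body. The helper name `caption_etags` and all values in the sample are placeholders I made up; only the `etag`, `id` and `snippet.language` fields come from the caption resource representation linked above. It makes no network call and is not what the scraper currently does.

```python
from typing import Any


def caption_etags(list_response: dict[str, Any]) -> dict[str, tuple[str, str]]:
    """Map caption track id -> (language, etag).

    `list_response` is assumed to be the JSON body of a captions.list call,
    already decoded into a dict.
    """
    result: dict[str, tuple[str, str]] = {}
    for item in list_response.get("items", []):
        language = item.get("snippet", {}).get("language", "")
        result[item["id"]] = (language, item["etag"])
    return result


# Placeholder response: ids and etags are invented for the example.
sample_response = {
    "kind": "youtube#captionListResponse",
    "etag": "list-etag-placeholder",
    "items": [
        {
            "kind": "youtube#caption",
            "etag": "etag-en-placeholder",
            "id": "track-en-placeholder",
            "snippet": {"videoId": "DYvYGQHYScc", "language": "en"},
        }
    ],
}

print(caption_etags(sample_response))
```

The per-track etag could then be compared with a value stored next to the cached subtitle to decide whether it needs to be downloaded again.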

dan-niles commented 3 months ago

@benoit74 and I discussed the possibility of hashing the url of each subtitle provided by yt-dlp and using it as an etag. However, it seems that this URL changes every time it is fetched by yt-dlp.
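A small sketch of why the URL hash is unstable. I am assuming here that at least the `expire` and `signature` query parameters differ between two fetches; the thread only establishes that the URL changes, not which parameters do. The helper names and the second set of values are hypothetical.

```python
import hashlib
from urllib.parse import parse_qs, urlencode, urlsplit


def url_etag(url: str) -> str:
    """Candidate etag: SHA-256 of the full subtitle URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def timedtext_url(expire: str, signature: str) -> str:
    """Build a shortened timedtext URL; parameter values are illustrative."""
    query = urlencode(
        {
            "v": "DYvYGQHYScc",
            "lang": "en",
            "fmt": "vtt",
            "expire": expire,
            "signature": signature,
        }
    )
    return f"https://www.youtube.com/api/timedtext?{query}"


def changed_params(url_a: str, url_b: str) -> list[str]:
    """Names of query parameters whose values differ between two URLs."""
    query_a = parse_qs(urlsplit(url_a).query)
    query_b = parse_qs(urlsplit(url_b).query)
    return sorted(k for k in query_a.keys() | query_b.keys() if query_a.get(k) != query_b.get(k))


# Two fetches of the same subtitle track (second set of values is made up).
first = timedtext_url("1722455327", "PLACEHOLDER-A")
second = timedtext_url("1722458927", "PLACEHOLDER-B")

print(url_etag(first) == url_etag(second))  # False
print(changed_params(first, second))  # ['expire', 'signature']
```

Since the hash differs even though the subtitle content is the same, it cannot serve as a cache validator.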

I tried manually editing the subtitles of this video on the openZIM_testing YouTube channel to observe how the URL is affected. However, it appears that YouTube fetches the latest subtitles internally, and the query parameters in the URL don't seem to have an impact.

openzim / youtube

Cache subtitles on S3 as well #277