openzim / youtube

Create a ZIM file from a Youtube channel/username/playlist
GNU General Public License v3.0

Cache subtitles on S3 as well #277

Open benoit74 opened 1 month ago

benoit74 commented 1 month ago

Currently only video thumbnails and the videos themselves are cached on S3.

This has the drawback that when an IP has been blacklisted from yt-dlp usage, the recipe fails to produce the ZIM even if all API calls have succeeded, because we use yt-dlp to download the subtitles.

Caching the subtitles on S3 as well would allow the ZIM to be created in that situation.

kelson42 commented 1 month ago

Seems like a good idea, but are subtitles served properly with ETags?

dan-niles commented 1 month ago

Currently we're using yt-dlp to download subtitles and etags are not provided for subtitles. The response is in the following format:

"requested_subtitles": {
  "en": {
      "ext": "vtt",
      "url": "https://www.youtube.com/api/timedtext?v=DYvYGQHYScc&ei=rzKqZouKCqfWz7sPiu_E2Qw&caps=asr&opi=112496729&xoaf=5&hl=en&ip=0.0.0.0&ipbits=0&expire=1722455327&sparams=ip%2Cipbits%2Cexpire%2Cv%2Cei%2Ccaps%2Copi%2Cxoaf&signature=D55586A99B8028F2565AFE1F76F3F55D8BE2ECA6.E032AF517474302C806EE8A02C6CDC914CD903B9&key=yt8&lang=en&fmt=vtt",
      "name": "English"
  }
},

However the YouTube Data API (https://developers.google.com/youtube/v3/docs/captions#resource-representation) does provide etags for captions.
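
For reference, here is a minimal sketch of how the subtitle entries could be pulled out of a yt-dlp info dict shaped like the fragment above. The helper name `list_requested_subtitles` and the sample values are hypothetical, used only for illustration; note that nothing resembling an ETag is available in this structure.

```python
def list_requested_subtitles(info: dict) -> list[tuple[str, str, str]]:
    """Hypothetical helper: return (language, extension, url) for each subtitle.

    Assumes `info` contains a "requested_subtitles" mapping shaped like the
    fragment shown above; a missing or empty value yields an empty list.
    """
    subtitles = info.get("requested_subtitles") or {}
    return [
        (lang, details.get("ext", ""), details.get("url", ""))
        for lang, details in subtitles.items()
    ]


# Made-up sample mirroring the structure of the response (URL shortened).
sample_info = {
    "requested_subtitles": {
        "en": {
            "ext": "vtt",
            "url": "https://www.youtube.com/api/timedtext?v=VIDEO_ID&lang=en&fmt=vtt",
            "name": "English",
        }
    }
}

for lang, ext, url in list_requested_subtitles(sample_info):
    print(lang, ext, url)
```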

dan-niles commented 1 month ago

@benoit74 and I discussed the possibility of hashing the url of each subtitle provided by yt-dlp and using it as an etag. However, it seems that this URL changes every time it is fetched by yt-dlp.

I tried manually editing the subtitles of this video on the openZIM_testing YouTube channel to observe how the URL is affected. However, it appears that YouTube fetches the latest subtitles internally, and the query parameters in the URL don't seem to have an impact.
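
To illustrate why hashing the URL does not work as an ETag, here is a minimal sketch. The helper `url_etag` and both URLs are made up for illustration; which query parameters actually change between fetches is an assumption here (the `expire` and `signature` values are only plausible candidates).

```python
import hashlib


def url_etag(url: str) -> str:
    """Hypothetical helper: derive a pseudo-ETag by hashing the subtitle URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


# Two made-up URLs for the same video and language, as two successive
# yt-dlp fetches might return them. Only expire/signature differ (assumption).
first_fetch = (
    "https://www.youtube.com/api/timedtext"
    "?v=VIDEO_ID&expire=1722455327&signature=AAAA&lang=en&fmt=vtt"
)
second_fetch = (
    "https://www.youtube.com/api/timedtext"
    "?v=VIDEO_ID&expire=1722458927&signature=BBBB&lang=en&fmt=vtt"
)

# The subtitle content may be identical, yet the pseudo-ETags differ,
# so every run would look like a cache miss.
etags_match = url_etag(first_fetch) == url_etag(second_fetch)
print(etags_match)
```

Because the hash follows the URL rather than the subtitle content, it can neither confirm that a cached copy is still valid nor detect that the subtitles were edited.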