Open benoit74 opened 1 month ago
Titles and descriptions require fetching the HTML page of the video for every language and parsing it with BeautifulSoup to extract them.
How is that a problem? How measurable is that? I'm not in favour of upstream synchronisation based on time delays. ETag-based solutions should be used.
As mentioned, the first-order problem is that it is time-consuming to fetch (especially when the video has been translated into tens of languages).
I don't have measurements to share yet: we are currently reencoding all the videos, so reencoding makes up most of the task duration. Once reencoding is complete, most tasks will just download videos from the cache. I will share the measurements once they are available.
ETags are indeed available. I'm not sure how well they work, but they should be OK; see https://www.ted.com/talks/oral_mcguire_how_to_live_with_fire?delay=5s&subtitle=en&trigger=30s
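To make the ETag idea concrete, here is a minimal sketch of a conditional GET using only the Python standard library. It is not the scraper's actual code: `http_get`, `fetch_if_changed`, and the dict-based `cache` are hypothetical names, and it assumes the server answers `If-None-Match` with `304 Not Modified` when the page is unchanged, which has not been verified against TED here.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional, Tuple

# (status, etag, body) as returned by a transport function
Response = Tuple[int, Optional[str], bytes]
# Hypothetical in-memory cache: url -> (etag, body)
Cache = Dict[str, Tuple[Optional[str], bytes]]


def http_get(url: str, headers: Dict[str, str]) -> Response:
    """Plain GET with urllib; a 304 is returned as a value instead of raised."""
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request) as resp:
            return resp.status, resp.headers.get("ETag"), resp.read()
    except urllib.error.HTTPError as exc:
        if exc.code == 304:  # urllib raises HTTPError on 304 Not Modified
            return 304, exc.headers.get("ETag"), b""
        raise


def fetch_if_changed(
    url: str,
    cache: Cache,
    get: Callable[[str, Dict[str, str]], Response] = http_get,
) -> Tuple[bytes, bool]:
    """Return (body, refreshed); reuse the cached body when the server says 304."""
    cached = cache.get(url)
    # Only send If-None-Match when we hold an ETag from a previous response.
    headers = {"If-None-Match": cached[0]} if cached and cached[0] else {}
    status, etag, body = get(url, headers)
    if status == 304 and cached:
        return cached[1], False
    cache[url] = (etag, body)
    return body, True
```

The transport is passed in as `get` so the caching logic can be exercised without network access. Note that a 304 still costs one round trip per page and per language; it only saves the transfer and the parsing.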
For instance on https://farm.openzim.org/pipeline/3241d2f3-c4d9-489d-98dc-67820f39e6c0/debug, these are the stats (all images and reencoded videos are already in S3 cache):
- Download video infos from TED website: 16 mins
- Download images from cache: 1 min
- Download videos from cache: 13 mins
- Build the ZIM: few secs
So we spend more time downloading info from TED than downloading videos from cache.
(In the task mentioned, we ended up with 23 videos in the ZIM.)
Video titles, descriptions and subtitles are not yet cached on S3.
They are however not expected to change much and are rather time-consuming to fetch (especially when the video has been translated into tens of languages).
Titles and descriptions require fetching the HTML page of the video for every language and parsing it with BeautifulSoup to extract them.
Subtitles have to be converted to the proper format.
We should cache them and only refresh them when someone complains or once in a while, especially if we want to keep updating the ZIM on a very regular basis to fetch the few new videos that have been published.
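A rough sketch of what "refresh when someone complains or once in a while" could look like, assuming metadata is keyed per video and per language. Everything here is hypothetical: `MetadataCache`, `CachedMetadata`, the `fetch` callback, and the in-memory dict stand in for whatever would actually be stored on S3.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, str]  # (video_id, language)


@dataclass
class CachedMetadata:
    title: str
    description: str
    fetched_at: float  # Unix timestamp of the last fetch


class MetadataCache:
    """Hypothetical in-memory stand-in for a per-language metadata cache."""

    def __init__(self, max_age_seconds: float) -> None:
        self.max_age_seconds = max_age_seconds
        self._entries: Dict[Key, CachedMetadata] = {}

    def get(
        self,
        video_id: str,
        language: str,
        fetch: Callable[[str, str], Tuple[str, str]],
        force_refresh: bool = False,
        now: Optional[float] = None,
    ) -> CachedMetadata:
        """Return cached metadata, calling `fetch` only when missing, stale or forced."""
        now = time.time() if now is None else now
        key = (video_id, language)
        entry = self._entries.get(key)
        stale = entry is None or now - entry.fetched_at > self.max_age_seconds
        if force_refresh or stale:
            # `fetch` is where the per-language page download and parsing would go.
            title, description = fetch(video_id, language)
            entry = CachedMetadata(title, description, now)
            self._entries[key] = entry
        return entry
```

`force_refresh=True` covers the "someone complains" case, and `max_age_seconds` covers "once in a while". The age check could be replaced by a conditional request if ETags turn out to be reliable.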