openzim / ted

Provide the best of TED.com for offline usage!
https://download.kiwix.org/zim/ted/
GNU General Public License v3.0

Cache the titles, descriptions and subtitles #200

Open benoit74 opened 1 month ago

benoit74 commented 1 month ago

Video titles, descriptions and subtitles are not yet cached on S3.

They are, however, not expected to change much and are rather time-consuming to fetch (especially when the video has been translated into tens of languages).

Titles and descriptions require fetching the HTML page of the video for every language and parsing it with BeautifulSoup to extract them.

Subtitles have to be converted to the proper format.

We should cache them and only refresh them when someone complains or once in a while, especially if we want to keep updating the ZIM on a very regular basis to fetch the few new videos that have been published.
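
As a rough illustration of the idea (not the scraper's actual code), per-language metadata could be cached under a `(talk, language)` key and only refetched on explicit request. Everything named here is hypothetical: `MetadataCache`, the `fetch` callable, and the in-memory dict that stands in for S3.

```python
from typing import Callable, Dict, Tuple

# Hypothetical shape of the cached data, e.g. {"title": ..., "description": ...}
Metadata = Dict[str, str]


class MetadataCache:
    """Cache talk metadata per (talk slug, language).

    The dict is only a stand-in for a real store such as S3.
    """

    def __init__(self, fetch: Callable[[str, str], Metadata]) -> None:
        self._fetch = fetch  # the slow path: download and parse the TED page
        self._store: Dict[Tuple[str, str], Metadata] = {}

    def get(self, talk_slug: str, lang: str, refresh: bool = False) -> Metadata:
        key = (talk_slug, lang)
        # Only hit TED when the entry is missing or a refresh is requested.
        if refresh or key not in self._store:
            self._store[key] = self._fetch(talk_slug, lang)
        return self._store[key]
```

With this shape, a regular run would call `get(slug, lang)` and pay the fetch cost only for new talks, while a "someone complained" run would pass `refresh=True` for the affected talk.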

kelson42 commented 1 week ago

> Titles and descriptions require fetching the HTML page of the video for every language and parsing it with BeautifulSoup to extract them.

How is that a problem? How measurable is that? I'm not in favour of upstream synchronisation based on time delays. ETag-based solutions should be used.

benoit74 commented 1 week ago

As mentioned, the first-order problem is that it is time-consuming to fetch (especially when the video has been translated into tens of languages).

I don't have measurements to share yet: we are currently re-encoding all the videos, so re-encoding accounts for most of the task duration. Once re-encoding is complete, most tasks will just download videos from the cache. I will share measurements once they are available.

ETags are indeed available. I'm not sure how well they work, but they should be OK; see https://www.ted.com/talks/oral_mcguire_how_to_live_with_fire?delay=5s&subtitle=en&trigger=30s
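
For reference, ETag-based revalidation could look roughly like the sketch below: send the stored ETag in `If-None-Match` and reuse the cached page when the server answers `304 Not Modified`. This is only a sketch under assumptions. `CachedPage`, `fetch_with_etag` and `urllib_get` are hypothetical names, and whether TED's ETags are stable enough for this has not been verified.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# (status code, lower-cased response headers, body)
Response = Tuple[int, Dict[str, str], bytes]


@dataclass
class CachedPage:
    etag: str
    body: bytes


def urllib_get(url: str, headers: Dict[str, str]) -> Response:
    """Plain stdlib GET; urllib reports 304 as an HTTPError, so map it back."""
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            hdrs = {k.lower(): v for k, v in resp.headers.items()}
            return resp.status, hdrs, resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return 304, {k.lower(): v for k, v in err.headers.items()}, b""
        raise


def fetch_with_etag(
    url: str,
    cached: Optional[CachedPage],
    http_get: Callable[[str, Dict[str, str]], Response] = urllib_get,
) -> Tuple[CachedPage, bool]:
    """Return (page, changed); `changed` is False when the cache was reused."""
    headers = {"If-None-Match": cached.etag} if cached and cached.etag else {}
    status, resp_headers, body = http_get(url, headers)
    if status == 304 and cached is not None:
        return cached, False  # unchanged upstream, nothing to re-parse
    return CachedPage(etag=resp_headers.get("etag", ""), body=body), True
```

A conditional request still costs one round trip per page and language, so this saves parsing and transfer time rather than the number of requests.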

benoit74 commented 1 week ago

For instance on https://farm.openzim.org/pipeline/3241d2f3-c4d9-489d-98dc-67820f39e6c0/debug, these are the stats (all images and reencoded videos are already in S3 cache):

- Download video infos from TED website: 16 mins
- Download images from cache: 1 min
- Download videos from cache: 13 mins
- Build the ZIM: few secs

So we spend more time downloading info from TED than downloading videos from cache.

benoit74 commented 1 week ago

(In the mentioned task, we ended up with 23 videos in the ZIM.)
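
To put the numbers above in proportion, here is a quick back-of-the-envelope calculation. It uses only the figures from this task, ignores the "few secs" ZIM build, and assumes the 16 minutes were spent on the 23 videos that ended up in the ZIM, which may not be exactly true.

```python
# Step durations in minutes, as reported for this task.
timings_min = {
    "video infos from TED": 16,
    "images from cache": 1,
    "videos from cache": 13,
}
videos_in_zim = 23

total_min = sum(timings_min.values())
info_share = timings_min["video infos from TED"] / total_min
seconds_per_video = timings_min["video infos from TED"] * 60 / videos_in_zim

print(f"Fetching infos from TED: {info_share:.0%} of the measured time")
print(f"Roughly {seconds_per_video:.0f} s of info fetching per video")
```

So slightly more than half of the measured time goes to fetching infos from TED, at about 42 seconds per video under the assumption above.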