openzim / ted

Provide the best of TED.com for offline usage!
https://download.kiwix.org/zim/ted/
GNU General Public License v3.0

Subtitles have a significant time offset #177

Closed (benoit74 closed 2 months ago)

benoit74 commented 3 months ago

For some (all?) videos, the subtitles are shifted by about 4 to 5 seconds, while they are properly aligned on the TED web platform. This makes them very hard to use (or even useless when you just need an aid to better understand a language you have difficulty hearing properly).

I've checked two videos, one from YouTube and one from the TED CDN, and both are affected.

benoit74 commented 3 months ago

Subtitles are retrieved from

E.g., for https://www.ted.com/talks/matt_mills_image_recognition_that_triggers_augmented_reality, the URL the scraper considers is https://www.ted.com/talks/subtitles/id/1515/lang/fr

The URL used on the TED platform, however, is now https://hls.ted.com/project_masters/1140/subtitles/fr/full.vtt?intro_master_id=2346

The important part seems to be the query parameter: if we remove it, we get the same timings as with the scraper's URL.

The full set of subtitles seems to be available from https://hls.ted.com/project_masters/1140/metadata.json?intro_master_id=2346 and this link seems to be available in the playerData of the video page.
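To make the URL pattern concrete, here is a minimal sketch that rebuilds the two `hls.ted.com` URLs quoted above. The helper names and parameters are hypothetical (they are not part of the scraper), and the pattern is inferred from this single talk only, so it may not hold for every video.

```python
from urllib.parse import urlencode

# Base path taken from the URLs quoted in this comment.
HLS_BASE = "https://hls.ted.com/project_masters"


def _with_intro(url, intro_master_id=None):
    """Append the intro_master_id query parameter when one is given."""
    if intro_master_id is None:
        return url
    return f"{url}?{urlencode({'intro_master_id': intro_master_id})}"


def subtitles_url(project_master_id, lang, intro_master_id=None):
    """Hypothetical helper: WebVTT subtitles URL for one language."""
    url = f"{HLS_BASE}/{project_master_id}/subtitles/{lang}/full.vtt"
    return _with_intro(url, intro_master_id)


def metadata_url(project_master_id, intro_master_id=None):
    """Hypothetical helper: metadata.json URL for a project master."""
    url = f"{HLS_BASE}/{project_master_id}/metadata.json"
    return _with_intro(url, intro_master_id)


print(subtitles_url(1140, "fr", 2346))
print(metadata_url(1140, 2346))
```

Calling `subtitles_url(1140, "fr")` without the third argument gives the variant without the query parameter, which is the one that showed the shifted timings.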

This is quite a great simplification because:

Veeransh14 commented 2 months ago

@benoit74
I think for this particular issue, what changes should be made are:

  1. Extending the script for handling the adjustment of subtitles.
  2. Maybe we can define another function to adjust the subtitle timings (probably)
  3. If we do so we can integrate the subtitle adjustment into the workflow

That is as much as I can think of for now. I have some tentative code ready for this and would love to open a PR for it. Thanks!

benoit74 commented 2 months ago

@Veeransh14 I don't get at all what you want to do. Your words are very generic and do not help at all to know if you've understood what has to be done.

Please be more specific in what you intend to do or I will probably have to work on this myself, it is an urgent topic to solve asap for us.

benoit74 commented 2 months ago

I'll take care of this issue myself right now

benoit74 commented 2 months ago

It looks like reality is way simpler than the complex explanation in my previous comment regarding intros (which still have to be handled, but seem to concern only a very small portion of videos).

Looking at the code, it seems that we've almost always applied an offset of 11820 ms to subtitles:

https://github.com/openzim/ted/blob/e0ab3d5b4f627452e9980cb3e831bff68ba5efb9/src/ted2zim/utils.py#L107-L109
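For readers unfamiliar with what such an offset does, here is a minimal sketch of shifting every cue timing in a WebVTT file by a fixed number of milliseconds. It is not the code at the lines linked above; the function names are hypothetical, and the sign of the offset in the example is chosen for illustration only, since this thread does not say in which direction the 11820 ms was applied.

```python
import re

# Matches WebVTT timestamps such as 00:00:12.820 or 01:02.500 (hours optional).
TIMESTAMP_RE = re.compile(r"(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})")


def shift_timestamp(match, offset_ms):
    """Return the matched timestamp shifted by offset_ms, as hh:mm:ss.mmm."""
    hours, minutes, seconds, millis = (int(g or 0) for g in match.groups())
    total = ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis + offset_ms
    total = max(total, 0)  # assumption: clamp at zero rather than go negative
    hours, rest = divmod(total, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def shift_vtt(text, offset_ms):
    """Shift the timestamps on every cue timing line (lines containing '-->')."""
    lines = []
    for line in text.splitlines():
        if "-->" in line:
            line = TIMESTAMP_RE.sub(lambda m: shift_timestamp(m, offset_ms), line)
        lines.append(line)
    return "\n".join(lines)


sample = "WEBVTT\n\n00:00:12.820 --> 00:00:14.000\nBonjour"
print(shift_vtt(sample, -11820))
```

Only lines containing `-->` are rewritten, so cue text that happens to contain something resembling a timestamp is left untouched.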

@rgaudin do you remember anything about this magic value?

rgaudin commented 2 months ago

Now that you mention it, it rings a bell, but I'm pretty sure it was there before the refactor.

Veeransh14 commented 2 months ago

I am so sorry @benoit74, I should have been more specific and will take care of this from now on. Please do let me know if there are any other issues I can help solve; meanwhile, I will keep going through the other issues to see whether I can solve any. Thank you so much!