Missing transcripts - Githubissues

xenova commented 2 years ago

When fetching transcripts for https://www.youtube.com/watch?v=gdsUKphmB3Y, I only get a subset of the available transcripts.

Using library:

>>> from youtube_transcript_api import YouTubeTranscriptApi
>>> video_id = 'gdsUKphmB3Y'
>>> transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
>>> for t in transcript_list:
...     print(t)
... 
en ("English - DTVCC1")[TRANSLATABLE]
rm ("Romansh - DTVCC3")
en ("English (auto-generated)")[TRANSLATABLE]

On YouTube:

jdepoix commented 2 years ago

Hi @xenova, thanks for reporting this. The problem here seems to be that both English - CC1 and English - DTVCC1 use the language code en (same for the Romansh). The TranscriptList object holds the transcripts in a dict where the language code is the key. Therefore, there can only be one transcript per language code in that dict. I wasn't aware that multiple transcripts using the same language code is a thing 😱
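The collision described above can be reproduced with a plain dict. This is only a simplified sketch, not the library's actual code: the `available` list and the `by_language` name are made up for illustration, and the transcript names are the ones mentioned in this thread.

```python
# Hypothetical (language_code, name) pairs, using names quoted in this thread.
available = [
    ("en", "English - CC1"),
    ("en", "English - DTVCC1"),
    ("rm", "Romansh - DTVCC3"),
]

by_language = {}
for code, name in available:
    # Assigning to an existing key silently replaces the earlier entry.
    by_language[code] = name

print(by_language)
# {'en': 'English - DTVCC1', 'rm': 'Romansh - DTVCC3'}
```

Three transcripts go in, but only two survive, because both English entries share the key `en`.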

I am afraid this can't be fixed without introducing breaking changes, as we apparently can no longer consider the language code a reliable identifier. Have you encountered multiple instances of this happening? How big of a problem is this? 🤔

xenova commented 2 years ago

Oh wow that's quite surprising!

I have downloaded more than 1 million transcripts for an ML project I'm working on (https://www.github.com/xenova/sponsorblock-ml) and only ran into this problem once, so it is most likely not that big of an issue.

jdepoix commented 2 years ago

Thanks for reporting back! It's good to know that this isn't too much of an issue. I have never encountered it myself, although I have scraped quite a few transcripts.

I might just leave this as is. To fix this, we would have to return a list from all calls that currently retrieve a single transcript, to account for the unlikely event that multiple transcripts are returned. This would require quite a bit of rewriting and would most likely break a lot of code depending on this module. The other option is to retrieve transcripts for a given language using its vssId. However, this seems far more impractical, as it would require the user (of this module) to first find out the vssId of the language they are looking for.

I guess the only practical option is adding the vssId as an optional param to fetch, or a separate fetchByVssId method, which would at least provide a workaround in case you encounter this issue. This still requires a bit of rewriting, as the TranscriptList can no longer use dicts internally. The fetch method could then throw an exception when it is asked to retrieve a transcript for a language code it has multiple transcripts for, to let the user know that vssId must be used here.
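A rough sketch of how that lookup could behave, assuming a list-backed container. Everything here is hypothetical and not part of the library: `TranscriptStore`, `find`, `AmbiguousLanguageCode`, the `vss_id` keyword, and the placeholder identifiers `id-1` to `id-3`, which are not real YouTube vssId values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    language_code: str
    vss_id: str  # placeholder values below, not real YouTube identifiers
    name: str


class AmbiguousLanguageCode(Exception):
    """Raised when a language code matches more than one transcript."""


class TranscriptStore:
    def __init__(self, transcripts):
        # A list instead of a dict, so duplicate language codes survive.
        self._transcripts = list(transcripts)

    def find(self, language_code, vss_id=None):
        matches = [t for t in self._transcripts if t.language_code == language_code]
        if vss_id is not None:
            # The optional identifier narrows the candidates further.
            matches = [t for t in matches if t.vss_id == vss_id]
        if not matches:
            raise KeyError(language_code)
        if len(matches) > 1:
            ids = ", ".join(t.vss_id for t in matches)
            raise AmbiguousLanguageCode(
                f"{language_code!r} matches several transcripts; pass vss_id (one of: {ids})"
            )
        return matches[0]


store = TranscriptStore([
    Transcript("en", "id-1", "English - CC1"),
    Transcript("en", "id-2", "English - DTVCC1"),
    Transcript("rm", "id-3", "Romansh - DTVCC3"),
])
```

With this shape, `store.find("rm")` keeps working as before, `store.find("en")` raises `AmbiguousLanguageCode`, and `store.find("en", vss_id="id-2")` resolves the ambiguity. Existing callers only see a change for videos that actually have duplicate language codes.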

Any thoughts on this?

xenova commented 2 years ago

Right, this is definitely a simple problem with an anything-but-simple solution.

As you mentioned, the most important thing is not to break code that depends on this module, so your second option seems quite practical.

I have seen implementations (in Django, I believe) of a "MultiDict" (or something like that) which acts just like a dictionary (allowing for indexing), but allows for duplicate keys. This is normally implemented by mapping each key to a list of values, and when indexing, you just return the first element.
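A minimal sketch of that key-to-list idea might look like the following. The class name `FirstValueMultiDict` and its methods are invented for illustration. It is not a copy of Django's class: Django's `MultiValueDict` returns the last value for a key when indexed, whereas this sketch returns the first, as described above.

```python
from collections import defaultdict


class FirstValueMultiDict:
    """Hypothetical container: each key maps to a list of values."""

    def __init__(self):
        self._values = defaultdict(list)

    def add(self, key, value):
        # Duplicate keys append to the list instead of overwriting.
        self._values[key].append(value)

    def __getitem__(self, key):
        if key not in self._values:
            raise KeyError(key)
        # Plain indexing returns the first value stored for the key.
        return self._values[key][0]

    def getlist(self, key):
        # Every value stored for the key, in insertion order.
        return list(self._values.get(key, []))


transcripts = FirstValueMultiDict()
transcripts.add("en", "English - CC1")
transcripts.add("en", "English - DTVCC1")
transcripts.add("rm", "Romansh - DTVCC3")

print(transcripts["en"])          # English - CC1
print(transcripts.getlist("en"))  # ['English - CC1', 'English - DTVCC1']
```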

Another way is to implement it with an auxiliary dictionary that maps each key to the index of its first appearance (so that you can still index normally), while still allowing you to iterate over the container if you need a specific item.

For example, you could have a multidict d = { 'a': 1, 'b': 2, 'a': 3 } such that d['a'] returns 1 and d['b'] returns 2. As mentioned above, this would be implemented by storing a list of values x = [1, 2, 3] and a dictionary of first-appearance indices y = {'a': 0, 'b': 1}, such that d['a'] = x[y['a']] = x[0] = 1.
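The list-plus-index layout could be sketched as below, using 0-based positions. The class name `IndexedMultiDict` and its methods are hypothetical. Note that a real Python dict literal with a repeated key keeps only the last value, so the pairs are passed as a list instead.

```python
class IndexedMultiDict:
    """Hypothetical container: a list of pairs plus a first-appearance index."""

    def __init__(self, pairs=()):
        self._items = []        # every (key, value) pair, in insertion order
        self._first_index = {}  # key -> position of its first appearance
        for key, value in pairs:
            self.add(key, value)

    def add(self, key, value):
        # Record the position only the first time a key is seen.
        self._first_index.setdefault(key, len(self._items))
        self._items.append((key, value))

    def __getitem__(self, key):
        # Indexing resolves to the first value stored for the key.
        return self._items[self._first_index[key]][1]

    def __iter__(self):
        # Iteration exposes all pairs, including duplicates.
        return iter(self._items)


d = IndexedMultiDict([("a", 1), ("b", 2), ("a", 3)])

print(d["a"])   # 1
print(d["b"])   # 2
print(list(d))  # [('a', 1), ('b', 2), ('a', 3)]
```

Indexing stays a constant-time dict lookup, while the second value for `'a'` remains reachable by iterating over the container.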

jdepoix / youtube-transcript-api

Missing transcripts #150