Open xenova opened 2 years ago
Hi @xenova, thanks for reporting this.
The problem here seems to be that both `English - CC1` and `English - DTVCC1` use the language code `en` (same for the Romansh). The `TranscriptList` object holds the transcripts in a dict where the language code is the key. Therefore, there can only be one transcript per language code in that dict. I wasn't aware that multiple transcripts using the same language code is a thing 😱
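A minimal sketch of that collision, assuming each transcript is a plain dict with hypothetical `name`, `language_code` and `vss_id` fields (the `vss_id` values are made up for illustration and are not taken from the video):

```python
# Hypothetical transcript records; field names and vss_id values are illustrative only.
transcripts = [
    {"name": "English - CC1", "language_code": "en", "vss_id": "example-cc1"},
    {"name": "English - DTVCC1", "language_code": "en", "vss_id": "example-dtvcc1"},
]

# Keying by language code: the second "en" entry silently replaces the first.
by_language = {t["language_code"]: t for t in transcripts}

print(len(transcripts))           # 2
print(len(by_language))           # 1
print(by_language["en"]["name"])  # English - DTVCC1
```

Whichever transcript is inserted last wins, which would explain seeing only a subset of what YouTube lists.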
I am afraid this can't be fixed without introducing breaking changes, as we apparently can no longer consider the language code a reliable identifier. Have you encountered multiple instances of this happening? How big of a problem is this? 🤔
Oh wow that's quite surprising!
I have downloaded more than 1 million transcripts for an ML project I'm working on (https://www.github.com/xenova/sponsorblock-ml) and only ran into this problem once, so it is most likely not that big of an issue.
Thanks for reporting back! It's good to know that this isn't too much of an issue. I have never encountered it myself, although I have scraped quite a few transcripts.
I might just leave this as is. To fix this we would have to return a `list` from all calls which currently just retrieve a single transcript, to account for the unlikely event that multiple transcripts could be returned. This would require quite a bit of rewriting and would most likely break a lot of code depending on this module. The other option is to retrieve transcripts for a given language using its `vssId`. However, this seems far less practical, as it would require the user (of this module) to first find out the `vssId` of the language they are looking for.
I guess the only practical option is adding the `vssId` as an optional param to `fetch`, or a separate `fetchByVssId` method, which would at least provide a way to work around this in case you are encountering this issue. This still requires a bit of rewriting, as the `TranscriptList` can no longer use `dict`s internally. The `fetch` method could then throw an exception when it is asked to retrieve a transcript for a language code it has multiple transcripts for, to let the user know that `vssId` must be used here.
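Roughly what I have in mind, as a sketch only: `TranscriptListSketch`, `AmbiguousLanguageCode`, the `vss_id` parameter name and the dict-shaped transcript records are all hypothetical and not the module's current API.

```python
class AmbiguousLanguageCode(Exception):
    """Raised when a language code matches more than one transcript."""


class TranscriptListSketch:
    """Hypothetical stand-in for TranscriptList, backed by a list instead of a dict."""

    def __init__(self, transcripts):
        # A list keeps every transcript, even when language codes repeat.
        self._transcripts = list(transcripts)

    def fetch(self, language_code, vss_id=None):
        matches = [t for t in self._transcripts if t["language_code"] == language_code]
        if vss_id is not None:
            # Narrow down by the optional identifier.
            matches = [t for t in matches if t["vss_id"] == vss_id]
        if not matches:
            raise KeyError(language_code)
        if len(matches) > 1:
            options = ", ".join(t["vss_id"] for t in matches)
            raise AmbiguousLanguageCode(
                f"{language_code!r} matches several transcripts; pass vss_id (one of: {options})"
            )
        return matches[0]


transcript_list = TranscriptListSketch([
    {"name": "English - CC1", "language_code": "en", "vss_id": "example-cc1"},
    {"name": "English - DTVCC1", "language_code": "en", "vss_id": "example-dtvcc1"},
    {"name": "German", "language_code": "de", "vss_id": "example-de"},
])
```

Existing calls such as `transcript_list.fetch("de")` keep working unchanged; only the rare ambiguous case (`fetch("en")` here) raises, and the error message lists the identifiers to choose from.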
Any thoughts on this?
Right, this is definitely a simple problem with an anything-but-simple solution.
As you mentioned, the most important thing is not to break code that depends on this module, so your second option seems quite practical.
I have seen implementations (in django I believe) of a "MultiDict" (or something like that) which acts exactly as a dictionary (allowing for indexing), but allows for duplicate keys. This is normally implemented by mapping keys to a list, and when indexing, you just return the first element.
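A rough sketch of that list-based idea (this is not Django's actual class; `FirstValueMultiDict`, `add` and `getall` are names made up for illustration):

```python
from collections import defaultdict


class FirstValueMultiDict:
    """Maps each key to a list of values; indexing returns the first one."""

    def __init__(self):
        self._values = defaultdict(list)

    def add(self, key, value):
        # Duplicate keys append to the list instead of overwriting.
        self._values[key].append(value)

    def __getitem__(self, key):
        if key not in self._values:
            raise KeyError(key)
        return self._values[key][0]

    def getall(self, key):
        # Return a copy of every value stored under this key.
        return list(self._values.get(key, []))


d = FirstValueMultiDict()
d.add("en", "English - CC1")
d.add("en", "English - DTVCC1")
print(d["en"])         # English - CC1
print(d.getall("en"))  # ['English - CC1', 'English - DTVCC1']
```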
Another way to implement it is with an auxiliary dictionary that maps each key to the index of its first appearance (so that you can still index normally), while still allowing you to iterate over the container if you need a specific item.
For example, you could have a multidict `d = { 'a': 1, 'b': 2, 'a': 3 }` (conceptually; a plain Python dict literal would keep only the last value for `'a'`) such that `d['a']` returns 1 and `d['b']` returns 2. As mentioned above, this would be implemented by storing a list of values `x = [1, 2, 3]` and a dictionary of first-appearance indices `y = {'a': 0, 'b': 1}`, such that `d['a'] = x[y['a']] = x[0] = 1`.
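Here is a small sketch of that layout, storing every pair in a list plus a dictionary from each key to the zero-based position of its first value. `IndexedMultiDict` and its method names are hypothetical, just to illustrate the idea:

```python
class IndexedMultiDict:
    """Keeps all (key, value) pairs; indexing returns the first value for a key."""

    def __init__(self, pairs=()):
        self._items = []        # every (key, value) pair, in insertion order
        self._first_index = {}  # key -> index of its first appearance in _items
        for key, value in pairs:
            self.add(key, value)

    def add(self, key, value):
        # Only the first occurrence of a key records its position.
        self._first_index.setdefault(key, len(self._items))
        self._items.append((key, value))

    def __getitem__(self, key):
        return self._items[self._first_index[key]][1]

    def items(self):
        # Iterate over all pairs, duplicates included.
        return iter(self._items)


d = IndexedMultiDict([("a", 1), ("b", 2), ("a", 3)])
print(d["a"])  # 1
print(d["b"])  # 2
print([value for key, value in d.items() if key == "a"])  # [1, 3]
```

Normal indexing behaves like a regular dictionary, and the duplicates stay reachable through iteration when a specific item is needed.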
When fetching transcripts for https://www.youtube.com/watch?v=gdsUKphmB3Y, I only get a subset of the available transcripts.
Using library:
On YouTube: