jdepoix / youtube-transcript-api

This is a Python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it requires neither an API key nor a headless browser, unlike Selenium-based solutions.
MIT License

Bug in listing all available subtitle tracks #288

Open Angel756984 opened 1 month ago

Angel756984 commented 1 month ago

Hi, I've discovered a bug in listing all available subtitle tracks for videos that have more than one manually created transcript with the same language code. I'm using the latest Python on the latest Windows with PyCharm, but the environment does not matter: the behavior is exactly the same on online services such as repl.it.

As just one example, you can test with the video "Trump campaign sets sights on another deep-blue state". The same applies in general to all videos published on the Fox News channel, and to quite a few other broadcast network channels, since they all have the following caption tracks:

image

You can run the code below:

from youtube_transcript_api import YouTubeTranscriptApi

# List every transcript the library reports for the example video.
subs = YouTubeTranscriptApi.list_transcripts('STjvfE4HVXY')
for sub in subs:
    print(f'code:<{sub.language_code}> auto:<{sub.is_generated}> lang:<{sub.language}>')

to obtain the following result:

image

This is the PyCharm debug view:

image

As you can see, track CC1 is missing from the list of tracks available for the video. I believe this is because, among the manually created tracks, both CC1 and DTVCC1 have the same language code 'en', and the language code is the only key of the dictionary, which is kept separately for auto-generated and manually created tracks. Since CC1 is listed before DTVCC1 (as shown below in the JSON extracted from the HTML video page), the code first saves CC1 under the key 'en' among the manually created tracks. When it later finds DTVCC1 with the same code 'en', again among the manually created tracks, it overwrites CC1, which then no longer appears.

  "captions": {
    "playerCaptionsTracklistRenderer": {
      "captionTracks": [
        {
          "baseUrl": "...",
          "name": {
            "simpleText": "Inglese (generati automaticamente)"
          },
          "vssId": "a.en",
          "languageCode": "en",
          "kind": "asr",
          "isTranslatable": true,
          "trackName": ""
        },
        {
          "baseUrl": "...",
          "name": {
            "simpleText": "Inglese - CC1"
          },
          "vssId": ".en.uYU-mmqFLq8",
          "languageCode": "en",
          "isTranslatable": true,
          "trackName": "CC1"
        },
        {
          "baseUrl": "...",
          "name": {
            "simpleText": "Inglese - DTVCC1"
          },
          "vssId": ".en.JkeT_87f4cc",
          "languageCode": "en",
          "isTranslatable": true,
          "trackName": "DTVCC1"
        },
        {
          "baseUrl": "...",
          "name": {
            "simpleText": "Inglese (Stati Uniti)"
          },
          "vssId": ".en-US",
          "languageCode": "en-US",
          "isTranslatable": true,
          "trackName": ""
        }
      ],
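The overwrite described above can be reproduced without the library. What follows is only a minimal sketch, assuming the tracks are stored in two dictionaries (auto-generated and manually created) keyed by languageCode alone, as I believe happens. The function name group_by_language and the reduced track entries are hypothetical and are not taken from the library's code.

```python
# Hypothetical reduction of the captionTracks entries from the JSON above;
# only the fields relevant to the key collision are kept.
caption_tracks = [
    {"languageCode": "en", "kind": "asr", "trackName": ""},
    {"languageCode": "en", "trackName": "CC1"},
    {"languageCode": "en", "trackName": "DTVCC1"},
    {"languageCode": "en-US", "trackName": ""},
]


def group_by_language(tracks):
    """Split tracks into generated/manual dicts keyed by language code only."""
    generated, manual = {}, {}
    for track in tracks:
        target = generated if track.get("kind") == "asr" else manual
        # A later track with the same language code replaces the earlier one.
        target[track["languageCode"]] = track
    return generated, manual


generated, manual = group_by_language(caption_tracks)
print(list(manual))                # ['en', 'en-US']
print(manual["en"]["trackName"])   # DTVCC1 (CC1 was overwritten)
```

Three manually created tracks go in, but only two keys come out, which matches the missing CC1 entry in the listing.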

I believe this bug can be solved by adding 'trackName' to the key of the tracks dictionary, so that keys such as 'en CC1' and 'en DTVCC1' would no longer cause the loss of tracks that have the same language code and are both manually created.
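A possible shape for that fix, again only as a sketch under assumptions: the helper track_key and the function group_by_language_and_name are hypothetical names, and the exact key format (language code, a space, then the track name when one is set) is just one option, not something the library defines.

```python
# Same hypothetical reduction of the captionTracks entries as in the JSON above.
caption_tracks = [
    {"languageCode": "en", "kind": "asr", "trackName": ""},
    {"languageCode": "en", "trackName": "CC1"},
    {"languageCode": "en", "trackName": "DTVCC1"},
    {"languageCode": "en-US", "trackName": ""},
]


def track_key(track):
    """Hypothetical key: language code, plus the track name when it is not empty."""
    name = track.get("trackName", "")
    return f"{track['languageCode']} {name}" if name else track["languageCode"]


def group_by_language_and_name(tracks):
    """Split tracks into generated/manual dicts keyed by language code and name."""
    generated, manual = {}, {}
    for track in tracks:
        target = generated if track.get("kind") == "asr" else manual
        target[track_key(track)] = track
    return generated, manual


generated, manual = group_by_language_and_name(caption_tracks)
print(list(manual))   # ['en CC1', 'en DTVCC1', 'en-US']
```

Tracks with an empty trackName keep the plain language code as their key, so lookups such as 'en-US' would keep working, while CC1 and DTVCC1 no longer collide.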

Thank you, and please let me know.