jdepoix / youtube-transcript-api

This is a python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles and it does not require an API key nor a headless browser, like other selenium based solutions do!
MIT License

video disabled due to region lock shows Transcript/subtitle disabled. #213

Open michaelthwan opened 1 year ago

michaelthwan commented 1 year ago

To Reproduce

Steps to reproduce the behavior:

What code / cli command are you executing?

A user tried to extract the transcript of this video: https://www.youtube.com/watch?v=kZsVStYdmws

This video is available only in some regions (e.g. Hong Kong, Taiwan) and not in others (e.g. United States). It therefore works locally (Hong Kong), but after deployment to a US server it shows "Subtitles are disabled for this video".

The following code reproduces it. It works when using a VPN for the HK region and fails for the US.

from youtube_transcript_api import YouTubeTranscriptApi

video_id = "kZsVStYdmws"
YouTubeTranscriptApi.list_transcripts(video_id)

Which Python version are you using?

Python 3.10.8

Which version of youtube-transcript-api are you using?

youtube-transcript-api 0.6.0

Expected behavior

Describe what you expected to happen. I think it is fine that the transcript cannot be fetched from a region where the video is blocked, but the exception is confusing, and I spent a while troubleshooting to understand why it happened.

This is probably because the code reaches the raise TranscriptsDisabled line. Adding one more exception check before it might help.

    def _extract_captions_json(self, html, video_id):
        splitted_html = html.split('"captions":')

        if len(splitted_html) <= 1:
            if video_id.startswith('http://') or video_id.startswith('https://'):
                raise InvalidVideoId(video_id)
            if 'class="g-recaptcha"' in html:
                raise TooManyRequests(video_id)
            if '"playabilityStatus":' not in html:
                raise VideoUnavailable(video_id)

            # Here: add a new exception for region-locked videos

            raise TranscriptsDisabled(video_id)

Actual behaviour

It shows "Subtitles are disabled for this video" in regions where the video is blocked, even though subtitles are enabled.

michaelthwan commented 1 year ago

I will respect whether you fix it or not. Thanks for handling

jdepoix commented 1 year ago

Hi @michaelthwan, thank you for reporting. I agree: this is not something we can do anything about, but a more descriptive error message would be nice. I am currently a bit short on time to implement this myself, but I will put it on the list and contributions will be very much welcome! 😊

crhowell commented 12 months ago

@jdepoix I finally had some down time, taking a look at this issue.

YouTube still classifies this error as "Video unavailable" for the main reason, but adds subreason text that reads: The uploader has not made this video available in your country

In the browser, in place of the video not loading due to a region lock we get a black background with white text showing:

Video unavailable The uploader has not made this video available in your country

In the HTML we end up with this to search against

"playabilityStatus":{"status":"UNPLAYABLE","reason":"Video unavailable","errorScreen":{"playerErrorMessageRenderer":{"subreason":{"runs":[{"text":"The uploader has not made this video available in your country"}]}

We could add a new error class such as this, to keep it somewhat in line with what's in YouTube's response.

# file: youtube_transcript_api/_errors.py

class VideoUnplayable(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = 'The video has not been made available in your country'

Though it would be another search for an exact string match against html such as

def _extract_captions_json(self, html, video_id):
    splitted_html = html.split('"captions":')

    if len(splitted_html) <= 1:
        if video_id.startswith('http://') or video_id.startswith('https://'):
            raise InvalidVideoId(video_id)
        if 'class="g-recaptcha"' in html:
            raise TooManyRequests(video_id)
        if '"playabilityStatus":' not in html:
            raise VideoUnavailable(video_id)     

        # add something like this
        if 'The uploader has not made this video available in your country' in html:
            raise VideoUnplayable(video_id)

It's a little fragile, but I think you've said before that technically this entire API is unofficial and could break at any time anyway. Let me know what you think. I could open a PR for this and probably add a test case or two while I have some downtime.

crhowell commented 12 months ago

@jdepoix Interestingly enough, we could also add an age-related error class. It seems we could get around the age restriction entirely, since you can pull a transcript regardless of whether you are logged in or not, but that would require adding logic around my findings in #110. Until that workaround is implemented, we could at least throw an appropriate error in much the same way as for the country/region lock, since the HTML to match on lives in the same spot and looks like this.

"playabilityStatus":{"status":"LOGIN_REQUIRED","reason":"Sign in to confirm your age","errorScreen":{"playerErrorMessageRenderer":{"subreason":{"runs":[{"text":"This video may be inappropriate for some users."}]}

This would let us also sign off #111 until a workaround is implemented.

jdepoix commented 11 months ago

Hi @crhowell, thanks for looking into this and sorry for the late reply! It looks like the data in "playabilityStatus" could generally be useful for providing more helpful exceptions and error messages! We could add an exception type for each status (LoginRequired, VideoUnplayable) which renders playabilityStatus.reason and playabilityStatus.errorScreen.playerErrorMessageRenderer.subreason.runs as part of the error message.

However, just looking for a natural-language string in the html is definitely too fragile, as it will probably be in a different language depending on the locale. But isn't this part of the json we are parsing in json.loads(splitted_html[1].split(',"videoDetails')[0].replace('\n', '')) anyway? In that case we could just check what the status is and throw the corresponding exception, while passing in the reason/subreason. If it is not part of the json we are currently parsing, I guess we should find a way to parse it, since everything else will be very fragile.
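To make the idea of one exception type per status concrete, here is a minimal sketch. It is not the library's code: the simplified CouldNotRetrieveTranscript base class, the constructor arguments, the STATUS_TO_ERROR mapping, and the error_for_status helper are all assumptions made for illustration. Only the status values and the JSON shape come from the snippets quoted in this thread.

```python
class CouldNotRetrieveTranscript(Exception):
    """Simplified stand-in for the library's base error (assumption)."""

    CAUSE_MESSAGE = ""

    def __init__(self, video_id, reason=None, subreasons=()):
        self.video_id = video_id
        self.reason = reason
        self.subreasons = list(subreasons)
        super().__init__(self._build_message())

    def _build_message(self):
        # Render the static cause plus whatever YouTube reported.
        parts = [f"Could not retrieve a transcript for video {self.video_id}."]
        if self.CAUSE_MESSAGE:
            parts.append(self.CAUSE_MESSAGE)
        if self.reason:
            parts.append(f"YouTube reason: {self.reason}")
        for subreason in self.subreasons:
            parts.append(f" - {subreason}")
        return "\n".join(parts)


class LoginRequired(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "YouTube requires a login to play this video"


class VideoUnplayable(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = "YouTube reports this video as unplayable"


# Hypothetical mapping from playabilityStatus.status to an exception type.
STATUS_TO_ERROR = {
    "LOGIN_REQUIRED": LoginRequired,
    "UNPLAYABLE": VideoUnplayable,
}


def error_for_status(video_id, playability_status):
    """Return an exception instance for a known status, else None."""
    error_class = STATUS_TO_ERROR.get(playability_status.get("status"))
    if error_class is None:
        return None
    runs = (
        playability_status.get("errorScreen", {})
        .get("playerErrorMessageRenderer", {})
        .get("subreason", {})
        .get("runs", [])
    )
    subreasons = [run.get("text", "") for run in runs]
    return error_class(video_id, playability_status.get("reason"), subreasons)
```

Because the decision is keyed on the status field rather than on the human-readable text, the locale of the reason/subreason strings would only affect what is displayed, not which exception is raised.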

crhowell commented 11 months ago

@jdepoix Well, the logic in there branches on whether or not splitted_html has an index 1.

Basically, we split the html on captions with html.split('"captions":'). If the resulting list has a length less than or equal to 1, we will ALWAYS raise an exception and json.loads never runs.

Otherwise, if the list has more than one element, we try to parse the element at index 1.

But for these specific errors, from what I've inspected via a debug breakpoint, the list never has more than one element, so we never reach the json.loads side of the branch. We always raise the exception, which leaves us back with the fragile "in html" check.

Let me include a snippet of the full function logic

def _extract_captions_json(self, html, video_id):
    splitted_html = html.split('"captions":')
    if len(splitted_html) <= 1:
        if video_id.startswith('http://') or video_id.startswith('https://'):
            raise InvalidVideoId(video_id)
        if 'class="g-recaptcha"' in html:
            raise TooManyRequests(video_id)
        if '"playabilityStatus":' not in html:
            raise VideoUnavailable(video_id)
        # NOTE: this is where we hit for our current issues errors.
        raise TranscriptsDisabled(video_id)

    captions_json = json.loads(
        splitted_html[1].split(',"videoDetails')[0].replace('\n', '')
    ).get('playerCaptionsTracklistRenderer')
    if captions_json is None:
        raise TranscriptsDisabled(video_id)

    if 'captionTracks' not in captions_json:
        raise NoTranscriptAvailable(video_id)

    return captions_json

Update: Confirmed that for both the age-restricted video and the country/region-locked video, len(splitted_html) is 1.
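The branching can be illustrated with a small sketch. The two HTML strings below are heavily shortened, made-up stand-ins for the real page content, and has_captions_section is a hypothetical helper, not part of the library.

```python
# Hypothetical, shortened stand-ins for the player HTML.
html_with_captions = (
    '"playabilityStatus":{"status":"OK"},'
    '"captions":{"playerCaptionsTracklistRenderer":{"captionTracks":[]}},'
    '"videoDetails":{}'
)
html_region_locked = (
    '"playabilityStatus":{"status":"UNPLAYABLE","reason":"Video unavailable"}'
)


def has_captions_section(html):
    # str.split returns a single-element list when the separator is absent,
    # which is the case that always ends in a raised exception.
    return len(html.split('"captions":')) > 1


print(has_captions_section(html_with_captions))
print(has_captions_section(html_region_locked))
```

The second string contains "playabilityStatus" but no "captions" key, so it falls through every earlier check and reaches the final raise TranscriptsDisabled.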

michaelthwan commented 11 months ago

You guys are very helpful. Thank you @crhowell @jdepoix

jdepoix commented 11 months ago

Hi @crhowell, yeah, that makes sense, but this should be solvable 😊

if len(splitted_html) <= 1:
    if video_id.startswith('http://') or video_id.startswith('https://'):
        raise InvalidVideoId(video_id)
    if 'class="g-recaptcha"' in html:
        raise TooManyRequests(video_id)
    splitted_html = html.split('"playabilityStatus":')
    if len(splitted_html) <= 1:
        raise VideoUnavailable(video_id)

    playability_status_json = json.loads(
        splitted_html[1].split(',"WHAT_EVER_THE_NEXT_PROPERTY_IS')[0].replace('\n', '')
    )

    # ... handle playability_status_json ...

    # fallback if we don't know the status
    raise TranscriptsDisabled(video_id)
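One possible way to avoid depending on whatever property follows "playabilityStatus" is the standard library's json.JSONDecoder.raw_decode, which parses a single JSON value starting at a given index and ignores any trailing data. The sketch below is an alternative to the second split, not what the library does; extract_playability_status is a hypothetical helper name.

```python
import json


def extract_playability_status(html):
    """Return the playabilityStatus object as a dict, or None if absent."""
    marker = '"playabilityStatus":'
    start = html.find(marker)
    if start == -1:
        return None

    # raw_decode does not skip leading whitespace, so do that by hand.
    index = start + len(marker)
    while index < len(html) and html[index] in " \t\r\n":
        index += 1

    # Parses exactly one JSON value at `index`; everything after the
    # closing brace of that value is left untouched.
    status, _end = json.JSONDecoder().raw_decode(html, index)
    return status
```

With this approach a missing marker maps naturally onto the VideoUnavailable case above, and a json.JSONDecodeError would still need handling if the page content turns out not to be valid JSON at that position.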

crhowell commented 11 months ago

@jdepoix I can put together an initial-pass PR for this; I have a partial solution already. I'll test it against the age/region error cases as well as the valid working cases, so we can see what kind of "reason" shows up when everything is working fine and transcripts are retrievable.

I'll tag you for review once it is submitted.

Update: PR https://github.com/jdepoix/youtube-transcript-api/pull/219

Note, this PR is a quick first pass at it. Worth testing against more video IDs, I am sure there are some edge cases and more "status" values we might be able to get to add as custom errors.

I did a little bit of testing. Let me know what you do or don't like and we can tweak it as necessary. I still need to add a few tests for the helpers, so coverage dropped a tiny bit.