Open michaelthwan opened 1 year ago
I will respect whether you fix it or not. Thanks for handling
Hi @michaelthwan, thank you for reporting. I agree: this is not something we can do anything about, but a more descriptive error message would be nice. I am currently a bit short on time to implement this myself, but I will put it on the list and contributions will be very much welcome! 😊
@jdepoix I finally had some down time, taking a look at this issue.
As far as YouTube is concerned, this error still has "Video unavailable" as its main reason, but it carries subreason text that reads: The uploader has not made this video available in your country
In the browser, in place of the video not loading due to a region lock we get a black background with white text showing:
Video unavailable The uploader has not made this video available in your country
In the HTML we end up with this to search against
"playabilityStatus":{"status":"UNPLAYABLE","reason":"Video unavailable","errorScreen":{"playerErrorMessageRenderer":{"subreason":{"runs":[{"text":"The uploader has not made this video available in your country"}]}
We could add a new error class such as this, to keep it somewhat in line with what's in YouTube's response?
# file: youtube_transcript_api/_errors.py
class VideoUnplayable(CouldNotRetrieveTranscript):
    CAUSE_MESSAGE = 'The video has not been made available in your country'
Though it would mean another exact string match against the HTML, such as:
def _extract_captions_json(self, html, video_id):
    splitted_html = html.split('"captions":')
    if len(splitted_html) <= 1:
        if video_id.startswith('http://') or video_id.startswith('https://'):
            raise InvalidVideoId(video_id)
        if 'class="g-recaptcha"' in html:
            raise TooManyRequests(video_id)
        if '"playabilityStatus":' not in html:
            raise VideoUnavailable(video_id)
        # add something like this
        if 'The uploader has not made this video available in your country' in html:
            raise VideoUnplayable(video_id)
It's a little fragile, but I think you've said before that technically this entire API is unofficial and could break at any time anyway. Let me know what you think. I could open a PR for this and probably add a test case or two while I have some down time.
@jdepoix Interestingly enough, we could also add an age-related error class. It seems we could get around the age restriction when retrieving a transcript, since you can pull a transcript regardless of whether you are logged in. Doing that would require adding logic around my findings in #110. Until that workaround is implemented, we could at least throw an appropriate error in much the same way as for the country/region lock, since the HTML to match on lives in the same spot and looks like this:
"playabilityStatus":{"status":"LOGIN_REQUIRED","reason":"Sign in to confirm your age","errorScreen":{"playerErrorMessageRenderer":{"subreason":{"runs":[{"text":"This video may be inappropriate for some users."}]}
This would let us also sign off #111 until a workaround is implemented.
Hi @crhowell, thanks for looking into this and sorry for the late reply!
It looks like the data in "playabilityStatus" could generally be useful to provide more helpful exceptions and error messages! We could add an exception type for each status (LoginRequired, VideoUnplayable) which renders playabilityStatus.reason and playabilityStatus.errorScreen.playerErrorMessageRenderer.subreason.runs as part of the error message. However, just looking for a natural-language string in the HTML is definitely too fragile, as it will probably be in a different language depending on the locale. But isn't this part of the JSON we are parsing in json.loads(splitted_html[1].split(',"videoDetails')[0].replace('\n', '')) anyway? In that case we could just check what the status is and throw the corresponding exception, passing in the reason/subreason. If it is not part of the JSON we are currently parsing, I guess we should find a way to parse it, since anything else will be very fragile.
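A minimal sketch of the status-to-exception idea, assuming the playabilityStatus object has already been parsed into a dict. The base class `PlayabilityError`, the constructor signature, and the helper name `exception_for_playability_status` are hypothetical and only for illustration; the mapping covers just the two status values quoted in this thread.

```python
# Hypothetical stand-ins for illustration only; the library's real exception
# hierarchy and constructor signatures may differ.
class PlayabilityError(Exception):
    def __init__(self, video_id, reason=None, subreasons=()):
        self.video_id = video_id
        self.reason = reason
        self.subreasons = list(subreasons)
        details = ([reason] if reason else []) + self.subreasons
        message = "Could not retrieve a transcript for video {}".format(video_id)
        if details:
            message += ": " + " - ".join(details)
        super().__init__(message)


class LoginRequired(PlayabilityError):
    pass


class VideoUnplayable(PlayabilityError):
    pass


# Only the two status values quoted in this thread; others may exist.
STATUS_TO_EXCEPTION = {
    "LOGIN_REQUIRED": LoginRequired,
    "UNPLAYABLE": VideoUnplayable,
}


def exception_for_playability_status(video_id, playability_status):
    """Return an exception instance for a known error status, else None."""
    exception_class = STATUS_TO_EXCEPTION.get(playability_status.get("status"))
    if exception_class is None:
        return None
    # Walk the nested error screen defensively; any level may be missing.
    runs = (
        playability_status.get("errorScreen", {})
        .get("playerErrorMessageRenderer", {})
        .get("subreason", {})
        .get("runs", [])
    )
    subreasons = [run["text"] for run in runs if "text" in run]
    return exception_class(video_id, playability_status.get("reason"), subreasons)
```

Because the branch is chosen by the status code and the reason/subreason text is only passed through into the message, nothing here depends on the language YouTube happens to render the text in.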
@jdepoix Well, the logic in there branches on whether or not splitted_html has an index 1.
We split the HTML on captions with html.split('"captions":'). If the resulting list has a length less than or equal to 1, we will ALWAYS raise an exception and json.loads never runs.
Otherwise, if the list has more than one element, we do try to parse the element at index 1.
But for these specific errors, from what I've inspected via a debug breakpoint, the list has only one element, so we never reach the json.loads side of the branch. We always raise the exception, which leaves us back with the fragile "in html" check.
Let me include a snippet of the full function logic
def _extract_captions_json(self, html, video_id):
    splitted_html = html.split('"captions":')
    if len(splitted_html) <= 1:
        if video_id.startswith('http://') or video_id.startswith('https://'):
            raise InvalidVideoId(video_id)
        if 'class="g-recaptcha"' in html:
            raise TooManyRequests(video_id)
        if '"playabilityStatus":' not in html:
            raise VideoUnavailable(video_id)
        # NOTE: this is where we hit for our current issues errors.
        raise TranscriptsDisabled(video_id)
    captions_json = json.loads(
        splitted_html[1].split(',"videoDetails')[0].replace('\n', '')
    ).get('playerCaptionsTracklistRenderer')
    if captions_json is None:
        raise TranscriptsDisabled(video_id)
    if 'captionTracks' not in captions_json:
        raise NoTranscriptAvailable(video_id)
    return captions_json
Update
Confirmed that for both the age-restricted video and the country/region-locked video, len(splitted_html) will be 1.
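To make the branching concrete, here is a small sketch of how the split behaves. The two HTML strings are made-up, heavily shortened stand-ins for real watch pages, and `count_caption_parts` is a hypothetical helper used only for this illustration.

```python
def count_caption_parts(html):
    """Number of pieces produced by splitting the page on the captions key."""
    return len(html.split('"captions":'))


# Made-up, shortened page fragments; real pages are far larger.
region_locked_html = (
    '"playabilityStatus":{"status":"UNPLAYABLE","reason":"Video unavailable"}'
)
playable_html = (
    '"playabilityStatus":{"status":"OK"},'
    '"captions":{"playerCaptionsTracklistRenderer":{}},"videoDetails":{}'
)

# No '"captions":' key, so the split returns the whole string as one element
# and the function takes the "always raise" branch.
print(count_caption_parts(region_locked_html))

# With the key present there is an element at index 1 to hand to json.loads.
print(count_caption_parts(playable_html))
```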
You guys are very helpful. Thank you @crhowell @jdepoix
Hi @crhowell, yeah, that makes sense, but this should be solvable 😊
if len(splitted_html) <= 1:
    if video_id.startswith('http://') or video_id.startswith('https://'):
        raise InvalidVideoId(video_id)
    if 'class="g-recaptcha"' in html:
        raise TooManyRequests(video_id)
    splitted_html = html.split('"playabilityStatus":')
    if len(splitted_html) <= 1:
        raise VideoUnavailable(video_id)
    playability_status_json = json.loads(
        splitted_html[1].split(',"WHAT_EVER_THE_NEXT_PROPERTY_IS')[0].replace('\n', '')
    )
    # ... handle playability_status_json ...
    # fallback if we don't know the status
    raise TranscriptsDisabled(video_id)
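One possible way to avoid depending on whatever property follows "playabilityStatus" is to let the standard library's json.JSONDecoder.raw_decode parse a single JSON value starting right after the key and ignore the rest of the page. This is a different technique from the split shown above, not what the library does; the function name `extract_playability_status` is hypothetical, and the sketch assumes the first occurrence of the key is the one we want.

```python
import json


def extract_playability_status(html):
    """Return the playabilityStatus object as a dict, or None if not found."""
    marker = '"playabilityStatus":'
    start = html.find(marker)
    if start == -1:
        return None
    index = start + len(marker)
    # raw_decode does not skip leading whitespace, so do that by hand.
    while index < len(html) and html[index].isspace():
        index += 1
    try:
        # Parses exactly one JSON value at `index`; trailing text is ignored.
        value, _end = json.JSONDecoder().raw_decode(html, index)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


# Made-up page fragment; "otherProperty" is a placeholder, not a real key.
sample_html = (
    'var data = {"playabilityStatus":{"status":"LOGIN_REQUIRED",'
    '"reason":"Sign in to confirm your age"},"otherProperty":{}};'
)
status = extract_playability_status(sample_html)
print(status["status"] if status else "no playabilityStatus found")
```

The returned dict could then be checked for its status value before falling back to TranscriptsDisabled.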
@jdepoix I can throw together an initial-pass PR for this; I have a partial solution already. I'll test it against the age/region error cases as well as the valid working cases, so we can see what kind of "reason" shows up when everything is working fine and transcripts are retrievable.
I'll tag you for review once it's submitted.
Update PR https://github.com/jdepoix/youtube-transcript-api/pull/219
Note: this PR is a quick first pass. It's worth testing against more video IDs; I am sure there are some edge cases and more "status" values that we could turn into custom errors.
I did a little bit of testing. Let me know what you do or don't like and we can tweak it as necessary. I still need to add a few tests for the helpers, so coverage dropped a tiny bit.
To Reproduce
Steps to reproduce the behavior:
What code / cli command are you executing?
A user tried extracting this video https://www.youtube.com/watch?v=kZsVStYdmws This video is available only in some regions (e.g. Hong Kong, Taiwan) but not in others (e.g. United States). Therefore it works locally (Hong Kong), but after deployment (to a US server) it shows
Subtitles are disabled for this video
This code can reproduce that. It works when using a VPN for the HK region, but doesn't work for the US.
Which Python version are you using?
Python 3.10.8
Which version of youtube-transcript-api are you using?
youtube-transcript-api 0.6.0
Expected behavior
Describe what you expected to happen. I think it is okay that a region where the video is disabled cannot fetch the transcript, but the exception is confusing; I spent a while troubleshooting to understand why it happened.
Potentially, it is because it entered
raise TranscriptsDisabled
part. Therefore, adding one more exception case might help.
Actual behaviour
It shows
Subtitles are disabled for this video
for a region where the video is disabled, even though subtitles are enabled.