jdepoix / youtube-transcript-api

This is a Python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it does not require an API key or a headless browser, unlike other Selenium-based solutions!
MIT License

Expecting ',' delimiter: line 1 column 1576 (char 1575) #131

Closed thoughtfuldata closed 2 years ago

thoughtfuldata commented 2 years ago

This may be out of scope, as I am using youtube-transcript-api with parallel processing and the issue only happens then. However, I believe the bug lies in the way youtube-transcript-api handles this error.

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 : 10.0.19041
youtube-transcript-api Version: 0.4.1
Python version: 3.9.6

I originally believed this to be an issue with the parallel processing package. However, after speaking with the maintainer of that package, his guess was the following ("you" here refers to me):

my guess would be you bombed some server with too many concurrent requests, _fetch_video_html gave up and returned some 500 Internal Server Error or so response (some non-200 response which does not contain a payload and so can not be json decoded). Maybe they forgot to add a response.raise_for_status() which would have made this traceback more verbose.
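The maintainer's point can be illustrated with a small, self-contained sketch. The `FakeResponse` class below is a stand-in for a `requests.Response` (not the library's actual HTTP client): without a `raise_for_status()` call, a 500 error page reaches `json.loads` and surfaces as a confusing `JSONDecodeError`; with it, the failure is reported as an HTTP error up front.

```python
import json


class FakeResponse:
    """Stand-in for a requests.Response, used only to illustrate the point."""

    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text

    def raise_for_status(self):
        # Mirrors requests' behaviour: a non-2xx response raises immediately.
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP error {self.status_code}")


def extract_payload(response):
    # Failing fast here means an error page never reaches the JSON parser.
    response.raise_for_status()
    return json.loads(response.text)
```

For example, `extract_payload(FakeResponse(500, "<html>Internal Server Error</html>"))` raises a clear HTTP error instead of a `JSONDecodeError` about an unexpected delimiter.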

Here's the remote traceback:


---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "c:\Users\manue\Documents\Github\data-science-venv\.venv\lib\site-packages\multiprocess\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "c:\Users\manue\Documents\Github\data-science-venv\.venv\lib\site-packages\pathos\helpers\mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "c:\Users\manue\Documents\Github\data-science-venv\.venv\lib\site-packages\mapply\mapply.py", line 105, in run_apply
    return df_or_series.apply(func, args=args, **kwargs)
  File "c:\Users\manue\Documents\Github\data-science-venv\.venv\lib\site-packages\pandas\core\frame.py", line 8740, in apply
    return op.apply()
  File "c:\Users\manue\Documents\Github\data-science-venv\.venv\lib\site-packages\pandas\core\apply.py", line 688, in apply
    return self.apply_standard()
  File "c:\Users\manue\Documents\Github\data-science-venv\.venv\lib\site-packages\pandas\core\apply.py", line 812, in apply_standard
    results, res_index = self.apply_series_generator()
  File "c:\Users\manue\Documents\Github\data-science-venv\.venv\lib\site-packages\pandas\core\apply.py", line 828, in apply_series_generator
    results[i] = self.f(v)
  File "C:\Users\manue\AppData\Local\Temp/ipykernel_10772/780513497.py", line 5, in get_transcripts
  File "c:\Users\manue\Documents\Github\data-science-venv\.venv\lib\site-packages\youtube_transcript_api\_api.py", line 128, in get_transcript
    return cls.list_transcripts(video_id, proxies, cookies).find_transcript(languages).fetch()
  File "c:\Users\manue\Documents\Github\data-science-venv\.venv\lib\site-packages\youtube_transcript_api\_api.py", line 70, in list_transcripts
    return TranscriptListFetcher(http_client).fetch(video_id)
  File "c:\Users\manue\Documents\Github\data-science-venv\.venv\lib\site-packages\youtube_transcript_api\_transcripts.py", line 36, in fetch
    self._extract_captions_json(self._fetch_video_html(video_id), video_id)
  File "c:\Users\manue\Documents\Github\data-science-venv\.venv\lib\site-packages\youtube_transcript_api\_transcripts.py", line 50, in _extract_captions_json
    captions_json = json.loads(
  File "C:\Users\manue\.pyenv\pyenv-win\versions\3.9.6\lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "C:\Users\manue\.pyenv\pyenv-win\versions\3.9.6\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\manue\.pyenv\pyenv-win\versions\3.9.6\lib\json\decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 1576 (char 1575)
"""

Let me know if anything else is needed.

thoughtfuldata commented 2 years ago

Actually, after more research, it seems to happen intermittently with this call:

youtube_transcript_api.YouTubeTranscriptApi.get_transcript('BaxBFnIUTrc')

c:\Users\manue\Documents\Github\data-science-venv\.venv\lib\site-packages\youtube_transcript_api\_api.py in get_transcript(cls, video_id, languages, proxies, cookies)
    126         :rtype [{'text': str, 'start': float, 'end': float}]:
    127         """
--> 128         return cls.list_transcripts(video_id, proxies, cookies).find_transcript(languages).fetch()
    129 
    130     @classmethod

c:\Users\manue\Documents\Github\data-science-venv\.venv\lib\site-packages\youtube_transcript_api\_api.py in list_transcripts(cls, video_id, proxies, cookies)
     68                 http_client.cookies = cls._load_cookies(cookies, video_id)
     69             http_client.proxies = proxies if proxies else {}
---> 70             return TranscriptListFetcher(http_client).fetch(video_id)
     71 
     72     @classmethod

c:\Users\manue\Documents\Github\data-science-venv\.venv\lib\site-packages\youtube_transcript_api\_transcripts.py in fetch(self, video_id)
     34             self._http_client,
     35             video_id,
---> 36             self._extract_captions_json(self._fetch_video_html(video_id), video_id)
     37         )
     38 

c:\Users\manue\Documents\Github\data-science-venv\.venv\lib\site-packages\youtube_transcript_api\_transcripts.py in _extract_captions_json(self, html, video_id)
     48             raise TranscriptsDisabled(video_id)
     49 
---> 50         captions_json = json.loads(
     51             splitted_html[1].split(',"videoDetails')[0].replace('\n', '')
     52         )['playerCaptionsTracklistRenderer']

~\.pyenv\pyenv-win\versions\3.9.8\lib\json\__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    344             parse_int is None and parse_float is None and
    345             parse_constant is None and object_pairs_hook is None and not kw):
--> 346         return _default_decoder.decode(s)
    347     if cls is None:
    348         cls = JSONDecoder

~\.pyenv\pyenv-win\versions\3.9.8\lib\json\decoder.py in decode(self, s, _w)
    335 
    336         """
--> 337         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338         end = _w(s, end).end()
    339         if end != len(s):

~\.pyenv\pyenv-win\versions\3.9.8\lib\json\decoder.py in raw_decode(self, s, idx)
    351         """
    352         try:
--> 353             obj, end = self.scan_once(s, idx)
    354         except StopIteration as err:
    355             raise JSONDecodeError("Expecting value", s, err.value) from None

JSONDecodeError: Expecting ',' delimiter: line 1 column 1576 (char 1575) 
jdepoix commented 2 years ago

Hi @thoughtfuldata, it's hard to analyse much without seeing your code, but given that you're doing multiple requests in parallel, this is not surprising at all. One of the recurring problems when using this module is that YouTube tends to block requests when you execute too many at a time, and there's not really anything we can do about that. So when you parallelise requests, this problem will only become more apparent.
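Since the blocking itself can't be prevented on the library side, one common mitigation on the caller's side is to retry with exponential backoff and jitter, so parallel workers don't hammer YouTube in lockstep. This is a minimal sketch of that idea; `fetch` is a placeholder for a call such as `YouTubeTranscriptApi.get_transcript`, and the parameter names are illustrative, not part of the library's API.

```python
import random
import time


def fetch_with_backoff(fetch, video_id, attempts=4, base_delay=1.0):
    # Retry a flaky fetch with exponential backoff plus jitter.
    # `fetch` stands in for a call like YouTubeTranscriptApi.get_transcript.
    for attempt in range(attempts):
        try:
            return fetch(video_id)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: re-raise the last error
            # Delay doubles each attempt; jitter desynchronises workers.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

Lowering the degree of parallelism in the first place is usually more effective than retrying, but a wrapper like this at least turns transient blocks into recoverable failures.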

However, I agree that I should add a raise_for_status() call in _fetch_video_html() and raise an exception wrapping the status code. Unfortunately, this won't really fix your problem.
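A minimal sketch of what such a status-wrapping exception might look like (this is illustrative only, not the library's actual exception class or name):

```python
class YouTubeRequestFailed(Exception):
    # Hypothetical exception carrying the HTTP status, so a blocked
    # request fails with a clear message instead of a JSONDecodeError
    # deep inside the parsing code.
    def __init__(self, video_id, status_code):
        self.video_id = video_id
        self.status_code = status_code
        super().__init__(
            f"Request for video {video_id} failed with status code {status_code}"
        )


def check_response(video_id, status_code):
    # Called right after fetching the watch page, before any parsing.
    if status_code != 200:
        raise YouTubeRequestFailed(video_id, status_code)
```

With a check like this in place, a 500 or 429 from YouTube would surface as a descriptive error naming the video and status code, making tracebacks like the one above much easier to interpret.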

jdepoix commented 2 years ago

A descriptive error message for non-success status codes is being added in #132.

thoughtfuldata commented 2 years ago

Thanks!

This helps me out

jdepoix commented 2 years ago

I forgot to mention: the improved error message has been released with version 0.4.2