openzim / ted

Provide the best of TED.com for offline usage!
https://download.kiwix.org/zim/ted/
GNU General Public License v3.0

TED by topics fails with KeyError: 'videoData' #163

Closed. benoit74 closed this issue 4 months ago

benoit74 commented 4 months ago

Recipe: https://farm.openzim.org/recipes/ted_topic_motivation
Task: https://farm.openzim.org/pipeline/0b5f1ef7-11ff-42da-99b4-0a31d348ad26/debug

[ted2zim::2024-03-02 08:22:20,385] DEBUG:extract_info_from_video_page: https://ted.com/talks/shawn_achor_the_happy_secret_to_better_work
[ted2zim::2024-03-02 08:22:22,109] ERROR:FAILED. An error occurred: 'videoData'
[ted2zim::2024-03-02 08:22:22,109] ERROR:'videoData'
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/ted2zim/entrypoint.py", line 193, in main
    scraper.run()
  File "/usr/local/lib/python3.11/site-packages/ted2zim/scraper.py", line 1064, in run
    if not self.extract_videos_from_topics(topic):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ted2zim/scraper.py", line 291, in extract_videos_from_topics
    total_videos_scraped = self.generate_search_results(topic)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ted2zim/scraper.py", line 258, in generate_search_results
    ) = self.extract_videos_in_search_results(result_json)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ted2zim/scraper.py", line 410, in extract_videos_in_search_results
    if self.extract_info_from_video_page(url):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/ted2zim/scraper.py", line 633, in extract_info_from_video_page
    json_data = json.loads(
                ^^^^^^^^^^^
KeyError: 'videoData'

benoit74 commented 4 months ago

I could not reproduce the problem. I suggest adding try/except logic that at least logs the HTML content we are trying to parse, so that we have more information the next time this happens. The server probably returned unexpected content.
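
A minimal sketch of what such try/except logging could look like. This is not the actual ted2zim code: the function name `extract_video_data`, the logger name, the `__NEXT_DATA__` script id, and the `props` / `pageProps` key path are all assumptions made for illustration. Only the `videoData` key comes from the traceback above.

```python
import json
import logging
import re

# Hypothetical logger name, for illustration only.
logger = logging.getLogger("ted2zim-sketch")

# Assumption: the talk metadata is embedded as JSON in a <script> tag with this id.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def extract_video_data(html: str, url: str):
    """Return the 'videoData' dict from a talk page, or None if it is unavailable."""
    try:
        match = NEXT_DATA_RE.search(html)
        if match is None:
            raise ValueError("embedded JSON script tag not found")
        page_data = json.loads(match.group(1))
        # Assumed key path; a missing key raises KeyError as in the traceback.
        return page_data["props"]["pageProps"]["videoData"]
    except (ValueError, KeyError, TypeError) as exc:
        # json.JSONDecodeError is a subclass of ValueError.
        logger.error("Could not extract video data from %s: %r", url, exc)
        # Log the raw HTML so an unexpected server response can be inspected later.
        logger.debug("HTML received for %s:\n%s", url, html)
        return None
```

Returning `None` instead of raising would let the caller skip a single bad page rather than abort the whole topic run. Whether skipping is the right behaviour for the scraper is a separate decision.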