Closed benoit74 closed 2 months ago
Moving this to 3.1.0: it is mostly straightforward to implement and seems to be randomly impacting about 5-10% of the recipes.
Since we currently have no plan for when we will be able to work on 3.1.0, and since this bug makes the success of https://farm.openzim.org/recipes/ted_topic_all mostly impossible, I'm going to make a patch release, 3.0.3.
In order to retrieve video info, the TED scraper fetches the video page with a URL like https://ted.com/talks/franco_sacchi_a_tour_of_nollywood_nigeria_s_booming_film_industry?language=nl and looks for the `__NEXT_DATA__` JSON inside the page, where it finds, among other things, the localized title and description. This is done in the `extract_info_from_video_page` function in `scraper.py`.
We currently have a few recipes intermittently failing with the error `An error occurred: 'NoneType' object has no attribute 'string'`.
Looking at the HTML content, there is no `__NEXT_DATA__` JSON inside the page. Loading the page again on my machine, the `__NEXT_DATA__` JSON is there. So clearly the scraper should be more resilient to intermittent bad responses from the TED server.
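To make the failure mode concrete, here is a minimal sketch of extracting the `__NEXT_DATA__` JSON in a way that fails with an explicit error instead of an `AttributeError` on `None`. It is not the code in `scraper.py`: the regular expression, the `MissingNextDataError` exception, and the `extract_next_data` helper are all hypothetical names, and the tag shape is an assumption.

```python
import json
import re

# Assumption: the data lives in a tag shaped like
# <script id="__NEXT_DATA__" type="application/json">{...}</script>
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


class MissingNextDataError(Exception):
    """Hypothetical exception: the page carries no __NEXT_DATA__ JSON."""


def extract_next_data(html: str) -> dict:
    """Return the parsed __NEXT_DATA__ JSON, or raise if the tag is absent."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        # An intermittent bad response ends up here, with a clear message,
        # rather than failing later on a None object.
        raise MissingNextDataError("no __NEXT_DATA__ script tag in page")
    return json.loads(match.group(1))
```

A dedicated exception like this gives the caller something specific to catch when deciding whether a page is worth fetching again.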
This was indeed the case in 2.10.0, where there was retry logic in `extract_info_from_video_page`; it got dropped in https://github.com/openzim/ted/pull/130/files when adapting to the new DOM. I think we should just restore this functionality by again pausing 5 secs and trying again up to 5 times, just like in 2.10.0.
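The retry behaviour proposed above could look roughly like the sketch below. This is not the 2.10.0 code: the `retry` helper and its parameters are hypothetical, and it assumes "up to 5 times" means 5 attempts in total, with a 5 second pause between failed attempts.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 5,
    pause_seconds: float = 5.0,
    retry_on: Tuple[Type[Exception], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds, pausing between failed attempts."""
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on as exc:
            last_error = exc
            # Only pause if another attempt is still to come.
            if attempt < attempts:
                sleep(pause_seconds)
    raise RuntimeError(f"giving up after {attempts} attempts") from last_error
```

The operation passed in would be whatever fetches the video page and parses its `__NEXT_DATA__` JSON, so that a bad response triggers a fresh request rather than a re-parse of the same HTML. The `sleep` parameter exists only so the pause can be replaced in tests.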