openzim / ted

Provide the best of TED.com for offline usage!
https://download.kiwix.org/zim/ted/
GNU General Public License v3.0

Restore functionality to resist temporary bad TED responses when parsing video pages #209

Closed (benoit74 closed 2 months ago)

benoit74 commented 3 months ago

To retrieve video information, the TED scraper fetches the video page at a URL like https://ted.com/talks/franco_sacchi_a_tour_of_nollywood_nigeria_s_booming_film_industry?language=nl and looks for the __NEXT_DATA__ JSON inside the page, where it finds, among other things, the localized title and description.

This is done in the extract_info_from_video_page function in scraper.py.

We currently have a few recipes intermittently failing with the error: An error occurred: 'NoneType' object has no attribute 'string'.

Looking at the HTML content returned in these cases, there is no __NEXT_DATA__ JSON inside the page.

Loading the page again on my machine, the __NEXT_DATA__ JSON is present.

So clearly the scraper should be more resilient to intermittent bad responses from the TED server.
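The failure mode described above can be sketched as follows. This is not the scraper's actual code: it is a minimal, standard-library-only illustration that assumes the page embeds its data in a script tag with id __NEXT_DATA__, and the helper name find_next_data and the JSON keys used in the sample page are hypothetical. The reported error message is consistent with accessing .string on a lookup result that came back as None; here the missing script is instead reported explicitly so the caller can react.

```python
import json
import re

# Assumption: the data is embedded as
# <script id="__NEXT_DATA__" type="application/json">...</script>
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def find_next_data(html):
    """Return the parsed __NEXT_DATA__ JSON, or None if the script is absent.

    Hypothetical helper: returning None lets the caller decide to retry
    instead of crashing on an attribute access.
    """
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


# A page as it normally looks (the JSON layout is made up for illustration).
good_page = (
    '<html><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"title": "A tour of Nollywood"}}'
    "</script></html>"
)

# A page as returned during an intermittent bad response: no script at all.
bad_page = "<html><body>Something went wrong</body></html>"
```

With these samples, find_next_data(good_page) returns the parsed dictionary, while find_next_data(bad_page) returns None rather than raising.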

This was indeed the case in 2.10.0, where extract_info_from_video_page had retry logic; it was dropped in https://github.com/openzim/ted/pull/130/files when adapting to the new DOM.

I think we should just restore this functionality by again pausing 5 seconds and retrying, up to 5 times, just like in 2.10.0.
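A rough sketch of what such a retry loop could look like is given here. It is not the 2.10.0 code: the function name fetch_with_retries, its parameters, and the choice to count 5 attempts in total (rather than 5 retries after a first try) are assumptions made for illustration. The fetch callable stands in for "download the page and extract the JSON, returning None on a bad response".

```python
import time


def fetch_with_retries(fetch, attempts=5, pause_seconds=5.0, sleep=time.sleep):
    """Call fetch() until it returns something other than None.

    Hypothetical helper: pauses pause_seconds between attempts and gives up
    with a RuntimeError after `attempts` unsuccessful tries.
    """
    for attempt in range(1, attempts + 1):
        data = fetch()
        if data is not None:
            return data
        # Do not pause after the final attempt; there is nothing left to wait for.
        if attempt < attempts:
            sleep(pause_seconds)
    raise RuntimeError(f"no usable response after {attempts} attempts")
```

Passing sleep as a parameter is only a convenience for testing the loop without actually waiting; in real use the default time.sleep applies.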

benoit74 commented 3 months ago

Moving this to 3.1.0: it is mostly straightforward to implement and seems to randomly impact about 5-10% of the recipes.

benoit74 commented 2 months ago

Since we currently have no plan for when we will be able to work on 3.1.0, and since this bug makes the success of https://farm.openzim.org/recipes/ted_topic_all mostly impossible, I'm going to make a patch release, 3.0.3.