openzim / ted

Provide the best of TED.com for offline usage!
https://download.kiwix.org/zim/ted/
GNU General Public License v3.0

TED recipes based on `--topics` are not working anymore #149

Closed: kelson42 closed this issue 9 months ago

kelson42 commented 9 months ago

See https://farm.openzim.org/recipes?category=ted

[ted2zim::2023-11-29 04:01:10,401] INFO:Starting scraper with:
  langs: en
  subtitles : all
  video format : webm
[ted2zim::2023-11-29 04:01:10,401] INFO:Testing S3 Optimization Cache credentials
[ted2zim::2023-11-29 04:01:11,732] INFO:Using cache: s3.us-west-1.wasabisys.com with bucket: org-kiwix-ted
[ted2zim::2023-11-29 04:01:11,732] DEBUG:Fetching video links for topic: Business
[ted2zim::2023-11-29 04:01:11,733] DEBUG:generate_search_result_and_scrape: https://ted.com/talks?topics%5B%5D=Business&language=en&page=1
[ted2zim::2023-11-29 04:01:13,359] DEBUG:0 video(s) found on current page
[ted2zim::2023-11-29 04:01:13,359] INFO:Total video links found in Business: 0
[ted2zim::2023-11-29 04:01:13,359] ERROR:FAILED. An error occurred: No videos found for any topic in the language(s) requested. Check topic(s) and/or language code supplied to --languages
[ted2zim::2023-11-29 04:01:13,359] ERROR:No videos found for any topic in the language(s) requested. Check topic(s) and/or language code supplied to --languages
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/ted2zim-2.0.13-py3.8.egg/ted2zim/entrypoint.py", line 190, in main
    scraper.run()
  File "/usr/local/lib/python3.8/site-packages/ted2zim-2.0.13-py3.8.egg/ted2zim/scraper.py", line 1058, in run
    self.remove_failed_topics_and_check_extraction(failed)
  File "/usr/local/lib/python3.8/site-packages/ted2zim-2.0.13-py3.8.egg/ted2zim/scraper.py", line 1027, in remove_failed_topics_and_check_extraction
    raise ValueError(
ValueError: No videos found for any topic in the language(s) requested. Check topic(s) and/or language code supplied to --languages

benoit74 commented 9 months ago

First obvious thing is that the filtering by language does not work anymore.

However, it seems to be linked to a UI change (as far as I remember, the UI was not like this the last time I visited the website), so I'm not sure the rest will work either.

benoit74 commented 9 months ago

I confirm that without supplying `--languages`, the scraper manages to retrieve the page but fails to parse it correctly.

A second (hidden) issue is that the topic page (e.g. https://www.ted.com/talks?sort=relevance&topics%5B0%5D=Design) no longer accepts a page parameter. One has to click on "Show more" to load more videos, which won't work with urllib / requests.
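For reference, here is a minimal sketch of how the listing URL seen in the log above is put together. It only reproduces the URL shape from the log line; `build_topic_url` and `BASE_URL` are hypothetical names used for illustration, not identifiers from the ted2zim codebase.

```python
from urllib.parse import urlencode

# Base URL as it appears in the scraper log; the helper below is hypothetical.
BASE_URL = "https://ted.com/talks"


def build_topic_url(topic, lang=None, page=None):
    """Build a topic listing URL in the shape shown in the scraper log."""
    params = [("topics[]", topic)]
    if lang is not None:
        params.append(("language", lang))
    if page is not None:
        params.append(("page", page))
    # urlencode percent-encodes the brackets: topics[] -> topics%5B%5D
    return f"{BASE_URL}?{urlencode(params)}"


# Same URL as in the log: language filter plus page number
print(build_topic_url("Business", lang="en", page=1))
# Without the language filter and without a page number
print(build_topic_url("Business"))
```

The first call gives the URL that currently returns 0 videos. The second drops both parameters, which matches the observation that the page can be retrieved without a language filter, but it can then only ever cover the first batch of talks.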

Are you aware of any new way to retrieve this list of videos filtered by topic?

It looks like we could plug directly into the underlying API used by the page, even if this is probably as fragile as parsing the HTML.

rgaudin commented 9 months ago

I believe the scraper uses both, because the internal API was introduced later and some information was easier to access from it. It has already changed in the past, though (hence the emphasis on internal).

Still appears to be a better strategy than the DOM. It's understood that those scrapers are fragile and as long as it doesn't change multiple times per day, it's an acceptable effort to adapt.

benoit74 commented 9 months ago

I did not find any reference to an internal API in the current codebase. Do you remember what it was used for? (I probably simply missed it.)

I only found scraping of the playlists or tasks page + using JSON found in every video page in a special <script> tag.
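To illustrate the second approach (JSON embedded in a `<script>` tag), here is a minimal sketch using only the standard library. The script id `__NEXT_DATA__`, the sample HTML and the JSON shape are assumptions made for the example; the names `ScriptJSONExtractor` and `extract_embedded_json` are hypothetical and not taken from the ted2zim code.

```python
import json
from html.parser import HTMLParser


class ScriptJSONExtractor(HTMLParser):
    """Collect the text of the first <script> tag whose id matches."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self.found = False
        self._capturing = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and not self.found:
            if dict(attrs).get("id") == self.script_id:
                self._capturing = True

    def handle_data(self, data):
        if self._capturing:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._capturing:
            self._capturing = False
            self.found = True

    def text(self):
        return "".join(self._chunks)


def extract_embedded_json(html, script_id="__NEXT_DATA__"):
    """Return the parsed JSON from the matching <script> tag.

    The default script_id is an assumption for this example.
    """
    parser = ScriptJSONExtractor(script_id)
    parser.feed(html)
    parser.close()
    if not parser.found:
        raise ValueError(f"no <script id={script_id!r}> tag found")
    return json.loads(parser.text())


# Made-up page content, only to show the mechanics
sample = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"title": "Example talk"}}'
    "</script></body></html>"
)
print(extract_embedded_json(sample)["props"]["title"])
```

Like the DOM parsing, this breaks as soon as the tag id or the JSON layout changes, so it is no less fragile; it just avoids depending on the visible markup.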

Note that we should probably not fix this until discussion on #150 has settled.

rgaudin commented 9 months ago

I only found scraping of the playlists or tasks page + using JSON found in every video page in a special <script> tag.