Closed by kelson42 9 months ago
The first obvious problem is that filtering by language no longer works.
However, it seems to be linked to a UI change (as far as I remember, the UI was not like this last time I visited the website), so I'm not sure the rest will work either.
I confirm that when --languages is not supplied, the scraper manages to retrieve the page but fails to parse it correctly.
A second (hidden) issue is that the topic page (e.g. https://www.ted.com/talks?sort=relevance&topics%5B0%5D=Design) no longer accepts a page parameter. One has to click on "Show more" to load more videos, which won't work with urllib / requests.
Are you aware of any new way to retrieve this list of videos filtered by topic?
It looks like we could plug directly into the underlying API used by the page, even if this is probably as fragile as parsing the HTML.
I believe the scraper uses both: the internal API was introduced later because some information was easier to access from it, but it has already changed in the past (hence the emphasis on internal).
It still appears to be a better strategy than parsing the DOM. It's understood that these scrapers are fragile, and as long as the API doesn't change multiple times per day, adapting to it is an acceptable effort.
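To make the "plug into the underlying API" idea concrete, here is a minimal sketch of how a "Show more" style listing could be walked without a page parameter. Everything about the API shape is an assumption: the `items` and `next_cursor` keys and the `fetch_page` callable are hypothetical placeholders, not the actual TED API, and would have to be replaced with whatever the page really calls.

```python
from typing import Callable, Iterator, Optional


def iter_talks(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 100,
) -> Iterator[dict]:
    """Yield talks from a cursor-paginated listing.

    `fetch_page(cursor)` is a hypothetical callable that returns one page of
    results as a dict with an "items" list and an optional "next_cursor".
    `max_pages` is a safety limit so that a misbehaving API cannot make the
    scraper loop forever.
    """
    cursor = None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        yield from page.get("items", [])
        cursor = page.get("next_cursor")
        if not cursor:
            return


# Stand-in for the real HTTP call, so the sketch runs without network access.
FAKE_PAGES = {
    None: {"items": [{"slug": "talk-a"}, {"slug": "talk-b"}], "next_cursor": "p2"},
    "p2": {"items": [{"slug": "talk-c"}], "next_cursor": None},
}


def fake_fetch(cursor: Optional[str]) -> dict:
    return FAKE_PAGES[cursor]


if __name__ == "__main__":
    for talk in iter_talks(fake_fetch):
        print(talk["slug"])
```

Keeping the HTTP call behind `fetch_page` means that if the internal API changes again, only that one function needs adapting.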
I did not find any reference to an internal API in the current codebase. Do you remember what it was used for? (I probably simply missed it.)
I only found scraping of the playlists or talks page, plus use of the JSON found in every video page in a special <script> tag.
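For reference, the "JSON in a special <script> tag" approach can be sketched as follows, using only the standard library. The script id `__NEXT_DATA__` and the sample payload are assumptions made for illustration; the actual id and JSON structure used on the video pages would need to be checked against a real page.

```python
import json
from html.parser import HTMLParser


class ScriptJSONExtractor(HTMLParser):
    """Collect the text content of the first <script> tag with a given id."""

    def __init__(self, script_id: str):
        super().__init__()
        self.script_id = script_id
        self._capturing = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and not self._done:
            if dict(attrs).get("id") == self.script_id:
                self._capturing = True

    def handle_data(self, data):
        if self._capturing:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._capturing:
            self._capturing = False
            self._done = True

    @property
    def payload(self) -> str:
        return "".join(self._chunks)


def extract_embedded_json(html: str, script_id: str = "__NEXT_DATA__") -> dict:
    """Return the JSON embedded in <script id=script_id>, or raise ValueError."""
    parser = ScriptJSONExtractor(script_id)
    parser.feed(html)
    parser.close()
    if not parser.payload.strip():
        raise ValueError(f"no <script id={script_id!r}> with content found")
    return json.loads(parser.payload)


# Hypothetical page content, not copied from a real video page.
SAMPLE_HTML = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"title": "Example talk"}}'
    "</script></body></html>"
)

if __name__ == "__main__":
    print(extract_embedded_json(SAMPLE_HTML)["props"]["title"])
```

Raising an explicit error when the tag is missing makes a site redesign show up as a clear failure rather than as silently empty metadata.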
Note that we should probably not fix this until discussion on #150 has settled.
See https://farm.openzim.org/recipes?category=ted