openzim / librechef

Sushi Chef script for importing sushi-chef-libretext content
MIT License

Do not fetch the same link twice in a given topic #32

Closed benoit74 closed 3 weeks ago

benoit74 commented 1 month ago

When scraping a topic like https://geo.libretexts.org/Courses/California_State_University_Los_Angeles, we try to find all links that have to be explored ("Book: An Introduction to Geology (Johnson, Affolter, Inkenbrandt, and Mosher)", "Front Matter", ...).

We currently encounter duplicates in the list of links generated as "to be processed".

The problem is that we find the links to explore by searching for <a> tags (see https://github.com/openzim/librechef/blob/fe4631657e54dad1ed28f00c278a68676f58946c/sushichef.py#L175-L183), and each entry has two <a> tags for the same link: one with the real title and one with the image.
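
One possible way to avoid the duplicates is to remember which hrefs were already collected and skip repeats. The sketch below is only an illustration using the Python standard library; the real code in sushichef.py may use a different parser, and the sample markup and the names `LinkCollector` and `unique_links` are assumptions, not taken from the repository.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect each distinct href once, in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.seen = set()   # hrefs already collected
        self.links = []     # distinct hrefs, first occurrence first

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without an href and hrefs we have already recorded.
        if href and href not in self.seen:
            self.seen.add(href)
            self.links.append(href)


def unique_links(html):
    """Return the hrefs of all <a> tags in `html`, without duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Assumed markup: the same link appears once around the image
# and once around the title, as described above.
sample = """
<dl>
  <dt><a href="/Bookshelves/Geology/Book_A"><img src="cover.png"></a></dt>
  <dd><a href="/Bookshelves/Geology/Book_A">Book: An Introduction to Geology</a></dd>
  <dd><a href="/Bookshelves/Geology/Front_Matter">Front Matter</a></dd>
</dl>
"""

print(unique_links(sample))
```

Keeping the first occurrence preserves the page order of the links. If the title is needed too, the same idea would apply, but the image-only anchor has no text, so the title would have to be taken from whichever anchor actually contains one.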