Closed benoit74 closed 3 weeks ago
When scraping a topic like https://geo.libretexts.org/Courses/California_State_University_Los_Angeles, we try to find all links that have to be explored ("Book: An Introduction to Geology (Johnson, Affolter, Inkenbrandt, and Mosher)", "Front Matter", ...).
We currently encounter duplicates in the list of links generated as "to be processed".
The problem is that we find links to explore by searching for `<a>` tags (see https://github.com/openzim/librechef/blob/fe4631657e54dad1ed28f00c278a68676f58946c/sushichef.py#L175-L183), and the page contains two `<a>` tags for the same target: one with the real title and one with the image.
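One possible way to avoid the duplicates is to key the collected links by their resolved URL, so the image link and the title link collapse into a single entry. The sketch below is only an illustration, not the code in `sushichef.py`: it uses the standard-library `html.parser` rather than whatever parser the project relies on, and the names `LinkCollector` and `links_to_explore` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect one (url, title) entry per distinct <a href> target.

    Hypothetical helper for illustration; not part of librechef.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = {}        # resolved url -> title, in first-seen order
        self._current = None   # url of the <a> tag currently open, if any
        self._text = []        # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            title = "".join(self._text).strip()
            # Keep the first non-empty title: the image-only link has no
            # text, so the link carrying the real title fills it in.
            if not self.links.get(self._current):
                self.links[self._current] = title
            self._current = None


def links_to_explore(html, base_url):
    """Return deduplicated (url, title) pairs found in the HTML."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return list(parser.links.items())
```

With this approach the order of the two `<a>` tags does not matter: whichever one carries text supplies the title, and the URL appears only once in the "to be processed" list.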