rhgarcia / tropescraper

A tropes scraper
GNU Lesser General Public License v3.0
30 stars, 10 forks

Inexistent pages create an infinite loop #22

Open JJ opened 3 years ago

JJ commented 3 years ago

For instance, this one:

INFO:tropescraper.adaptors.file_cache:Cache miss for https://tvtropes.org/pmwiki/pmwiki.php/Main/FanfomVIP
DEBUG:tropescraper.adaptors.web_page_retriever:Retrieve URL from TVTropes: https://tvtropes.org/pmwiki/pmwiki.php/Main/FanfomVIP

The scraper tries to retrieve the page, finds it isn't there, and comes back empty, but the page never gets marked as non-existent. Something like that; I'm not sure.
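If the cause really is that empty results are never recorded, one possible fix is a negative cache: remember every URL that came back empty and refuse to fetch it again. The sketch below only illustrates that idea. `NegativeCache` and the `fetch` callable are hypothetical names, not part of tropescraper's actual API, and it assumes the retriever returns an empty string or `None` for a missing page.

```python
from typing import Callable, Optional, Set


class NegativeCache:
    """Remembers URLs that came back empty so each is fetched at most once."""

    def __init__(self) -> None:
        self._missing: Set[str] = set()

    def retrieve(self, url: str, fetch: Callable[[str], Optional[str]]) -> Optional[str]:
        # A URL already known to be empty is skipped instead of downloaded again.
        if url in self._missing:
            return None
        content = fetch(url)
        if not content or not content.strip():
            # Record the miss so the next lookup does not trigger a new request.
            self._missing.add(url)
            return None
        return content

    def is_missing(self, url: str) -> bool:
        return url in self._missing
```

With something like this in place, a second request for the same empty URL returns immediately, so a page such as the one in the log above could not keep the crawl going forever.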

JJ commented 3 years ago

The problem is that it's not easy to tell which pages are bogus or non-existent, because the site returns an actual page for them. So it's not clear whether the scraper is looping or simply leaving these for last.

JJ commented 3 years ago

I no longer think this is actually the case. It is true, though, that there's no easy way to check whether a page is an actual trope or an error page, since the site does not return a 404, as far as I can tell.
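Since the status code cannot be relied on, one option is a content-based heuristic for this kind of "soft 404". The following is a rough sketch, not a tested solution: `looks_like_missing_page` is a hypothetical helper, and the marker strings are an assumption that would have to be copied from whatever text TVTropes actually shows on a missing page.

```python
from typing import Iterable


def looks_like_missing_page(status_code: int, html: str, markers: Iterable[str]) -> bool:
    """Heuristic check for an error page that is served with a success status."""
    # A real 404 is still the clearest signal, if the site ever sends one.
    if status_code == 404:
        return True
    # An empty body carries no trope content either.
    if not html.strip():
        return True
    # Otherwise look for caller-supplied phrases that only appear on error pages.
    lowered = html.lower()
    return any(marker and marker.lower() in lowered for marker in markers)
```

A page flagged this way could then be recorded as non-existent instead of being treated as a trope. The heuristic is only as good as the markers: a phrase that also appears on genuine trope pages would cause false positives.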