openzim / wikihow

WikiHow scraper
https://download.kiwix.org/zim/wikihow/
GNU General Public License v3.0

Better handle articles which are in Q&A quarantine #130

Closed by kelson42 1 year ago

kelson42 commented 2 years ago

This ticket follows up on the comment by @rgaudin at https://github.com/openzim/wikihow/issues/128#issuecomment-1135643216.

In a nutshell, the WikiHow API lists articles which are not available for us to scrape. This leads to HTTP 404 errors at scraping time. Together, this causes confusion and complex HTTP error handling.

Take the article Bhatura-zubereiten on the German WikiHow, which is listed by the API but not available to scrape. This article exists in the WikiHow DE database, so it is legitimately returned by the API call https://de.wikihow.com/api.php?action=query&generator=categorymembers&gcmtitle=Kategorie:Rezepte&gcmtype=page&prop=info&inprop=url&format=json&gcmcontinue=page%7C424553414e204c41444f4f205a55424552454954454e%7C67219.

There is indeed an HTTP 404 error with https://de.wikihow.com/Bhatura-zubereiten because the page is still in the Q&A process.

We should find a way to identify such articles and handle them properly, because this scenario is simply different from being given a wrong article name.
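
To make the mismatch concrete, here is a minimal Python sketch (not the scraper's actual code) that takes the JSON returned by a `generator=categorymembers&prop=info&inprop=url` query and separates listed pages that answer with HTTP 404 from the rest. The names `http_status` and `split_listed_pages` are hypothetical, and it assumes the site answers `HEAD` requests the same way as `GET`, which has not been verified for WikiHow.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List, Tuple


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a HEAD request to `url` (hypothetical helper)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx responses; the code is carried by the exception.
        return exc.code


def split_listed_pages(
    api_response: Dict,
    get_status: Callable[[str], int] = http_status,
) -> Tuple[List[str], List[str]]:
    """Split API-listed page URLs into (reachable, not_found).

    `api_response` is the decoded JSON of a MediaWiki query using
    prop=info&inprop=url, where each page entry carries a `fullurl` field.
    """
    pages = api_response.get("query", {}).get("pages", {})
    reachable: List[str] = []
    not_found: List[str] = []
    for page in pages.values():
        url = page["fullurl"]
        if get_status(url) == 404:
            # Listed by the API, yet not served: the case this ticket is about.
            not_found.append(url)
        else:
            reachable.append(url)
    return reachable, not_found
```

Passing `get_status` as a parameter keeps the check testable offline. A 404 alone still cannot tell a quarantined article apart from a wrong article name, which is why a more specific signal is needed.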

kelson42 commented 2 years ago

For the moment, the only ways to identify such articles (that I know of) are:

kelson42 commented 2 years ago

@rgaudin Would that be an appropriate workflow:

rgaudin commented 2 years ago

Ah! Thanks for looking into this. Indeed, using Google Translate, I now understand what's going on. I agree with the way forward.

rgaudin commented 1 year ago

We'll have to make sure it fixes it properly for all languages though…