Closed (kelson42 closed this 1 year ago)
For the moment, the only ways I know of to identify such articles are:
section_text pagepolicy
but I am not sure whether this class is also displayed for other reasons.
@rgaudin Would that be an appropriate workflow:
Ah! Thanks for looking into this. Indeed, using Google Translate, I now understand what's going on. I agree with the way forward.
We'll have to make sure it fixes it properly for all languages though…
This ticket follows a comment by @rgaudin at https://github.com/openzim/wikihow/issues/128#issuecomment-1135643216.
In a nutshell, the WikiHow API lists articles which are not available for us to scrape. This leads to HTTP 404 errors at scraping time. All of this leads to confusion and complex HTTP error handling.
Based on the article Bhatura-zubereiten from the German WikiHow, which is listed by the API but not available to scrape, I can say that this article exists in the WikiHow DE database and as such is legitimately listed by the API call https://de.wikihow.com/api.php?action=query&generator=categorymembers&gcmtitle=Kategorie:Rezepte&gcmtype=page&prop=info&inprop=url&format=json&gcmcontinue=page%7C424553414e204c41444f4f205a55424552454954454e%7C67219. There is indeed an HTTP 404 error with https://de.wikihow.com/Bhatura-zubereiten because the page is still in the Q&A process.
We should find a way to identify such articles and handle them properly, because this scenario is simply different from being given a wrong article name.
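To make the distinction concrete, here is a minimal sketch of how a scraper might separate the two kinds of 404. It assumes that the 404 page of a not-yet-published article contains an element carrying both the `section_text` and `pagepolicy` classes mentioned in this thread; that has not been verified for all languages, and the class may appear for other reasons. The names `classify_response` and `_PolicyFinder` are hypothetical and not part of the scraper.

```python
from html.parser import HTMLParser

# Assumption: these class names come from the discussion in this thread and
# have not been verified against every WikiHow language edition.
POLICY_CLASSES = {"section_text", "pagepolicy"}


class _PolicyFinder(HTMLParser):
    """Records whether any tag carries all of POLICY_CLASSES."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # The class attribute may be absent or valueless (None).
        classes = set((dict(attrs).get("class") or "").split())
        if POLICY_CLASSES <= classes:
            self.found = True


def classify_response(status_code: int, html: str) -> str:
    """Return 'ok', 'unavailable', 'missing' or 'error' (hypothetical labels)."""
    if status_code == 200:
        return "ok"
    if status_code == 404:
        finder = _PolicyFinder()
        finder.feed(html)
        # 404 with the policy marker: article exists but is not published yet.
        # 404 without it: most likely a wrong article name.
        return "unavailable" if finder.found else "missing"
    return "error"
```

With something like this, an "unavailable" article could be skipped quietly, while a "missing" one could still be reported as a real problem.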