Better handle articles which are in Q&A quarantine

kelson42 commented 2 years ago

This ticket follows comment by @rgaudin at https://github.com/openzim/wikihow/issues/128#issuecomment-1135643216.

In a nutshell, the WikiHow API lists articles which are not available for us to scrape. This leads to an HTTP 404 error at scraping time. All of this leads to confusion and complex HTTP error handling.

Take the article Bhatura-zubereiten on the German WikiHow, which is listed by the API but not available to scrape. This article exists in the WikiHow DE database, so it is legitimately listed by the API call https://de.wikihow.com/api.php?action=query&generator=categorymembers&gcmtitle=Kategorie:Rezepte&gcmtype=page&prop=info&inprop=url&format=json&gcmcontinue=page%7C424553414e204c41444f4f205a55424552454954454e%7C67219.

There is indeed an HTTP 404 error with https://de.wikihow.com/Bhatura-zubereiten because the page is still in the Q&A process.

We should find a way to identify such articles and handle them properly, because this scenario is simply different from being given a wrong article name.

kelson42 commented 2 years ago

For the moment, the only ways I know of to identify such articles are:

Parse the HTML of https://de.wikihow.com/Bhatura-zubereiten and look for a known sentence, for example "Sorry, dieser Artikel wird zur Zeit von unserer Qualitätskontrolle überarbeitet" (roughly: "Sorry, this article is currently being revised by our quality control").
Parse the HTML of https://de.wikihow.com/Bhatura-zubereiten and check for an element with the class section_text pagepolicy, though I am not sure this class is not also displayed for other reasons.
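Both checks could be sketched roughly as follows, using only the Python standard library. This is a minimal sketch, not the scraper's actual code: the function names are hypothetical, and it assumes that section_text and pagepolicy appear together on a single element and that the sentence appears literally (not entity-encoded) in the page source.

```python
from html.parser import HTMLParser

# Sentence quoted above; it is specific to the German site.
QUARANTINE_SENTENCE_DE = (
    "dieser Artikel wird zur Zeit von unserer Qualitätskontrolle überarbeitet"
)
# Class names quoted above; assumed here to sit on the same element.
QUARANTINE_CLASSES = {"section_text", "pagepolicy"}


class _PolicyClassFinder(HTMLParser):
    """Records whether any start tag carries all of QUARANTINE_CLASSES."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = set((dict(attrs).get("class") or "").split())
        if QUARANTINE_CLASSES <= classes:
            self.found = True


def has_quarantine_sentence(html: str) -> bool:
    """Option 1: plain substring match on the (German-only) sentence."""
    return QUARANTINE_SENTENCE_DE in html


def has_quarantine_class(html: str) -> bool:
    """Option 2: look for an element carrying both class names."""
    finder = _PolicyClassFinder()
    finder.feed(html)
    return finder.found
```

The class-based check does not depend on the wording of the message, but as noted it may also match pages that show this class for other reasons.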

kelson42 commented 2 years ago

@rgaudin Would this be an appropriate workflow?

Try to scrape
If 404, check whether the article is in Q&A
If it is, just ignore it and continue; otherwise increment the 404 counter
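A rough sketch of that loop, assuming the HTTP fetch and the quarantine check are passed in as callables. All names here (ScrapeStats, scrape_all, fetch, is_quarantined) are hypothetical and not taken from the scraper; fetch is assumed to return a (status code, body) pair, and status codes other than 200 and 404 are left out of scope.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class ScrapeStats:
    scraped: int = 0
    quarantined: int = 0  # 404 pages recognised as being in Q&A
    missing: int = 0      # the "404 counter" for genuinely missing pages


def scrape_all(
    urls: Iterable[str],
    fetch: Callable[[str], Tuple[int, str]],
    is_quarantined: Callable[[str], bool],
) -> ScrapeStats:
    """Try each URL; on 404, tell Q&A pages apart from real misses."""
    stats = ScrapeStats()
    for url in urls:
        status, body = fetch(url)
        if status == 200:
            stats.scraped += 1  # a real scraper would process body here
        elif status == 404:
            if is_quarantined(body):
                stats.quarantined += 1  # ignore and continue
            else:
                stats.missing += 1
        else:
            # Other statuses are not covered by the proposed workflow.
            raise RuntimeError(f"unexpected HTTP status {status} for {url}")
    return stats
```

Keeping a separate count for quarantined articles is an addition to the proposal; it only makes the ignored pages visible in the final report.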

rgaudin commented 2 years ago

Ah! Thanks for looking into this. Indeed, using Google Translate, I now understand what's going on. I agree with the way forward.

rgaudin commented 1 year ago

We'll have to make sure it fixes it properly for all languages though…

openzim / wikihow

Better handle articles which are in Q&A quarantine #130