organicmaps / wikiparser

Wikipedia parser that generates offline content embeddable into Organic Maps map mwm files
GNU Affero General Public License v3.0

Investigate articles without QID #24

Open newsch opened 11 months ago

newsch commented 11 months ago

The schema for the Wikipedia Enterprise dumps lists the QID field (main_entity) as optional.

All articles should have a QID, but apparently there are cases where they don't.

It's not just articles so minor that they lack a Wikidata item. For example, here is a sample of errors from the 20230801 dump:

[2023-08-04T17:58:48Z INFO  om_wikiparser] Page without wikidata qid: "Wiriadinata Airport" (https://en.wikipedia.org/wiki/Wiriadinata_Airport)
[2023-08-04T17:59:11Z INFO  om_wikiparser] Page without wikidata qid: "Uptown (Brisbane)" (https://en.wikipedia.org/wiki/Uptown_(Brisbane))

Both articles were edited on 2023-07-31, around when the dump was created.

Is this the main cause of these cases, or is there something else?

Is there some data we can preserve across dumps to prevent this, like keeping old qid links if there is no current one?
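One way the "keep old QID links" idea could look, as a rough sketch rather than anything in the parser today: remember the last QID seen for each title and fall back to it when the current dump has none. `KnownQids` and `resolve_qid` are hypothetical names, and keying the cache by title is an assumption; an article renamed between dumps would miss the cache.

```rust
use std::collections::HashMap;

/// Hypothetical cache carried across dumps: article title -> last QID seen.
pub type KnownQids = HashMap<String, String>;

/// Prefer the QID from the current dump and remember it for later runs.
/// If the current dump has none, fall back to the remembered QID, if any.
pub fn resolve_qid(known: &mut KnownQids, title: &str, current: Option<&str>) -> Option<String> {
    match current {
        Some(qid) => {
            // Refresh the cache so a later dump without a QID can reuse this one.
            known.insert(title.to_owned(), qid.to_owned());
            Some(qid.to_owned())
        }
        // No QID in this dump: use whatever an earlier dump told us.
        None => known.get(title).cloned(),
    }
}
```

The cache would have to be persisted between runs (for example as a small TSV next to the output), and a stale QID is only a guess: it should probably be logged when used.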

newsch commented 10 months ago

Some more examples found while checking simplification:

[2023-08-11T15:15:52Z INFO  om_wikiparser::get_articles] Page without wikidata qid: "Springfield railway station (Scotland)" (https://en.wikipedia.org/wiki/Springfield_railway_station_(Scotland))
[2023-08-11T15:15:56Z INFO  om_wikiparser::get_articles] Page without wikidata qid: "Estevan Point" (https://en.wikipedia.org/wiki/Estevan_Point)
[2023-08-11T15:16:28Z INFO  om_wikiparser::get_articles] Page without wikidata qid: "Paredes Viejas Airport" (https://en.wikipedia.org/wiki/Paredes_Viejas_Airport)
[2023-08-11T15:16:32Z INFO  om_wikiparser::get_articles] Page without wikidata qid: "Magellan's Cross" (https://en.wikipedia.org/wiki/Magellan%27s_Cross)

The "Springfield railway station (Scotland)" article was renamed on 2023-03-29; the content is the correct article HTML.

The Paredes Viejas Airport article was matched by "Marchigüe Paredes Viejas Airport", listed as a redirect in the 2023-04-01 dump. On 2023-03-24, the article was renamed from "Marchigüe Paredes Viejas Airport" to "Paredes Viejas Airport", and the corresponding wikidata item was updated. The article html was still relevant.

The "Magellan's Cross" and "Estevan Point" cases are different. Neither article was renamed around the time the dump was created, and the HTML in both is only the redirect page, not the main article content.

estevan_point.json.txt magellans_cross.json.txt

biodranik commented 10 months ago

Maybe we could report this issue to the people at Wikipedia, or tag one of them here?

Vuizur commented 2 months ago

Maybe relevant: A user on Wikipedia/Wiktionary has been trying to get Wikimedia to fix errors with the enterprise dumps (such as quite a lot of missing pages) for two years now or so: https://phabricator.wikimedia.org/p/jberkel/

It's still broken now... (On Wiktionary they even considered that the best way forward might be scraping all pages and had a decent proof of concept, but with their slight rate limiting it took more than 2 days IIRC.)

biodranik commented 2 months ago

@newsch what do our most recent logs show? Are our errors related to that issue?

newsch commented 1 month ago

The logs won't report this. I disregarded the missing pages issue initially, since the existing articles are left on the disk. The errors we log are from articles that aren't simplified, around 144 in the last run.

We can't handle outdated articles right now. The idea I had was to sync the article update time with the file metadata, and skip writing if the incoming article is older.
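A minimal sketch of that idea, assuming the article's modification time has already been parsed from the dump into a `SystemTime` (parsing is left out here). `should_write` and `write_if_newer` are hypothetical names, not functions from the parser, and `File::set_modified` needs Rust 1.75 or later.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;
use std::time::SystemTime;

/// Decide whether an incoming article should replace the file on disk.
/// `existing` is the mtime of the file on disk, or `None` if there is no file.
pub fn should_write(existing: Option<SystemTime>, incoming: SystemTime) -> bool {
    match existing {
        // Only overwrite when the incoming article is strictly newer.
        Some(on_disk) => incoming > on_disk,
        None => true,
    }
}

/// Write `html` to `path` unless the file on disk is as new or newer, then
/// stamp the file's mtime with the article's own modification time.
/// Returns `Ok(true)` if the file was written.
pub fn write_if_newer(path: &Path, html: &str, modified: SystemTime) -> io::Result<bool> {
    let existing = match fs::metadata(path) {
        Ok(meta) => Some(meta.modified()?),
        Err(e) if e.kind() == io::ErrorKind::NotFound => None,
        Err(e) => return Err(e),
    };
    if !should_write(existing, modified) {
        return Ok(false);
    }
    let mut file = File::create(path)?;
    file.write_all(html.as_bytes())?;
    // Record the article's update time so the next run can compare against it.
    file.set_modified(modified)?;
    Ok(true)
}
```

This relies on nothing else touching the output files' mtimes between runs; copying the output directory without preserving timestamps would defeat the check.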

As for the duplicates, we have the <head> element heuristic right now, but I don't think that catches everything. I need to do a run with debug logs to figure out what else we can do.