openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Publish data dumps of Wikibase data from the OSM Wiki #779

Closed: fititnt closed this issue 1 year ago

fititnt commented 1 year ago

TL;DR: The proposal here is to enable the dump described at https://wiki.openstreetmap.org/wiki/Data_items/Technical_Site_Configuration#TODO, in addition to today's https://wiki.openstreetmap.org/wiki/Wiki#Wiki_Dumps_/_Export. The expected compressed dump is likely to stay around 8 to 20 megabytes (~50 Ps + ~20,000 Qs) and should take less than a minute if done from the server side.


I'm coming here from Talk:Wiki and, as requested there, I'm pinging @nyurik. Most likely what the documentation describes already works; it simply has not been asked for before.

This is just another type of dump, using tooling which I'm assuming is already installed. So the process is similar to how the full-wiki dumps are done today, except that extensions/Wikibase/repo/maintenance/dumpRdf.php will generate a very small dump file in RDF format. It does not include the history, so it is small and ready to be imported directly into other tools.
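For illustration, a minimal sketch of what the server-side invocation might look like. The install path is hypothetical, and the exact flags should be checked against the installed Wikibase version (run the script with --help) before relying on them:

    # Minimal sketch, run on the wiki host from the MediaWiki install directory.
    # Assumes dumpRdf.php writes Turtle to stdout when given --format ttl.
    cd /srv/wiki.openstreetmap.org   # hypothetical install path
    php extensions/Wikibase/repo/maintenance/dumpRdf.php --format ttl \
        | gzip -9 > /tmp/wikibase-rdf.ttl.gz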

Current alternatives

Download item by item

One alternative is to download item by item, looping from https://wiki.openstreetmap.org/wiki/Special:EntityData/Q2.ttl to https://wiki.openstreetmap.org/wiki/Special:EntityData/Q20000.ttl (plus the Ps), i.e. ~20,000 requests. I wrote a script for this, https://wiki.openstreetmap.org/wiki/User:EmericusPetro/sandbox/Poor_mans_OpenStreetMap_Data_Items_dumper, and set a user agent to give a hint about who is crawling the data, but with a 5-second delay it takes over 27 hours to download the items (the merge back afterwards is quick). This is not only slow: if some items change between the start and the end of the run and end up contradicting each other (beyond mere human error, simply people fixing things during a window of more than a day), this kind of dump becomes inconsistent. So it is not just slow, it also has a chance of producing inconsistent data compared to a quick server-side dump.

For anyone who wants to check an example download (it contains all translations, etc.): https://gist.github.com/fititnt/b1c8962f21d60433c2ca857f912d2fa8/archive/main.zip . One major difference between these files and an automated server-side dump is that this one is pretty-printed Turtle (less disk space, sorted data, something editable by hand). P.ttl is 320 KB and Q.ttl is 49.0 MB; dump.log.tsv just lists the items that were deleted/removed.
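For reference, the item-by-item fallback boils down to a loop like the sketch below. This is a simplification, not the actual sandbox script; the user-agent contact and file layout are placeholders:

    # ~20,000 polite requests with a 5-second delay is where the 27+ hours come from.
    UA="osm-dataitems-dumper/0.1 (placeholder contact page)"
    mkdir -p items
    for i in $(seq 2 20000); do
      curl -sfL -A "$UA" \
        "https://wiki.openstreetmap.org/wiki/Special:EntityData/Q${i}.ttl" \
        > "items/Q${i}.ttl" || printf 'Q%s\tmissing\n' "$i" >> dump.log.tsv
      sleep 5
    done
    # ...plus a similar loop for the ~50 P items, and a final merge step.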

If there are better alternatives for export, I would welcome them, but I have no other idea how to download the OpenStreetMap data items (which are sort of a TBox for the OpenStreetMap data, without actually being the data itself).

Relationship with "Remove Wikibase extension from all OSM wikis" (#764)

It was commented that #764 could be a reason not to enable this. However, a priori, the dumpRdf.php script is already installed, just not enabled, which means it might not be complicated to add it to the chef automation. So:

In theory, the wiki backups (the 5.0 GB dump.xml.gz file at https://wiki.openstreetmap.org/dump/) have the full history, so they could be used to recreate the Wikibase dump for the Data Items (as long as the database for the extension is still there). But even in that case, dumpRdf.php makes life much easier if run on the server side.

tomhughes commented 1 year ago

To be clear, we already have daily backups of the entire wiki, so this is really about public exports, not backups.

nyurik commented 1 year ago

TBH, I am not certain there is a significant benefit to publishing the Wikibase data in a separate format (i.e. a dump of the Wikibase-namespaced pages). The current state can be easily retrieved via the standard MW API, it will be easier to parse (instead of JSON wrapped in XML), and it is easier to keep up to date locally via the same API.

fititnt commented 1 year ago

The current state can be easily retrieved via the standard MW API,

Which URL provides this?

The current state can be easily retrieved via the standard MW API, and it will be easier to parse (instead of JSON wrapped in XML) (...)

I believe we should go with what https://www.wikidata.org/wiki/Wikidata:Database_download#RDF_dumps (examples here and here) describes as "First, canonical RDF dumps using the Turtle and NTriples formats (...)", and not the format that page warns about: "Warning: The format of the JSON data embedded in the XML dumps is subject to change without notice, and may be inconsistent between revisions. It should be treated as opaque binary data."

The problem is not even that the "JSON wrapped in XML" is a sort of custom format, but that the official documentation explicitly recommends against it. Even the idea of running a dedicated Wikibase just to read this type of dump seems scary when the documentation also says the format can be inconsistent between revisions, because it would really need to stay on the same version as the OpenStreetMap wiki.

In this context, the W3C standard formats .nt or .ttl are still a better option for an RDF dump than the non-canonical alternatives.
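As a concrete example of why the canonical formats are convenient: a .ttl dump can be validated, counted, or converted with off-the-shelf RDF tooling, with no MediaWiki-specific parsing involved. A small sketch, assuming the raptor2-utils rapper command is available and Q.ttl is the file from the gist above:

    # Parse the Turtle dump and count the triples (also catches syntax errors).
    rapper -i turtle -c Q.ttl
    # Convert to N-Triples for line-oriented tools (grep, sort, diff, ...).
    rapper -i turtle -o ntriples Q.ttl > Q.nt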

Firefishy commented 1 year ago

Daily wikibase dump now available here: https://wiki.openstreetmap.org/dump/

Done via: https://github.com/openstreetmap/chef/commit/ea15058604d7533aadc8f9381ac497b1e10897dd
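For anyone consuming the new dump, fetching it is a one-liner. The filename below is an assumption; check the index at https://wiki.openstreetmap.org/dump/ for the actual name:

    # Filename is assumed; list https://wiki.openstreetmap.org/dump/ to confirm it.
    curl -fLO https://wiki.openstreetmap.org/dump/wikibase-rdf.ttl.gz
    gunzip -k wikibase-rdf.ttl.gz   # keep the .gz alongside the extracted .ttl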

nyurik commented 1 year ago

nice!