Closed fititnt closed 1 year ago
To be clear, we already have daily backups of the entire wiki, so this is really about public exports, not backups.
TBH, I am not certain there is a significant benefit to publishing Wikibase data in a separate format (i.e. the dump of Wikibase-namespaced pages). The current state can be easily retrieved via the standard MW API, and it will be easier to parse (instead of JSON wrapped in XML), and it is easier to keep up to date locally via the same API.
The current state can be easily retrieved via the standard MW API,
Which URL provides this?
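For reference, one possible way to do this is the Wikibase `wbgetentities` module of the MediaWiki Action API. The sketch below only builds the request URLs. It assumes the wiki serves the Action API at `/w/api.php` with that module enabled, and that at most 50 IDs are accepted per request for ordinary clients; both should be checked against the live API help before relying on them.

```python
from urllib.parse import urlencode

# Assumption: the wiki serves the MediaWiki Action API at this path.
API_URL = "https://wiki.openstreetmap.org/w/api.php"


def chunks(ids, size=50):
    """Split a list of entity IDs into batches of at most `size` items."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def entities_url(ids):
    """Build a wbgetentities request URL for one batch of entity IDs."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(ids),  # the API takes pipe-separated IDs
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Illustrative ID range only; real item IDs may have gaps.
    ids = [f"Q{n}" for n in range(2, 122)]
    for batch in chunks(ids):
        print(entities_url(batch))
```

Batching cuts the number of requests compared with one request per item, but the result is still JSON rather than RDF.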
The current state can be easily retrieved via the standard MW API, and it will be easier to parse (instead of JSON wrapped in XML) (...)
I believe we should go with what https://www.wikidata.org/wiki/Wikidata:Database_download#RDF_dumps (examples here and here) describes as "First, canonical RDF dumps using the Turtle and NTriples formats (...)", and not with the XML dumps, for which the same page gives this warning: "Warning: The format of the JSON data embedded in the XML dumps is subject to change without notice, and may be inconsistent between revisions. It should be treated as opaque binary data."
The problem is not just that "JSON wrapped in XML" is a sort of custom format; the official documentation explicitly recommends against relying on it. Even the idea of running a dedicated Wikibase just to read this type of dump seems scary when the documentation also says the format could be inconsistent between revisions, because that instance would really need to stay on the same version as the OpenStreetMap wiki.
In this context, either of the W3C standard formats, .nt or .ttl, is still a better option for an RDF dump than the non-canonical alternatives.
Daily wikibase dump now available here: https://wiki.openstreetmap.org/dump/
Done via: https://github.com/openstreetmap/chef/commit/ea15058604d7533aadc8f9381ac497b1e10897dd
nice!
I'm coming here from Talk:Wiki and, as requested there, I'm pinging @nyurik. However, the existing documentation likely already works; it seems this just has not been asked for before.
This is just another type of dump, and I'm assuming the script for it is already installed. So the process is similar to how dumps for the entire wiki are done today, except that extensions/Wikibase/repo/maintenance/dumpRdf.php will generate a very small dump file in RDF format. It does not include the history, so it is small and ready to be imported into other tools directly.
Current alternatives
Download item by item
One alternative would be to loop from https://wiki.openstreetmap.org/wiki/Special:EntityData/Q2.ttl to https://wiki.openstreetmap.org/wiki/Special:EntityData/Q20000.ttl (plus the Ps), i.e. ~20,000 requests. I wrote a script for this at https://wiki.openstreetmap.org/wiki/User:EmericusPetro/sandbox/Poor_mans_OpenStreetMap_Data_Items_dumper and set a user agent to hint at who is crawling the data, but with a 5-second delay it takes over 27 hours to download the items (merging them back is quick). So it is not only slow: if items change between the start and the end of the crawl and end up contradicting each other (not through human error, just through the bad luck of people fixing things during a window of more than a day), the resulting dump is inconsistent, which a quick server-side dump would avoid.
If there are better alternatives for export, I welcome them, but I have no idea how else to download the OpenStreetMap data items (which are sort of a TBox for OpenStreetMap data, without actually being the data itself).
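The item-by-item approach can be sketched roughly as follows. This is not the script linked above, only a minimal illustration under stated assumptions: the User-Agent string is a placeholder, and missing items are assumed to answer with HTTP 404. The `estimated_hours` helper reproduces the arithmetic behind the 27-hour figure (20,000 requests at 5 seconds each comes to about 27.8 hours of waiting alone).

```python
import time
import urllib.error
import urllib.request

BASE = "https://wiki.openstreetmap.org/wiki/Special:EntityData"
# Hypothetical User-Agent; replace with real contact details before crawling.
USER_AGENT = "data-items-dumper-sketch/0.1 (contact: example@example.org)"


def entity_url(entity_id, fmt="ttl"):
    """URL of a single entity in the given RDF serialization."""
    return f"{BASE}/{entity_id}.{fmt}"


def estimated_hours(n_requests, delay_seconds):
    """Lower bound on crawl time: only the pause between requests is counted."""
    return n_requests * delay_seconds / 3600


def fetch(entity_id):
    """Fetch one entity as Turtle; return None if it is missing.

    Assumption: a missing entity is reported as HTTP 404.
    """
    req = urllib.request.Request(
        entity_url(entity_id), headers={"User-Agent": USER_AGENT}
    )
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def crawl(first, last, delay_seconds=5, prefix="Q"):
    """Yield (entity_id, turtle_bytes) pairs, pausing between requests."""
    for n in range(first, last + 1):
        entity_id = f"{prefix}{n}"
        data = fetch(entity_id)
        if data is not None:
            yield entity_id, data
        time.sleep(delay_seconds)


if __name__ == "__main__":
    print(f"Estimated minimum duration: {estimated_hours(20000, 5):.1f} hours")
```

Because each item is fetched at a different moment, nothing in this loop guarantees that the collected files describe one consistent state of the wiki.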
Relationship with "Remove Wikibase extension from all OSM wikis" #764
It was commented that #764 could be a reason not to enable this. However, a priori, the dumpRdf.php backup script is already installed, just not enabled, which means it might not be complicated to add it to the chef automation. So:
In theory, the wiki backups (the 5.0G dump.xml.gz file at https://wiki.openstreetmap.org/dump/) have the full history, so the Wikibase dump for Data Items could be recreated from them (as long as the database tables for the extension are still there).
But even in this case, dumpRdf.php makes life much easier if it is run on the server side.
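As a rough illustration of what the server-side step might look like, the sketch below only assembles the command line for dumpRdf.php. It is not the command used in the chef commit mentioned in this thread. The `--format` and `--output` options are the ones the script is understood to accept (confirm with `--help` on the installed Wikibase version), and the output path is hypothetical.

```python
import shlex

# Script path as quoted in this thread, relative to the MediaWiki root.
DUMP_SCRIPT = "extensions/Wikibase/repo/maintenance/dumpRdf.php"


def dump_rdf_command(output_path, rdf_format="nt"):
    """Return the argument list for a server-side RDF dump.

    Assumption: dumpRdf.php accepts --format and --output as used here.
    """
    return ["php", DUMP_SCRIPT, "--format", rdf_format, "--output", output_path]


if __name__ == "__main__":
    # Hypothetical output path; a real deployment decides where dumps live.
    cmd = dump_rdf_command("/tmp/wikibase-rdf.nt")
    print(shlex.join(cmd))
    # To actually run it from the MediaWiki root, something like:
    # subprocess.run(cmd, check=True)
```

A single run like this reads the items in one pass on the server, which is what avoids the day-long window of the item-by-item crawl.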