Virtuoso wikidata import performance - virtuoso wikidata endpoints as part of snapquery wikidata mirror network

openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform

https://vos.openlinksw.com


Virtuoso wikidata import performance - virtuoso wikidata endpoints as part of snapquery wikidata mirror network #1326

Open WolfgangFahl opened 20 hours ago

WolfgangFahl commented 20 hours ago

@TallTed

Tim Holzheim has successfully imported Wikidata into a Virtuoso instance; see https://cr.bitplan.com/index.php/Wikidata_import_2024-10-28_Virtuoso and https://wiki.bitplan.com/index.php/Wikidata_import_2024-10-28_Virtuoso for the documentation. The endpoint is available at https://virtuoso.wikidata.dbis.rwth-aachen.de/sparql/ and we would love to integrate this and other Virtuoso endpoints into our snapquery (https://github.com/WolfgangFahl/snapquery) infrastructure.

Ted suggested that I open a ticket to get the discussion going on how Virtuoso endpoints could become part of the snapquery Wikidata mirror infrastructure. The idea is to use named parameterized queries that hide the details of the endpoints, so that it does not matter whether you use Blazegraph, QLever, Jena, Virtuoso, Stardog, ... you name it. Queries should just work as specified and be proactively monitored for non-functional aspects.
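To make the "named parameterized queries" idea concrete, here is a minimal sketch in Python. It is not snapquery's actual API: the registry names (`ENDPOINTS`, `NAMED_QUERIES`), the query name `label_of`, and the helper `build_request` are hypothetical and only illustrate how a caller could pick a query by name and an endpoint by name without knowing which engine sits behind the URL. The endpoint URL is the one given above; the request uses only the standard SPARQL Protocol `query` parameter and an `Accept` header.

```python
import re
from string import Template
from urllib.parse import urlencode

# Hypothetical endpoint registry: logical name -> SPARQL endpoint URL.
ENDPOINTS = {
    "virtuoso-rwth": "https://virtuoso.wikidata.dbis.rwth-aachen.de/sparql/",
}

# Hypothetical named queries; $qid and $lang are the parameters.
# Prefixes are spelled out so the query does not rely on engine defaults.
NAMED_QUERIES = {
    "label_of": Template(
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?label WHERE {\n"
        "  wd:$qid rdfs:label ?label .\n"
        '  FILTER(LANG(?label) = "$lang")\n'
        "}"
    ),
}


def build_request(endpoint_name: str, query_name: str, qid: str, lang: str = "en"):
    """Return (url, headers) for a GET request; nothing is sent here."""
    # Validate parameters so they cannot inject arbitrary SPARQL.
    if not re.fullmatch(r"Q[1-9][0-9]*", qid):
        raise ValueError(f"not a Wikidata item id: {qid!r}")
    if not re.fullmatch(r"[a-z]{2,3}(-[a-z0-9]+)*", lang):
        raise ValueError(f"not a language tag: {lang!r}")
    sparql = NAMED_QUERIES[query_name].substitute(qid=qid, lang=lang)
    url = ENDPOINTS[endpoint_name] + "?" + urlencode({"query": sparql})
    headers = {"Accept": "application/sparql-results+json"}
    return url, headers


if __name__ == "__main__":
    url, headers = build_request("virtuoso-rwth", "label_of", qid="Q42")
    print(url)
    print(headers)
```

Swapping the backend then only means adding another entry to the endpoint registry; the caller's code and the query name stay the same. Monitoring for non-functional aspects (response time, availability) could wrap the actual HTTP call, which this sketch deliberately leaves out.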

TallTed commented 11 hours ago

Note that we (OpenLink Software [1], [2]) have also loaded Wikidata into a live Virtuoso instance, available at https://wikidata.demo.openlinksw.com/sparql.

I'm not sure whether I'm the "Ted" referenced in the last paragraph; if so, regrettably, I've forgotten the specifics of that conversation. Could you provide more detail about the "question" being asked by this issue, especially to benefit others who may have more to contribute to the "answer" than I?

WolfgangFahl commented 11 hours ago

https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours has the info as well as https://www.wikidata.org/wiki/Wikidata:Scholia/Events/Hackathon_October_2024

We are well aware of that Virtuoso endpoint; it is already configured in the default https://github.com/WolfgangFahl/snapquery/blob/main/snapquery/samples/endpoints.yaml file.

The question here is how we can quickly get a Virtuoso endpoint that is as up to date as possible. We intend to "rotate" images based on dumps as long as streaming updates are not possible, so currently that would be roughly weekly. https://github.com/ad-freiburg/qlever-control/discussions/82 is an example.

This is just the initial issue to start the communication. Depending on how Virtuoso is going to be involved, we might need multiple tickets for the different aspects. I suggest sticking with the import performance issue in this ticket for the time being and waiting for Tim's comment.