openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com

NQuad Ingestion via REST #1142

Closed eltonfss closed 1 year ago

eltonfss commented 1 year ago

Is there any way to ingest a payload in NQuad format via REST in Virtuoso?

According to https://etl.linkedpipes.com/tutorials/how-to/load_data_to_virtuoso:

... You can load data to Virtuoso by its bulk loader. This is a Virtuoso-specific solution. However, the standards-based SPARQL 1.1 Graph Store HTTP Protocol does not support loading data in quad-based RDF syntaxes that specify named graphs. Instead, named graph IRIs must be provided by a query parameter. If you want to load multiple files into multiple named graphs, the Virtuoso bulk loader can do it for you ...

Nonetheless, I need to be able to send the quads as the payload of a REST request. I haven't yet found an alternative in the official documentation, and was wondering whether this feature is indeed unavailable in Virtuoso. If it is unavailable, it might be of interest to include it in the development roadmap, since the same feature is available in other triplestores, such as Jena Fuseki, AllegroGraph, BlazeGraph, GraphDB and RDFox.

HughWilliams commented 1 year ago

There are various methods of ingesting data into Virtuoso as detailed in this RDF Insert Methods in Virtuoso document. Via REST with the Virtuoso Sponger Middleware, RDF datasets (NQuad and other such) can be ingested directly from the SPARQL endpoint, Virtuoso Crawler, or RDF Sink Folders (Virtuoso or ODS-Briefcase).

We also have sample code showing how to Bulk Load RDF datasets (NQuad and other such) using the RDF4J and Jena frameworks, which you might also want to review.

eltonfss commented 1 year ago

@HughWilliams Thank you for the prompt response! I've looked into the links you shared and configured the Virtuoso Sponger Middleware in my local deployment.

Unfortunately, even after installing it I get an error message informing me that I cannot upload data in N-Quads format: image

I've also tried to make the import through the /sparql-graph-crud-auth but I get an equivalent error: image image

Packages installed: image

SPARQL Account: image

I also considered trying the RDF Sink Folders (Virtuoso or ODS-Briefcase), but I noticed that they require the data to be placed in a file on the server, which is not ideal for my use case (I ultimately need to send the N-Quads payload in the HTTP request body). Is that interpretation correct?

Using RDF4J or Jena as a "frontend" for Virtuoso might also make sense, but since I'm trying to evaluate multiple triplestores (which are all deployed using docker) I haven't yet been able to invest time into that. If there were a containerized version of this, it would be very helpful (I've come across this one https://github.com/asanchez75/docker-rdf4j-virtuoso/blob/master/Dockerfile but since it was made a long time ago it might not be very reliable). Are you aware of a better/faster way to deploy one of these combined solutions as a docker image?

HughWilliams commented 1 year ago

We don't have a docker container image for the Virtuoso RDF4J HTTP Repository, so if you want to use it, it would have to be set up manually, or a docker container image would have to be created for the setup if deployment via docker is required.

namedgraph commented 1 year ago

@HughWilliams could you add N-Quads support to the Graph Store Protocol? It would be non-standard as per SPARQL 1.1, but Jena supports it, and probably others. E.g. POST /sparql-graph-crud/ (without any graph param) would append quads to the dataset.

pkleef commented 1 year ago

We committed patch https://github.com/openlink/virtuoso-opensource/commit/0d2c90ca90ea39666fa8dbf747c12498e1a434da to the develop/7 branch to add N-QUADS support to the Graph Store Protocol.

namedgraph commented 1 year ago

@pkleef so is it full CRUD support or only POST?

I'm trying the following and getting 406 Unacceptable

curl -i --digest -u dba:dba http://localhost:9030/sparql-graph-crud-auth -H "Accept: application/n-quads"

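It should work like this (and it does in Jena):

  • GET returns the whole dataset as quads
  • POST appends quads to dataset
  • PUT replaces the whole dataset as quads
  • DELETE removes the whole dataset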

The same issue is described in https://github.com/w3c/sparql-dev/issues/56

kidehen commented 1 year ago

@namedgraph —

@pkleef so is it full CRUD support or only POST?

I'm trying the following and getting 406 Unacceptable

curl -i --digest -u dba:dba http://localhost:9030/sparql-graph-crud-auth -H "Accept: application/n-quads"

It should work like this (and it does in Jena):

  • GET returns the whole dataset as quads
  • POST appends quads to dataset
  • PUT replaces the whole dataset as quads
  • DELETE removes the whole dataset

The same issue is described in w3c/sparql-dev#56

Jena is a triple store while Virtuoso is a Quad Store. What you desire isn't as trivial as presented in a Quad Store that also includes fine-grained named graph scoped ACLs for security and data governance, etc.

What's been implemented for this non-standard extension is:

$ curl -i --digest -u dba:dba http://localhost:8890/sparql-graph-crud-auth \
  -X POST -H 'Content-Type: application/n-quads' --data-binary @test.nq

A default POST request can add triples to existing graphs specified in test.nq.

If you want to clean all the graphs referenced in your .nq file, you can use a PUT command, which will incorporate a with-delete operation similar to the existing bulk loader.

Feature restrictions

  1. Graphs referenced in the .nq file cannot be split over 2 or more files
  2. Graphs referenced should comprise triples tailored in size based on available memory
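
Both restrictions above depend on knowing which graphs a .nq file references and how many quads each one holds. A minimal sketch for inspecting a file before sending it, assuming one statement per line with the graph label as the last term before a terminating " ." (this is a line-based heuristic, not an N-Quads parser, and the function name `count_quads_per_graph` is hypothetical rather than anything shipped with Virtuoso):

```shell
# Print "<graph> <count>" for every graph label found in an N-Quads file.
# Assumption: each quad sits on one line and ends with " <graph> ." or
# " _:graph ."; lines without a graph label (plain triples) are ignored.
count_quads_per_graph() {
  awk '
    # Skip blank lines and comment lines.
    NF == 0 || $1 ~ /^#/ { next }
    # The graph label is the field just before the final "." and must be
    # an IRI or a blank node; at least 5 fields rules out simple triples.
    NF >= 5 && $NF == "." && $(NF-1) ~ /^(<.*>|_:.*)$/ { count[$(NF-1)]++ }
    END { for (g in count) print g, count[g] }
  ' "$1" | sort
}
```

Running `count_quads_per_graph nq1.nq` lists each graph with its quad count. Comparing the output for two files shows whether a graph is split across them, and the counts give a rough idea of how large each graph is relative to available memory.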

That's what's on offer for now, due to the non-standard nature of these extensions, etc. If a special-need implementation is required, that can be pursued as potential customer-specific custom development rather than a bug fix.

Usage example with current implementation

curl --digest -u dba:*** -i -X POST --data-binary @nq1.nq -HContent-Type:application/n-quads http://localhost:8890/sparql-graph-crud-auth/
<g1> 5 triples
<g2> 5 triples 

curl --digest -u dba:*** -i -X PUT --data-binary @nq2.nq -HContent-Type:application/n-quads http://localhost:8890/sparql-graph-crud-auth/
<g1> 4 triples
<g2> 3 triples 

Verification:
sparql select ?g count(*) { graph ?g { ?s ?p ?o } filter (?g in (<g1>,<g2>))};

namedgraph commented 1 year ago

Jena is a triple store while Virtuoso is a Quad Store

@kidehen that is incorrect and you know it. Jena is a quad store as well.

kidehen commented 1 year ago

@kidehen that is incorrect and you know it. Jena is a quad store as well.

Clearly I didn't know that, hence my inaccurate comment.

Following your response, I've now looked up Bing+ChatGPT for the latest description.

Jena TDB is both a triple store and a quad store. It can store and query RDF data as triples or quads. A triple is a statement that consists of a subject, a predicate, and an object. A quad is a statement that also includes a graph name, which can be used to group triples into different named graphs. Jena TDB supports the full range of Jena APIs for working with triples and quads². You can use the Dataset API to access and manipulate named graphs in Jena TDB⁴. You can also use SPARQL queries to select, construct, or update triples or quads from different graphs⁵. Jena TDB is a native high performance triple store that does not require any extra tool other than Jena Framework².

Source: Conversation with Bing, 9/6/2023
(1) Apache Jena - TDB. https://jena.apache.org/documentation/tdb/
(2) Apache Jena - Home. https://jena.apache.org/
(3) rdf - Persisting data in Jena TDB triple store - Stack Overflow. https://stackoverflow.com/questions/30682246/persisting-data-in-jena-tdb-triple-store
(4) Apache Jena - TDB Architecture. https://jena.apache.org/documentation/tdb/architecture.html
(5) GitHub - srdc/triplestore: Unified Triple Store Interface working with .... https://github.com/srdc/triplestore

kidehen commented 1 year ago

A clearer response: Virtuoso implements Quad Storage via its core DBMS engine. It provides ACID for CRUD operations, and uses named-graph-scoped ACLs for fine-grained attribute-based access control (ABAC).

The fundamentals above impact its behavior, with regard to Quad Management and what is acceptable via its SPARQL Graph Protocol implementation.