w3c / sparql-dev

SPARQL dev Community Group
https://w3c.github.io/sparql-dev/

Standardize a bulk loading API #149

Open JervenBolleman opened 3 years ago

JervenBolleman commented 3 years ago

Why?

A lot of public data is available as RDF but not via a SPARQL endpoint. It would be useful if the scripts used to bulk load such RDF into a store could be shared between vendors. A data provider could then give people a single set of instructions for loading larger datasets.

Previous work

Proposed solution

A single executable (per vendor/SPARQL database) that consumes two files. The first configures the database connection, or even the whole database. The second contains a description of the files (IRIs) to load.

./loader configuration.ttl datadescriptiontoload.ttl

The description of the files to load should be in RDF for extensibility.

prefix loader:<TO BE DETERMINED>
prefix vendor:<TO BE DETERMINED>

<file:///input/myfile1.ttl> a loader:File ;
    loader:intoGraph <http://example.org/graph> .
<https://intranet.ofmy.org/input/myfile2.rdf> a loader:Download .
<file:///input/myfile1-pipe.ttl> a loader:Named_Pipe ;
    vendor:sortOrder vendor:PGSO .

The loader program instructs the SPARQL store how to actually load the data. This is vendor-dependent logic and may take extra information about the infrastructure into account.

The configuration file will probably be completely custom to the vendor/product.
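As a rough illustration of the proposed two-file command line, a vendor's front end might start out like the Python sketch below. The function name `parse_loader_args` and the argument names `configuration` and `description` are assumptions made for this example; the proposal only fixes the order of the two files.

```python
import argparse
from pathlib import Path
from typing import Optional, Sequence


def parse_loader_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the two positional files of the proposed loader command line.

    Hypothetical helper: only the "configuration first, description second"
    order comes from the proposal; every name here is illustrative.
    """
    parser = argparse.ArgumentParser(
        prog="loader",
        description="Sketch of a vendor-specific bulk loader front end.",
    )
    # First file: vendor-specific database or connection configuration.
    parser.add_argument("configuration", type=Path,
                        help="vendor-specific configuration file")
    # Second file: RDF description of the files (IRIs) to load.
    parser.add_argument("description", type=Path,
                        help="RDF description of the files to load")
    return parser.parse_args(argv)
```

Calling `parse_loader_args(["configuration.ttl", "datadescriptiontoload.ttl"])` yields a namespace holding both paths. Everything after that point (reading the configuration, interpreting the description) would remain vendor-specific.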

Ideas for describing the nature of the IRIs to be loaded and how the database/loader should interact with them:

loader:File rdfs:comment "The IRI to be loaded is a file that allows random access." .
loader:Download rdfs:comment "The database should attempt to retrieve the IRI to be loaded." .
loader:Named_Pipe rdfs:comment "The IRI to be loaded is a named pipe and may only be read once in the forward direction" .
loader:Download_And_Push rdfs:comment "The loader should attempt to retrieve the IRI to be loaded and stream it to the database, as the database may not have access to the IRI itself." .
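To make the difference between these four classes concrete, here is a minimal Python sketch of how a loader might turn each class into a handling plan. The names `LoadKind`, `LoadPlan` and `plan_for` are hypothetical, and `retrieved_by` is left as `None` wherever the comments above do not say who reads the data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LoadKind(Enum):
    """The four resource classes sketched in the proposal."""
    FILE = "File"
    DOWNLOAD = "Download"
    NAMED_PIPE = "Named_Pipe"
    DOWNLOAD_AND_PUSH = "Download_And_Push"


@dataclass(frozen=True)
class LoadPlan:
    # Who fetches the bytes: "database", "loader", or None when the
    # proposal's comments do not specify it.
    retrieved_by: Optional[str]
    # True when the source may only be read once, front to back.
    single_pass: bool


def plan_for(kind: LoadKind) -> LoadPlan:
    """Map a resource class to a handling plan (illustrative only)."""
    if kind is LoadKind.FILE:
        # A file allows random access, so it may be re-read or seeked.
        return LoadPlan(retrieved_by=None, single_pass=False)
    if kind is LoadKind.NAMED_PIPE:
        # A named pipe can only be read once, in the forward direction.
        return LoadPlan(retrieved_by=None, single_pass=True)
    if kind is LoadKind.DOWNLOAD:
        # The database itself attempts to retrieve the IRI.
        return LoadPlan(retrieved_by="database", single_pass=False)
    if kind is LoadKind.DOWNLOAD_AND_PUSH:
        # The loader retrieves the IRI and streams it to the database.
        # Treating that stream as single-pass is an assumption of this sketch.
        return LoadPlan(retrieved_by="loader", single_pass=True)
    raise ValueError(f"unknown load kind: {kind!r}")
```

A vendor implementation would likely add its own fields to the plan, for example the `vendor:sortOrder` hint from the description file.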

Considerations for backward compatibility

None

afs commented 3 years ago

See #56 -- "POST quads" (which is roughly "normal HTTP operations on a dataset").

VladimirAlexiev commented 3 years ago

IMHO such functionality is not so easy to standardize because there are many options.

GraphDB

Like most databases, GraphDB includes a variety of tools for loading data that are appropriate for different situations:

In addition, GraphDB can push changes to Kafka, and a future version will have ingest from Kafka.