Standardize a bulk loading API

JervenBolleman commented 3 years ago

Why?

There is a lot of public data available as RDF, but it is not always available via a SPARQL endpoint. It would be nice if the scripts to bulk load such RDF into a store could be shared between vendors. This way a data provider can easily tell people how to load larger datasets.

Previous work

Proposed solution

A single executable (per vendor/SPARQL database) that consumes two files. The first configures the database connection, or even the whole database. The second contains a description of the files (IRIs) to load.

./loader configuration.ttl datadescriptiontoload.ttl
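
A minimal sketch of what the command-line front end of such a loader could look like, assuming nothing beyond the two positional files described above. The function name `parse_args` and the argument names are hypothetical; a real vendor loader would go on to read the configuration and hand the description to the store.

```python
import argparse
from pathlib import Path


def parse_args(argv):
    """Parse the two positional arguments of the proposed loader command."""
    parser = argparse.ArgumentParser(
        prog="loader",
        description="Bulk load RDF into a store (hypothetical sketch).",
    )
    # First file: vendor-specific configuration of the connection or database.
    parser.add_argument("configuration", type=Path)
    # Second file: RDF description of the files (IRIs) to load.
    parser.add_argument("description", type=Path)
    return parser.parse_args(argv)


# Equivalent to: ./loader configuration.ttl datadescriptiontoload.ttl
args = parse_args(["configuration.ttl", "datadescriptiontoload.ttl"])
print(args.configuration, args.description)
```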

A description of the files to load should be in RDF for extensibility.

prefix loader: <TO BE DETERMINED>
prefix vendor: <TO BE DETERMINED>

<file:///input/myfile1.ttl> a loader:File ;
    loader:intoGraph <http://example.org/graph> .
<https://intranet.ofmy.org/input/myfile2.rdf> a loader:Download .
<file:///input/myfile1-pipe.ttl> a loader:Named_Pipe ;
    vendor:sortOrder vendor:PGSO .

The loader program instructs the SPARQL store how to actually load the data. This is vendor-dependent logic and may take extra information about the infrastructure into account.

The configuration file will probably be completely custom to the vendor/product.

Ideas for classes that describe the nature of the IRIs to be loaded and how the database/loader should interact with them:

loader:File rdfs:comment "The IRI to be loaded is a file that allows random access." .
loader:Download rdfs:comment "The database should attempt to retrieve the IRI to be loaded." .
loader:Named_Pipe rdfs:comment "The IRI to be loaded is a named pipe and may only be read once, in the forward direction." .
loader:Download_And_Push rdfs:comment "The loader should attempt to retrieve the IRI to be loaded and stream it to the database, as the database may not have access to the IRI itself." .
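
To illustrate how a loader might act on these four classes, here is a rough sketch that maps each one to a handling strategy. Everything in it is an assumption for illustration: the names `SourceKind`, `Source` and `plan` are hypothetical, the loader namespace is still undetermined, and a real implementation would read the description with an RDF parser rather than build `Source` objects by hand.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SourceKind(Enum):
    # Local names of the proposed classes; their namespace is not yet decided.
    FILE = "File"
    DOWNLOAD = "Download"
    NAMED_PIPE = "Named_Pipe"
    DOWNLOAD_AND_PUSH = "Download_And_Push"


@dataclass(frozen=True)
class Source:
    iri: str
    kind: SourceKind
    into_graph: Optional[str] = None  # value of loader:intoGraph, if given


def plan(source: Source) -> str:
    """Describe how a loader could handle one source (illustration only)."""
    if source.kind is SourceKind.FILE:
        action = "database reads the file; random access allowed"
    elif source.kind is SourceKind.DOWNLOAD:
        action = "database retrieves the IRI itself"
    elif source.kind is SourceKind.NAMED_PIPE:
        action = "read once, forward only"
    else:
        action = "loader retrieves the IRI and streams it to the database"
    target = source.into_graph or "graph unspecified"
    return f"{source.iri} ({target}): {action}"


sources = [
    Source("file:///input/myfile1.ttl", SourceKind.FILE, "http://example.org/graph"),
    Source("file:///input/myfile1-pipe.ttl", SourceKind.NAMED_PIPE),
]
for s in sources:
    print(plan(s))
```

Keeping the class-to-strategy mapping in one place is only one possible design; a vendor could equally dispatch inside the database rather than in the loader.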

Considerations for backward compatibility

None

afs commented 3 years ago

See #56 -- "POST quads" (which is roughly "normal HTTP operations on a dataset").
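
For illustration, a "POST quads" request could be built roughly as follows. This is a sketch under assumptions, not what #56 specifies: the dataset URL is hypothetical, the helper name `build_post_quads_request` is made up, and whether a given store accepts `application/n-quads` at its dataset URL depends on the store. The request is only constructed here, not sent.

```python
import urllib.request


def build_post_quads_request(dataset_url: str, nquads: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST carrying N-Quads to a dataset URL."""
    return urllib.request.Request(
        dataset_url,
        data=nquads.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/n-quads"},
    )


# Hypothetical dataset URL and a single quad in the graph used earlier.
quad = '<http://example.org/s> <http://example.org/p> "o" <http://example.org/graph> .\n'
req = build_post_quads_request("http://example.org/dataset", quad)
print(req.get_method(), req.full_url)
# Sending would be: urllib.request.urlopen(req)
```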

VladimirAlexiev commented 3 years ago

IMHO such functionality is not so easy to standardize because there are many options:

- Various RDF formats
- Various channels (as described above)
- Various modes (e.g. as described below), including offline and online (specifying an offline "storage location" will be different from specifying an online "database")
- Various repository options, such as indexing (see vendor:sortOrder vendor:PGSO above), entity id bit-size, reasoning mode, enabled plugins including secondary indexing, etc. E.g. see the GraphDB repo config

The proposed configuration.ttl should extend existing RDF-based configuration mechanisms:
- Jena assemblers (assembler.ttl) and text dataset assembler
- rdf4j
- GraphDB configuration

GraphDB

Like most databases, GraphDB includes a variety of tools for loading data that are appropriate for different situations:

- SPARQL endpoint: SPARQL Graph Protocol and Update operations
- Workbench import of a local or remote RDF file: interactive upload from a file or URL
- Workbench import of a server file: interactive loading of a file already in a local server directory
- LoadRDF: fast; offline database, includes inference
- Preload: fastest; offline database, excludes inference

In addition, GraphDB can push changes to Kafka, and a future version will support ingest from Kafka.
