openlink / virtuoso-opensource

Virtuoso is a high-performance and scalable Multi-Model RDBMS, Data Integration Middleware, Linked Data Deployment, and HTTP Application Server Platform
https://vos.openlinksw.com

JDBC/Jena/Sesame RDF bulk loader should not be string based #155

Open JervenBolleman opened 10 years ago

JervenBolleman commented 10 years ago

Currently the RDF bulk load operations in Virtuoso are Turtle string based, i.e. a Java RDF model is serialised into a Turtle string. This Turtle string is then parsed inside the database using the TTLP function to load the data into the rdf_quad and rdf_obj tables.
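To make the current path concrete, here is a minimal sketch of what string-based loading amounts to. The class and method names are hypothetical and not taken from the Virtuoso providers, and the TTLP argument order (text, base IRI, graph IRI, flags) is an assumption to check against the Virtuoso documentation.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.List;

// Hypothetical helper for illustration; not taken from the Virtuoso providers.
final class StringBasedLoad {

    // A triple with an IRI subject, an IRI predicate and a plain literal object.
    static final class Triple {
        final String subject;
        final String predicate;
        final String literal;

        Triple(String subject, String predicate, String literal) {
            this.subject = subject;
            this.predicate = predicate;
            this.literal = literal;
        }
    }

    // Client side: every triple is written out as one line of Turtle text.
    static String toTurtle(List<Triple> triples) {
        StringBuilder sb = new StringBuilder();
        for (Triple t : triples) {
            sb.append('<').append(t.subject).append("> <")
              .append(t.predicate).append("> \"")
              .append(escape(t.literal)).append("\" .\n");
        }
        return sb.toString();
    }

    // Escape the characters that would end or break a quoted Turtle literal.
    static String escape(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    // Server side: the whole string goes to TTLP, which has to parse it again.
    // Assumed argument order: text, base IRI, graph IRI, flags.
    static void load(Connection conn, String turtle, String graph) throws SQLException {
        try (CallableStatement cs = conn.prepareCall("{call DB.DBA.TTLP(?, ?, ?, ?)}")) {
            cs.setString(1, turtle);
            cs.setString(2, "");
            cs.setString(3, graph);
            cs.setInt(4, 0);
            cs.execute();
        }
    }
}
```

Every value is escaped on the client and unescaped again by the server-side parser, which is the round trip this issue asks to avoid.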

I suggest that, instead of sending a string to be parsed, we send four (or five) arrays: the first holds the subjects (URIs/bnodes), the second the predicates (URIs), the third the URI/bnode objects, the fourth the literal objects (this may be merged with the third), and the fifth the bnode/URI of the graph context.
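A rough sketch of that five-array layout on the Java side follows. The class `QuadColumns` and its nested `Quad` type are hypothetical names used only for illustration. Position i of every array belongs to quad i, and the unused object slot is left null.

```java
import java.util.List;

// Hypothetical column-wise container; position i of each array is quad i.
final class QuadColumns {

    // Row-wise input: one quad, with a flag telling whether the object is a literal.
    static final class Quad {
        final String subject;
        final String predicate;
        final String object;
        final boolean literalObject;
        final String graph;

        Quad(String subject, String predicate, String object,
             boolean literalObject, String graph) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
            this.literalObject = literalObject;
            this.graph = graph;
        }
    }

    final String[] subjects;        // URIs or bnodes
    final String[] predicates;      // URIs
    final String[] resourceObjects; // URI/bnode objects, null for literal rows
    final String[] literalObjects;  // literal objects, null for resource rows
    final String[] graphs;          // bnode/URI of the graph context

    // Transpose the row-wise quads into the five parallel arrays.
    QuadColumns(List<Quad> quads) {
        int n = quads.size();
        subjects = new String[n];
        predicates = new String[n];
        resourceObjects = new String[n];
        literalObjects = new String[n];
        graphs = new String[n];
        for (int i = 0; i < n; i++) {
            Quad q = quads.get(i);
            subjects[i] = q.subject;
            predicates[i] = q.predicate;
            if (q.literalObject) {
                literalObjects[i] = q.object;
            } else {
                resourceObjects[i] = q.object;
            }
            graphs[i] = q.graph;
        }
    }
}
```

Merging the third and fourth arrays, as mentioned above, would need some other way to mark literals, for example a separate flag array.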

Being able to send such a structured format to the database not only avoids parsing, but also opens up the possibility of vectored loading. Each of these arrays of values can be replaced by rdf_obj ids in parallel, which allows you to build up a page for the rdf_quad table. In general, it avoids the serial CPU load of parsing the Turtle string.
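The column-wise id replacement can be illustrated with a small in-memory sketch. This is not how Virtuoso is implemented: the `ConcurrentHashMap` merely stands in for the server's value-to-id lookup, and `VectoredIds` is a hypothetical name. Each column is encoded independently (and so can run in parallel), and the encoded columns are then zipped into one id tuple per quad.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// In-memory stand-in for the server's value-to-id lookup (illustration only).
final class VectoredIds {
    private final ConcurrentHashMap<String, Long> dictionary = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong(1);

    // Return the id of a value, assigning a fresh one on first sight.
    long idOf(String value) {
        return dictionary.computeIfAbsent(value, v -> nextId.getAndIncrement());
    }

    // Replace every value of one column by its id.
    long[] encodeColumn(String[] column) {
        return Arrays.stream(column).mapToLong(this::idOf).toArray();
    }

    // Encode the columns concurrently, then zip them into one id tuple per quad.
    long[][] encodeRows(List<String[]> columns) {
        long[][] encoded = columns.parallelStream()
                                  .map(this::encodeColumn)
                                  .toArray(long[][]::new);
        int rowCount = encoded.length == 0 ? 0 : encoded[0].length;
        long[][] rows = new long[rowCount][encoded.length];
        for (int col = 0; col < encoded.length; col++) {
            for (int row = 0; row < rowCount; row++) {
                rows[row][col] = encoded[col][row];
            }
        }
        return rows;
    }
}
```

Which id a given value receives depends on thread scheduling, but the same value always maps to the same id, which is all the row-building step relies on.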

HughWilliams commented 10 years ago

Hi Jerven, Good to meet you at the LDBC TUC meeting. As discussed with Orri we shall implement this feature enhancement for bulk loading of RDF data. I shall notify you when it is available in the open source git repo ...

HughWilliams commented 10 years ago

Actually, having spoken to Orri this morning, this is not a feature enhancement: the function/procedure already exists, and he just needs to provide instructions on its usage, which he indicated will be provided tomorrow ...

JervenBolleman commented 10 years ago

The feature is probably there on the database side. However, it does require a rather significant improvement to the JDBC drivers, as the JDBC Connection method createArrayOf is currently not implemented, i.e. from the Java side it will be really hard to use. See libsrc/JDBCDriverType4/virtuoso/jdbc2:

public Array createArrayOf(String typeName, Object[] elements) throws SQLException
{
    throw new VirtuosoFNSException("createArrayOf(typeName, elements)  not supported", VirtuosoException.NOTIMPLEMENTED);
}

HughWilliams commented 10 years ago

OK, I think Orri was assuming you would call the Virtuoso server-side procedure directly, but if Java already has a createArrayOf method for this, then we can use it for the implementation ...

JervenBolleman commented 10 years ago

The main issue is getting the data from the Java side via the driver into Virtuoso. Most of the LOB or Array methods that one would normally use are not reachable from pure JDBC code.
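For reference, this is roughly what pure JDBC client code would look like if the driver supported arrays. It uses only the standard java.sql API (`Connection.createArrayOf`, `PreparedStatement.setArray`); the class `ArrayTransfer`, the SQL type name passed in, and whatever server-side procedure the statement would call are assumptions, not something the thread confirms.

```java
import java.sql.Array;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical client-side helper built only on standard java.sql calls.
final class ArrayTransfer {

    // Ask the driver for a java.sql.Array; return null if it cannot build one
    // (for example because createArrayOf throws a "not supported" exception).
    static Array tryCreateArray(Connection conn, String sqlTypeName, String[] column) {
        try {
            return conn.createArrayOf(sqlTypeName, column);
        } catch (SQLException e) {
            return null;
        }
    }

    // Bind one array per statement parameter. Returns false as soon as the
    // driver refuses, so the caller can fall back to the string-based path.
    static boolean bindColumns(Connection conn, PreparedStatement ps,
                               String sqlTypeName, String[][] columns) throws SQLException {
        for (int i = 0; i < columns.length; i++) {
            Array array = tryCreateArray(conn, sqlTypeName, columns[i]);
            if (array == null) {
                return false;
            }
            ps.setArray(i + 1, array); // JDBC parameters are 1-based
        }
        return true;
    }
}
```

With the driver code quoted earlier in this thread, `tryCreateArray` would presumably return null every time, assuming VirtuosoFNSException is a SQLException, so client code has no standard way to hand the arrays over.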