uf6 / design

Openly designing data enrichment solutions
http://en.wikipedia.org/wiki/Uranium_hexafluoride

Experiments in SPARQL, or how I learned to stop worrying and name the graph. #6

Open pudo opened 9 years ago

pudo commented 9 years ago

So I’ve had the worst possible weekend, implementing a version of the grano API that is based on RDF/SPARQL. The RDF tooling for anything other than Java is rotten. If you want to use RDF, I would seriously look at something that runs on the JVM for server-side processing (Clojure, Scala…?).

All of that would be a nice challenge, but the result is incredibly slow: running a simple count query on my network entities on Jena Fuseki now takes 300-400ms, and that’s not even a large dataset (5k entities, something like 3k relationships). This remains pretty much the same if I use an in-memory server. It’s 3 seconds on dydra (the fuck?). I must be doing something seriously wrong, but I can’t figure out what - perhaps it’s related to named graphs.

In any case, I thought you might be interested in playing with the raw data. It’s a quarter million quads, modelled along the lines of what we discussed in #2 and #3. Provenance graphs are UUIDs, everything else is in http://example/update-base/default.
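For anyone who wants to reproduce timings like the ones above, here is a minimal Python sketch that sends a query over the standard SPARQL protocol (HTTP GET with a `query` parameter) and reports the best wall-clock time over a few runs. The endpoint URL, the example count query, and the helper names `build_query_url` and `time_query` are assumptions for illustration, not part of the grano code.

```python
import time
import urllib.parse
import urllib.request


def build_query_url(endpoint, query):
    """Build a SPARQL protocol GET URL: the query goes in the 'query' parameter."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def time_query(endpoint, query, runs=5):
    """Return the best wall-clock time (seconds) over several runs."""
    request = urllib.request.Request(
        build_query_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        with urllib.request.urlopen(request) as response:
            response.read()  # include result transfer in the measurement
        timings.append(time.perf_counter() - start)
    return min(timings)


# Hypothetical count over all named graphs; not the generated grano query.
COUNT_QUERY = "SELECT (COUNT(*) AS ?n) WHERE { GRAPH ?g { ?s ?p ?o } }"

# Hypothetical local Fuseki endpoint; adjust host and dataset name.
# print(time_query("http://localhost:3030/ds/query", COUNT_QUERY))
```

Taking the minimum over several runs filters out one-off set-up and warm-up costs, which helps separate query execution time from connection overhead.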

pudo commented 9 years ago

Here's a sample SPARQL query. It's generated, which gives it these weirdly named labels:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX gd: <http://data.grano.cc/v1/>
PREFIX gf: <http://ns.grano.cc/v1/fields/>

SELECT ?root ?status_f66de9cdcc ?schemata_048d504a3d ?hidden_427d6a6016 ?name_fd4d44795e 
?label_4a212b3078 ?_any_d01be58ea7_name ?_any_d01be58ea7_value ?_any_d01be58ea7_graph 
?_any_d01be58ea7_source_url ?id_b79420d01c

WHERE {
  ?root gf:inProject <http://data.grano.cc/v1/projects/opennews2> .
  ?root a gd:entities .
  OPTIONAL { ?root gf:status ?status_f66de9cdcc }
  GRAPH ?_any_d01be58ea7_graph { ?root ?_any_d01be58ea7_attr ?_any_d01be58ea7_value }
  ?_any_d01be58ea7_graph gf:isActive true .
  OPTIONAL { ?_any_d01be58ea7_graph dc:source ?_any_d01be58ea7_source_url }
  ?_any_d01be58ea7_attr a gd:attributes .
  ?_any_d01be58ea7_attr dc:identifier ?_any_d01be58ea7_name .
  ?root a ?schemata_048d504a3d .
  ?schemata_048d504a3d a gd:schemata .
  OPTIONAL { ?schemata_048d504a3d gf:isHidden ?hidden_427d6a6016 }
  OPTIONAL { ?schemata_048d504a3d dc:identifier ?name_fd4d44795e }
  OPTIONAL { ?schemata_048d504a3d <http://www.w3.org/2000/01/rdf-schema#label> ?label_4a212b3078 }
  OPTIONAL { ?root gf:id ?id_b79420d01c }

  {
    SELECT DISTINCT ?root
    WHERE {
      ?root gf:inProject <http://data.grano.cc/v1/projects/opennews2> .
      ?root a gd:entities .
      OPTIONAL { ?root gf:status ?status_f66de9cdcc }
      GRAPH ?_any_d01be58ea7_graph { ?root ?_any_d01be58ea7_attr ?_any_d01be58ea7_value }
      ?_any_d01be58ea7_graph gf:isActive true .
      OPTIONAL { ?_any_d01be58ea7_graph dc:source ?_any_d01be58ea7_source_url }
      ?_any_d01be58ea7_attr a gd:attributes .
      ?_any_d01be58ea7_attr dc:identifier ?_any_d01be58ea7_name .
      ?root a ?schemata_048d504a3d .
      ?schemata_048d504a3d a gd:schemata .
      OPTIONAL { ?schemata_048d504a3d gf:isHidden ?hidden_427d6a6016 }
      OPTIONAL { ?schemata_048d504a3d dc:identifier ?name_fd4d44795e }
      OPTIONAL { ?schemata_048d504a3d <http://www.w3.org/2000/01/rdf-schema#label> ?label_4a212b3078 }
      OPTIONAL { ?root gf:id ?id_b79420d01c }
    }
    LIMIT 25
  }
}

jmatsushita commented 9 years ago

Thanks for sharing your adventures!

I think Jena is not meant for speed. Also, we’re definitely reaching the limits of my practical experience! Maybe an index thing? It might be related to named graphs; not all stores are optimised for that. Funnily enough, when looking into this on StackOverflow I found out that Virtuoso’s quad store is based on SQL?! http://stackoverflow.com/questions/17719341/difference-between-virtuoso-native-rdf-quad-store-and-virtuoso-sql-based-rdf-tri/17720682#17720682. Also some interesting stuff there:

Benchmark related stuff:

From when I looked, the only really good RDF tooling was in Ruby (I particularly liked Spira: https://github.com/ruby-rdf/spira). I wouldn’t be surprised if stuff starts coming up in the JavaScript arena too. I have an irrational dislike of Java… :)

Maybe @elf-pavlik or @lisp could help with the performance question?

lisp commented 9 years ago

i have looked closer at your query. there are two issues. first, i suspect the query expects the default dataset to include the named graphs. this is not the case with dydra. in order to apply a query to such a dataset, it should include

from <urn:dydra:all>

to specify that intent.

second, we are working on changes to our control structures, with the unfortunate consequence that, at the moment, caches are disabled and the query set-up time is much higher than it should be. in this case a query (with the inclusive dataset specification) which has an actual execution time under 200ms has a set-up time ten times that.
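To illustrate the first point, here is a minimal Python sketch of the difference between a strict default graph and one that merges in the named graphs. The toy quads and the helper names `default_graph` and `match` are made up for illustration; the graph name http://example/update-base/default is the one mentioned earlier in this thread. This is not how dydra or Fuseki are implemented, only the dataset semantics in miniature.

```python
# Toy quad store: (subject, predicate, object, graph).
# A graph of None would mean the unnamed default graph. All data is invented.
BASE = "http://example/update-base/default"

QUADS = [
    ("ex:alice", "rdf:type", "gd:entities", BASE),
    ("urn:uuid:g1", "gf:isActive", "true", BASE),
    ("ex:alice", "ex:name", "Alice", "urn:uuid:g1"),
]


def default_graph(quads, union=False):
    """Triples visible to patterns written outside any GRAPH clause.

    union=False: only triples stored in the unnamed default graph.
    union=True:  all named graphs merged in (the effect of an
                 'all graphs' dataset declaration).
    """
    return {(s, p, o) for s, p, o, g in quads if union or g is None}


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


strict = default_graph(QUADS)
merged = default_graph(QUADS, union=True)

print(len(match(strict, p="gf:isActive")))  # 0: everything lives in named graphs
print(len(match(merged, p="gf:isActive")))  # 1: visible once graphs are merged
```

With every statement stored in some named graph, the strict default graph is empty, so any pattern outside a GRAPH clause matches nothing until the dataset is declared to include the named graphs.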

pudo commented 9 years ago

@lisp many thanks for that analysis! For your reference, here's the actual COUNT query I was referring to:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX gd: <http://data.grano.cc/v1/>
PREFIX gf: <http://ns.grano.cc/v1/fields/>
SELECT COUNT(DISTINCT(?root))
WHERE {
  ?root gf:inProject <http://data.grano.cc/v1/projects/opennews2> .
  ?root a gd:entities .
  GRAPH ?_any_5b726eb44c_graph { ?root ?_any_5b726eb44c_attr ?_any_5b726eb44c_value }
  ?_any_5b726eb44c_graph gf:isActive true .
  OPTIONAL { ?_any_5b726eb44c_graph dc:source ?_any_5b726eb44c_source_url }
  ?_any_5b726eb44c_attr a gd:attributes .
  ?_any_5b726eb44c_attr dc:identifier ?_any_5b726eb44c_name
}

lisp commented 9 years ago

On 2014-08-25, at 20:34, Friedrich Lindenberg notifications@github.com wrote:

@lisp many thanks for that analysis! For your reference, here's the actual COUNT query I was referring to:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX gd: <http://data.grano.cc/v1/>
PREFIX gf: <http://ns.grano.cc/v1/fields/>
SELECT COUNT(DISTINCT(?root))
WHERE {
  ?root gf:inProject <http://data.grano.cc/v1/projects/opennews2> .
  ?root a gd:entities .
  GRAPH ?_any_5b726eb44c_graph { ?root ?_any_5b726eb44c_attr ?_any_5b726eb44c_value }
  ?_any_5b726eb44c_graph gf:isActive true .
  OPTIONAL { ?_any_5b726eb44c_graph dc:source ?_any_5b726eb44c_source_url }
  ?_any_5b726eb44c_attr a gd:attributes .
  ?_any_5b726eb44c_attr dc:identifier ?_any_5b726eb44c_name
}

i expect, this would need to declare the dataset as follows, as it intends to both incorporate the named graphs into the default graph and match each one separately

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX gd: <http://data.grano.cc/v1/>
PREFIX gf: <http://ns.grano.cc/v1/fields/>
SELECT count(*) # COUNT(DISTINCT(?root))
from <urn:dydra:all>
from named <urn:dydra:named>
WHERE {
  ?root gf:inProject <http://data.grano.cc/v1/projects/opennews2> . # 159
  ?root a gd:entities . # 31 / 5951
  GRAPH ?_any_5b726eb44c_graph { ?root ?_any_5b726eb44c_attr ?_any_5b726eb44c_value }
  ?_any_5b726eb44c_graph gf:isActive true .
  OPTIONAL { ?_any_5b726eb44c_graph dc:source ?_any_5b726eb44c_source_url }
  ?_any_5b726eb44c_attr a gd:attributes .
  ?_any_5b726eb44c_attr dc:identifier ?_any_5b726eb44c_name
}

still, i am not clear, what you intend. it looks like you want to restrict the graphs, but somehow that restriction eliminates everything,

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX gd: <http://data.grano.cc/v1/>
PREFIX gf: <http://ns.grano.cc/v1/fields/>
SELECT count(*)
from <urn:dydra:all>
from named <urn:dydra:named>
WHERE {
  GRAPH ?_any_d01be58ea7_graph { ?root ?_any_d01be58ea7_attr ?_any_d01be58ea7_value } # 262025
  ?_any_d01be58ea7_graph gf:isActive true . # 47848
}

in that, for your current dataset, the count here is zero, despite the respective statement pattern cardinality.
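One way to read these counts: each pattern can match many statements on its own while their join is still empty, because the join keeps only solutions where the graph name bound by GRAPH is also the subject of a gf:isActive statement. A minimal Python sketch with invented data follows. The identifiers and the helper names `graph_pattern`, `active_subjects` and `join_count` are hypothetical, and this is not a diagnosis of the actual dataset.

```python
# Quads in named graphs: (subject, predicate, object, graph). Invented data.
QUADS = [
    ("ex:alice", "ex:name", "Alice", "urn:uuid:g1"),
    ("ex:bob", "ex:name", "Bob", "urn:uuid:g2"),
]

# Two hypothetical variants of the isActive statements in the default graph.
MISMATCHED = [
    ("ex:source-1", "gf:isActive", "true"),
    ("ex:source-2", "gf:isActive", "true"),
]
MATCHING = [
    ("urn:uuid:g1", "gf:isActive", "true"),
]


def graph_pattern(quads):
    """GRAPH ?g { ?s ?p ?o }: one solution per quad, binding ?g."""
    return [{"g": g, "s": s, "p": p, "o": o} for s, p, o, g in quads]


def active_subjects(default_triples):
    """?g gf:isActive true: the set of values ?g may take."""
    return {
        s for s, p, o in default_triples
        if p == "gf:isActive" and o == "true"
    }


def join_count(quads, default_triples):
    """Count solutions where both patterns agree on ?g."""
    active = active_subjects(default_triples)
    return sum(1 for sol in graph_pattern(quads) if sol["g"] in active)


print(len(graph_pattern(QUADS)), len(active_subjects(MISMATCHED)))  # 2 2
print(join_count(QUADS, MISMATCHED))  # 0: no graph name is an isActive subject
print(join_count(QUADS, MATCHING))    # 1: urn:uuid:g1 appears in both patterns
```

Checking whether the subjects of the isActive statements are spelled exactly like the graph names is a cheap first test when a join like this comes back empty.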

pudo commented 9 years ago

I've blogged about this whole thing here: http://pudo.org/blog/2014/09/01/grano-linked-data.html

akuckartz commented 8 years ago

@pudo Still interested in resolving this issue?