schema parse problem - Githubissues

goodb commented 4 years ago

The following schema causes a problem for the GenParser.parseSchema method.

BASE   <http://purl.obolibrary.org/obo/go/shapes/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX directly_provides_input_for: <http://purl.obolibrary.org/obo/RO_0002413>
PREFIX regulates: <http://purl.obolibrary.org/obo/RO_0002211>
PREFIX positively_regulates: <http://purl.obolibrary.org/obo/RO_0002213>

<s1> {
}

<s2> {
}

<MolecularFunction> {
  directly_provides_input_for: @<MolecularFunction> *;
  regulates: @<MolecularFunction> *;
  positively_regulates: @<MolecularFunction> *;
} // rdfs:comment  "A molecular function"

It is strange. If you take out either of the empty shapes (s1, s2), or one of the constraints on the MolecularFunction shape, it parses almost instantly. As it is, I left it running for more than 5 minutes and watched the Java memory usage climb above 4 GB.

I have extracted this minimal example from the real schema our group is working on here: https://github.com/geneontology/go-shapes/blob/master/shapes/go-cam-shapes.shex

Help on this would be awesome.

jdusart commented 4 years ago

OK, I managed to reproduce the issue and found where the problem is. It is not in the parser but in the computation of the stratification (line 703 in ShexSchema.java). That code enumerates paths, which can be slow and costly when the schema has loops. This is a part of the code that I don't really like, but I don't have a good solution for changing it.

I will try to find a solution.
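To make the cost of enumerating paths through loops concrete, here is a minimal sketch. It is not the code in ShexSchema.java: the class `PathEnumerationSketch` and the method `countWalks` are hypothetical names, and the graph is only an assumed model of the three self-references on the MolecularFunction shape. It counts every sequence of references up to a given length bound.

```java
import java.util.List;
import java.util.Map;

// Illustrative only: a toy model of path enumeration, not the shex-java implementation.
final class PathEnumerationSketch {

    // Counts every walk of 1..maxLength edges that starts at `start`.
    // Edges are an adjacency list; parallel edges appear as repeated targets.
    static long countWalks(Map<String, List<String>> edges, String start, int maxLength) {
        if (maxLength == 0) {
            return 0;
        }
        long total = 0;
        for (String next : edges.getOrDefault(start, List.of())) {
            // One walk for this edge alone, plus every way of extending it.
            total += 1 + countWalks(edges, next, maxLength - 1);
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed model: three references from MolecularFunction back to itself.
        Map<String, List<String>> edges = Map.of(
            "MolecularFunction",
            List.of("MolecularFunction", "MolecularFunction", "MolecularFunction"));
        for (int bound : new int[] {5, 10}) {
            System.out.println("bound " + bound + ": "
                + countWalks(edges, "MolecularFunction", bound) + " walks");
        }
    }
}
```

With three self-references the count is 3 + 3^2 + ... + 3^n for a bound of n, so it reaches 88,572 at a bound of 10 and roughly triples with each further step. This toy model does not explain why removing s1 or s2 makes the real schema parse quickly; it only shows why enumeration over loops grows fast and why a bound caps the cost.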

jdusart commented 4 years ago

I have released a new version (1.2.3c) in which I changed the bound to 10. It is not perfect, since it no longer fully checks that the schema is stratified, but it should limit the cost and allow you to parse your schema.

After the holiday I will look for a better solution.
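One possible direction for a check that needs no bound, sketched here under assumptions and not taken from shex-java: treat the schema as a graph of shape references, mark the references that occur under negation, and reject the schema if any negated reference lies on a cycle. That can be tested with one reachability search per negated reference instead of enumerating paths. The names `StratificationSketch`, `Ref`, `reaches` and `isStratified` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: a reachability-based check, not the shex-java implementation.
final class StratificationSketch {

    // Hypothetical edge type: a reference between two shape labels,
    // flagged when the reference occurs under negation.
    static final class Ref {
        final String from;
        final String to;
        final boolean negative;

        Ref(String from, String to, boolean negative) {
            this.from = from;
            this.to = to;
            this.negative = negative;
        }
    }

    // True if `target` can be reached from `source` by following zero or more references.
    static boolean reaches(Map<String, List<String>> successors, String source, String target) {
        Set<String> seen = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(source);
        while (!stack.isEmpty()) {
            String current = stack.pop();
            if (current.equals(target)) {
                return true;
            }
            if (!seen.add(current)) {
                continue; // already expanded, so loops are visited only once
            }
            for (String next : successors.getOrDefault(current, List.of())) {
                stack.push(next);
            }
        }
        return false;
    }

    // True if no negated reference lies on a cycle of references.
    static boolean isStratified(List<Ref> refs) {
        Map<String, List<String>> successors = new HashMap<>();
        for (Ref r : refs) {
            successors.computeIfAbsent(r.from, k -> new ArrayList<>()).add(r.to);
        }
        for (Ref r : refs) {
            // A reference from -> to is on a cycle exactly when `from` is reachable from `to`.
            if (r.negative && reaches(successors, r.to, r.from)) {
                return false;
            }
        }
        return true;
    }
}
```

Each search visits a shape at most once, so the cost is proportional to the number of negated references times the size of the graph, however many loops the schema contains. Whether this matches the exact notion of stratification that the library checks is an assumption that would need to be confirmed against its code.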

goodb commented 4 years ago

Thank you! It looks like it is working correctly and quickly even with the full schema now.

I anticipate the schema will continue to evolve over the coming months and we will be testing it with thousands of different RDF files. Perhaps this will produce some good additional test cases for your work. FYI we will also be doing some performance testing across the different libraries - shaclex in particular.

jdusart commented 4 years ago

OK. We are interested in the performance results.

goodb commented 4 years ago

Informally, the test result is that the Java implementation is substantially faster than the Scala and Python versions. Sorry, we haven't done a proper scientific comparison, but anecdotally it was enough to convince us to use the Java implementation for our production server for now.

(FYI the demonstrator service at http://shexjava.lille.inria.fr seems to be down).

iovka / shex-java

schema parse problem #8