iovka / shex-java

Validation of Shape Expression Schemas
GNU Lesser General Public License v3.0

ShEx java OutOfMemoryError: Java heap space #12

Open ElwinHuaman opened 4 years ago

ElwinHuaman commented 4 years ago

Dear all,

I am testing ShExjava with a large-scale n-quad dataset, and I ran into some issues.

Context: I am using an Intel Core i7-8550U CPU @ 1.80 GHz (4 cores), 16 GB of RAM, Windows 10 64-bit, Java 9.0.4, and Eclipse IDE 4.10. My run configuration has the VM argument -Xmx12G.

ShExMain.java
`...
Model data = Rio.parse(new FileInputStream(dataFile.toFile()), baseIRI, RDFFormat.NQUADS);
Graph dataGraph = factory.asGraph(data);
...

String shMap = "{FOCUS a http://schema.org/CreativeWork}@http://example.org/NameShape";
...
try {
    BaseShapeMap shapeMap = parser.parse(new ByteArrayInputStream(shMap.getBytes()));
    RecursiveValidationWithMemorization algo = new RecursiveValidationWithMemorization(schema, dataGraph);
    ResultShapeMap result = algo.validate(shapeMap);
} catch (Exception e) {
    e.printStackTrace();
}`

Issues: ShEx OutOfMemoryError: Java heap space

Questions: Is there a proper (special) setup of ShEx for validating, e.g., 1 billion n-quads? How many shape maps does ShEx support? What file size or number of triples/n-quads does ShEx support? Under what configuration does ShEx run best? How can the ShEx approach be scaled to large-scale knowledge bases (e.g., by validating constraints directly against SPARQL endpoints)? Which is better: PyShEx, ShEx scale, ShEx.js, ShExjava, or something else?

Thank you so much for your time, and please forgive me if I stated something incorrectly.

Best regards, Elwin

iovka commented 4 years ago

Dear Elwin,

Is there a proper (special) setup of ShEx for validating, e.g., 1 billion n-quads? What file size or number of triples/n-quads does ShEx support? Under what configuration does ShEx run best?

We have never tested ShExjava from this perspective. It currently validates in memory and stores the whole result shape map, so the limit is the memory available for holding that shape map during validation.
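Since the limit is the JVM heap, it can be worth confirming that the -Xmx setting from the run configuration actually reached the JVM before drawing conclusions. A minimal sketch using only `java.lang.Runtime`; the class name `HeapCheck` and its methods are hypothetical and not part of ShExjava:

```java
// Hypothetical helper, not part of ShExjava: reports the heap limits the JVM is running with.
final class HeapCheck {
    private static final long MIB = 1024L * 1024L;

    private HeapCheck() {}

    // Upper bound the heap may grow to (reflects -Xmx when it is set).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    // Heap currently in use: memory reserved by the JVM minus the free part of it.
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap (MiB):  " + maxHeapMiB());
        System.out.println("used heap (MiB): " + usedHeapMiB());
    }
}
```

Calling `usedHeapMiB()` after parsing the data and again after validation would give a rough idea of how much of the heap goes to the graph and how much to the result shape map.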

It is possible to adapt the validation algorithm so that it does not run into an out-of-memory error (provided the data and schema are not extremely recursive, which is a reasonable assumption). I will make a quick modification in that direction and come back to you when I have something running, so that you can test it on your data.

How many shape maps does ShEx support?

I am not sure I understand the question. A shape map is a set of pairs of the form (graph node, shape), and there is only one such shape map during validation.
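To make the "set of pairs" description concrete, here is a minimal sketch of such a structure. It is an illustration only: `ShapeMapSketch`, `Association` and `associate` are hypothetical names, plain strings stand in for graph nodes and shape labels, and this is not how ShExjava's own shape map classes are implemented.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical illustration of a shape map as a set of (graph node, shape) pairs.
final class ShapeMapSketch {
    private ShapeMapSketch() {}

    // One pair of the shape map; IRIs are kept as plain strings for simplicity.
    static final class Association {
        final String node;
        final String shape;

        Association(String node, String shape) {
            this.node = Objects.requireNonNull(node);
            this.shape = Objects.requireNonNull(shape);
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Association)) return false;
            Association a = (Association) other;
            return node.equals(a.node) && shape.equals(a.shape);
        }

        @Override
        public int hashCode() {
            return Objects.hash(node, shape);
        }
    }

    // Pairs every given node with the same shape; duplicate pairs collapse because it is a set.
    static Set<Association> associate(Iterable<String> nodes, String shape) {
        Set<Association> map = new LinkedHashSet<>();
        for (String node : nodes) {
            map.add(new Association(node, shape));
        }
        return map;
    }

    public static void main(String[] args) {
        Set<Association> map = associate(
                Arrays.asList("http://example.org/a", "http://example.org/b"),
                "http://example.org/NameShape");
        System.out.println(map.size() + " associations");
    }
}
```

A set like this holds one entry per (node, shape) pair, so when every pair examined during validation is kept in memory, its size grows with the amount of data being validated.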

How can the ShEx approach be scaled to large-scale knowledge bases (e.g., by validating constraints directly against SPARQL endpoints)?

We have used (partial) validation through a SPARQL endpoint in the ShapeDesigner project: https://gitlab.inria.fr/jdusart/shexjapp. Basically, all you need to do is implement the org.apache.commons.rdf.api.Graph interface. For instance, the class fr.inria.shapedesigner.control.GraphStore in that project does this. It additionally caches the subgraph visited by the SPARQL queries, which you probably would not want to do if your graph is very big.
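The library-independent part of such a Graph implementation is turning a triple-pattern lookup into a query for the endpoint. The following is a minimal sketch under stated assumptions, not the code of GraphStore: `TriplePatternQuery`, `term`, `construct` and `requestUrl` are hypothetical names, only IRI terms are handled (no literals or blank nodes, and no escaping of unusual characters in IRIs), and `null` is used as a wildcard in the same way as in Commons RDF's `Graph.stream(subject, predicate, object)`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds the SPARQL query a Graph lookup could send to an endpoint.
final class TriplePatternQuery {
    private TriplePatternQuery() {}

    // Renders an IRI as <iri>, or as a SPARQL variable when the term is null (wildcard).
    static String term(String iri, String variable) {
        return iri == null ? "?" + variable : "<" + iri + ">";
    }

    // CONSTRUCT query returning exactly the triples that match the pattern.
    static String construct(String subject, String predicate, String object) {
        String pattern = term(subject, "s") + " " + term(predicate, "p") + " " + term(object, "o");
        return "CONSTRUCT { " + pattern + " } WHERE { " + pattern + " }";
    }

    // URL for a SPARQL protocol GET request; the endpoint address is supplied by the caller.
    static String requestUrl(String endpoint, String query) {
        try {
            return endpoint + "?query=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String query = construct(null,
                "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
                "http://schema.org/CreativeWork");
        System.out.println(query);
        // http://example.org/sparql is a placeholder endpoint, not a real service.
        System.out.println(requestUrl("http://example.org/sparql", query));
    }
}
```

A Graph implementation along these lines would send the request, parse the returned triples, and hand them to the validator as a stream, so only the neighbourhood of the nodes being validated is fetched rather than the whole dataset.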

Which is better: PyShEx, ShEx scale, ShEx.js, ShExjava, or something else?

I initiated the ShExjava project, so for me ShExjava is the best :) One of the strong points of ShExjava is its use of efficient validation algorithms, such as the validation algorithm with memorization, which is far from trivial. All of the implementations you mention support the ShEx specification. As far as I know, none of them has been tested with huge amounts of data, so I cannot tell you which one would be better in your case.

Please feel free to react and ask me further questions if something was not clear.

Best regards, Iovka