hobbit-project / platform

HOBBIT benchmarking platform

Storage responds with HTTP 400 when creating large insert statements #41

Closed. Kleanthi closed this issue 7 years ago.

Kleanthi commented 7 years ago

For example, ODIN's evaluation module will create a result that includes the following triple:

<http://w3id.org/hobbit/experiments#TPS_Observation1_for_123> <http://w3id.org/hobbit/experiments#taskID> "1"^^<http://www.w3.org/2001/XMLSchema#integer>

When the triple is sent to the evaluation storage, it is transformed into:

<http://w3id.org/hobbit/experiments#TPS_Observation1_for_123> <http://w3id.org/hobbit/experiments#taskID> 1

MichaelRoeder commented 7 years ago

When creating an example graph with some literals, the model looks like the following:

[http://ex.org/exp, http://ex.org/recall, "0.5"^^http://www.w3.org/2001/XMLSchema#double]
[http://ex.org/R1, http://ex.org/p, "4"^^http://www.w3.org/2001/XMLSchema#double]
[http://ex.org/R1, http://ex.org/p, "3"^^http://www.w3.org/2001/XMLSchema#integer]
[http://ex.org/R1, http://ex.org/p, "2.3"^^http://www.w3.org/2001/XMLSchema#double]
[http://ex.org/R1, http://ex.org/p, "1"^^http://www.w3.org/2001/XMLSchema#int]

After serializing and deserializing the graph, it is slightly changed:

[http://ex.org/exp, http://ex.org/recall, "5.0E-1"^^http://www.w3.org/2001/XMLSchema#double]
[http://ex.org/R1, http://ex.org/p, "1"^^http://www.w3.org/2001/XMLSchema#int]
[http://ex.org/R1, http://ex.org/p, "2.3E0"^^http://www.w3.org/2001/XMLSchema#double]
[http://ex.org/R1, http://ex.org/p, "3"^^http://www.w3.org/2001/XMLSchema#integer]
[http://ex.org/R1, http://ex.org/p, "4.0E0"^^http://www.w3.org/2001/XMLSchema#double]

However, the literal datatypes are still part of the graph. They are removed when calling SparqlQueries.getUpdateQueryFromDiff, which creates the following INSERT query:

WITH <http://test.org/graph>
INSERT {
  <http://ex.org/exp> <http://ex.org/recall> 5.0E-1 .
  <http://ex.org/R1> <http://ex.org/p> 4.0E0 .
  <http://ex.org/R1> <http://ex.org/p> 3 .
  <http://ex.org/R1> <http://ex.org/p> 2.3E0 .
  <http://ex.org/R1> <http://ex.org/p> "1"^^<http://www.w3.org/2001/XMLSchema#int> .
}
WHERE
  {}

yamalight commented 7 years ago

Method that (possibly) creates the issue: https://github.com/hobbit-project/core/blob/master/src/main/java/org/hobbit/storage/queries/SparqlQueries.java#L566

yamalight commented 7 years ago

Code to recreate the wrong INSERT query:

import org.apache.jena.datatypes.xsd.XSDDatatype;
import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;
import org.apache.jena.rdf.model.Literal;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.StmtIterator;
import org.apache.jena.update.UpdateAction;
import org.hobbit.core.rabbit.RabbitMQUtils;
import org.hobbit.storage.queries.SparqlQueries;

public class Temp {

    public static void main(String[] args) throws Exception {
        // Build a model containing numeric literals with different XSD datatypes
        Model model = ModelFactory.createDefaultModel();
        model.addLiteral(model.createResource("http://ex.org/R1"), model.createProperty("http://ex.org/p"), 1);
        model.addLiteral(model.createResource("http://ex.org/R1"), model.createProperty("http://ex.org/p"), 2.3);
        model.addLiteral(model.createResource("http://ex.org/R1"), model.createProperty("http://ex.org/p"),
                model.createTypedLiteral(3, XSDDatatype.XSDinteger));
        model.addLiteral(model.createResource("http://ex.org/R1"), model.createProperty("http://ex.org/p"),
                model.createTypedLiteral(4, XSDDatatype.XSDdouble));

        Literal macroAverageRecallLiteral = model.createTypedLiteral(0.5,
                XSDDatatype.XSDdouble);
        model.add(model.createResource("http://ex.org/exp"), model.createProperty("http://ex.org/recall"), macroAverageRecallLiteral);

        // Print the statements of the model before serialization
        StmtIterator iter = model.listStatements();
        while (iter.hasNext()) {
            System.out.println(iter.next().toString());
        }

        // Serialize the model as the platform does when sending it via RabbitMQ
        byte[] serialized = RabbitMQUtils.writeModel(model);

        Model receivedModel = RabbitMQUtils.readModel(serialized);

        System.out.println("-------------------------------");
        iter = receivedModel.listStatements();
        while (iter.hasNext()) {
            System.out.println(iter.next().toString());
        }

        // Generate the INSERT query from the diff against an empty model
        String query = SparqlQueries.getUpdateQueryFromDiff(ModelFactory.createDefaultModel(), receivedModel, "http://test.org/graph");
        Dataset dataset = DatasetFactory.create();
        dataset.addNamedModel("http://test.org/graph", ModelFactory.createDefaultModel());

        // Execute the update against the dataset and inspect what was actually inserted
        UpdateAction.parseExecute(query, dataset);
        Model result = dataset.getNamedModel("http://test.org/graph");

        System.out.println("-------------------------------");
        iter = result.listStatements();
        while (iter.hasNext()) {
            System.out.println(iter.next().toString());
        }
    }
}

Kleanthi commented 7 years ago

Interesting observation: the configuration (number of INSERT queries = 10, population = 100, number of DGs = 4) is successful and the results are stored without any issues. However, with the configuration (number of INSERT queries = 100, population = 1000, number of DGs = 4), the results cannot be saved, and the logs show that all datatypes are omitted.

MichaelRoeder commented 7 years ago

That would mean that the problem does not arise from the omitted datatypes, since the example above shows that datatypes are almost always omitted.

MichaelRoeder commented 7 years ago

Yes, taking a look at the W3C recommendation for SPARQL (https://www.w3.org/TR/rdf-sparql-query/, section 4.1.2), certain datatypes can be omitted:

1, which is the same as "1"^^xsd:integer
1.3, which is the same as "1.3"^^xsd:decimal
1.300, which is the same as "1.300"^^xsd:decimal
1.0e6, which is the same as "1.0e6"^^xsd:double

So the problem seems to be created by a SPARQL query that is too long.
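
As an illustration of the abbreviation rules above, here is a minimal sketch (assuming Apache Jena on the classpath, as in the reproduction code above) that asks a SPARQL engine which datatypes it assigns to such abbreviated literals:

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.ModelFactory;

public class DatatypeCheck {
    public static void main(String[] args) {
        // DATATYPE() reveals the XSD type the parser assigns to each abbreviated literal
        String query = "SELECT (DATATYPE(1) AS ?a) (DATATYPE(1.3) AS ?b) (DATATYPE(1.0e6) AS ?c) WHERE {}";
        try (QueryExecution qe = QueryExecutionFactory.create(query, ModelFactory.createDefaultModel())) {
            // Expected: xsd:integer, xsd:decimal, xsd:double
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}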

Kleanthi commented 7 years ago

Most probably, yes, the problem is the large query. I tried to run a large INSERT query, as generated by my benchmark, against a local instance of Virtuoso, and Virtuoso is not able to process the whole query; it results in a syntax error.

MichaelRoeder commented 7 years ago

The reason for the error that you get when executing the query using the UI of Virtuoso is that the UI is based on HTTP GET. GET requests have a maximum parameter length, so your browser (or the Virtuoso server) cuts off everything beyond that limit, creating an incomplete SPARQL query.

However, our storage service uses an UpdateProcessRemote instance for the UPDATE (https://github.com/hobbit-project/platform/blob/master/platform-storage/storage-service/src/main/java/org/hobbit/storage/service/StorageService.java#L113). According to its Javadoc, this creates an HTTP POST request, which is not bound to the length limitation of a GET request.

That means that the problem cannot be recreated using the UI of Virtuoso.
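
For reference, a minimal sketch of executing an update this way (the endpoint URL is a placeholder for a local Virtuoso instance):

import org.apache.jena.update.UpdateExecutionFactory;
import org.apache.jena.update.UpdateFactory;
import org.apache.jena.update.UpdateProcessor;

public class RemoteUpdateSketch {
    public static void main(String[] args) {
        String endpoint = "http://localhost:8890/sparql"; // placeholder endpoint URL
        String update = "INSERT DATA { GRAPH <http://test.org/graph> { <http://ex.org/s> <http://ex.org/p> 1 } }";
        // createRemote builds an UpdateProcessRemote, which sends the update via HTTP POST
        UpdateProcessor processor = UpdateExecutionFactory.createRemote(UpdateFactory.create(update), endpoint);
        processor.execute();
    }
}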

MichaelRoeder commented 7 years ago

failing SPARQL query

yamalight commented 7 years ago

The problem is Virtuoso's query length limit:

Virtuoso 37000 Error SP031: SPARQL: Internal error: The length of generated SQL text has exceeded 10000 lines of code

yamalight commented 7 years ago

A new function List<String> getUpdateQueriesFromDiff in core should solve that:

    public static final String getUpdateQueryFromDiff(Model original, Model updated, String graphUri) {
        UpdateDeleteInsert update = new UpdateDeleteInsert();
        Node graph = null;
        if (graphUri != null) {
            graph = NodeFactory.createURI(graphUri);
            update.setWithIRI(graph);
        }
        StmtIterator iterator;

        // deleted statements
        Model temp = original.difference(updated);
        iterator = temp.listStatements();
        QuadAcc quads = update.getDeleteAcc();
        while (iterator.hasNext()) {
            quads.addTriple(iterator.next().asTriple());
        }

        // inserted statements
        temp = updated.difference(original);
        iterator = temp.listStatements();
        quads = update.getInsertAcc();
        while (iterator.hasNext()) {
            quads.addTriple(iterator.next().asTriple());
        }

        return update.toString(original);
    }

    public static final String getUpdateQueryFromStatements(List<Statement> deleted, List<Statement> inserted, Model mapping, String graphUri) {
        UpdateDeleteInsert update = new UpdateDeleteInsert();
        Node graph = null;
        if (graphUri != null) {
            graph = NodeFactory.createURI(graphUri);
            update.setWithIRI(graph);
        }
        Iterator<Statement> iterator;

        // deleted statements
        iterator = deleted.iterator();
        QuadAcc quads = update.getDeleteAcc();
        while (iterator.hasNext()) {
            quads.addTriple(iterator.next().asTriple());
        }

        // inserted statements
        iterator = inserted.iterator();
        quads = update.getInsertAcc();
        while (iterator.hasNext()) {
            quads.addTriple(iterator.next().asTriple());
        }

        return update.toString(mapping);
    }

    public static final List<String> getUpdateQueriesFromDiff(Model original, Model updated, String graphUri) {
        List<String> results = new ArrayList<>();
        Model deleted = original.difference(updated);
        Model inserted = updated.difference(original);

        int totalSize = Math.toIntExact(deleted.size() + inserted.size());
        if (totalSize <= MAX_QUERY_TRIPLES) {
            String query = getUpdateQueryFromDiff(original, updated, graphUri);
            results.add(query);
            return results;
        }

        // Materialize the statements once instead of listing them for every page
        List<Statement> delList = deleted.listStatements().toList();
        List<Statement> addList = inserted.listStatements().toList();

        // Round the page count up so a partially filled last page is not dropped
        int pages = (totalSize + MAX_QUERY_TRIPLES - 1) / MAX_QUERY_TRIPLES;
        for (int i = 0; i < pages; i++) {
            int startIndex = i * MAX_QUERY_TRIPLES;
            int endIndex = startIndex + MAX_QUERY_TRIPLES;
            // Clamp the indices to each list's size to avoid IndexOutOfBoundsException
            List<Statement> delStatements = delList.subList(Math.min(startIndex, delList.size()),
                    Math.min(endIndex, delList.size()));
            List<Statement> addStatements = addList.subList(Math.min(startIndex, addList.size()),
                    Math.min(endIndex, addList.size()));
            String query = getUpdateQueryFromStatements(delStatements, addStatements, original, graphUri);
            results.add(query);
        }

        return results;
    }
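
A minimal usage sketch (the model contents and graph URI are placeholders): every partial query returned by getUpdateQueriesFromDiff has to be executed separately.

import java.util.List;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.hobbit.storage.queries.SparqlQueries;

public class SplitUpdateSketch {
    public static void main(String[] args) {
        Model original = ModelFactory.createDefaultModel(); // placeholder: old state
        Model updated = ModelFactory.createDefaultModel();  // placeholder: new state
        updated.add(updated.createResource("http://ex.org/R1"),
                updated.createProperty("http://ex.org/p"),
                updated.createTypedLiteral(1));
        // Each partial query has to be sent to the storage on its own
        List<String> queries = SparqlQueries.getUpdateQueriesFromDiff(original, updated, "http://test.org/graph");
        for (String query : queries) {
            System.out.println(query); // in the platform, each query would be sent to the storage service
        }
    }
}
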
MichaelRoeder commented 7 years ago

Implemented the requested changes in the core library. https://github.com/hobbit-project/core/commit/9caa4687b7b10b593fe127c8e489473f87eb27ff

Deployed the new version 1.0.3 of the library.

MichaelRoeder commented 7 years ago

For documentation: I got a mail from Milos pointing me to a discussion on a mailing list: http://dbpedia-discussion.narkive.com/dwQ2KVKS/sparql-endpoint-line-of-code-limit There, it is explained that 10,000 lines of code correspond to roughly 200K of text.

MichaelRoeder commented 7 years ago

The problem still occurs, even with SPARQL INSERT statements that contain only 700 triples. The introduced SPARQL INSERT query splitting is not used correctly in the platform controller.

yamalight commented 7 years ago

I'd suggest creating a test case query and sending it to the OL guys.

MichaelRoeder commented 7 years ago

They already sent me the information they have about the problem.

The problem should be fixed with https://github.com/hobbit-project/core/commit/09041d3cec4a9dcf059f7714345e3a93a98900e1, but I couldn't test it locally. Will do that ASAP on a larger machine.