vemonet / rdflib-endpoint

💫 Deploy SPARQL endpoints from RDFLib Graphs to serve RDF files, machine learning models, or any other logic implemented in Python
https://pypi.org/project/rdflib-endpoint
MIT License

Performance expectations? #22

Open Gerbert-Kaandorp opened 2 months ago

Gerbert-Kaandorp commented 2 months ago

Hi Vincent!

Me again :), thanks a lot for adding the types in the last release! It is working out great in my dev stack and I don't need a converter any more! 🎉🎉

So, I wanted to test the performance of this setup a bit. I downloaded this Pokemon dataset

https://triplydb.com/academy/pokemon

It is about 4.893 MB (~29,000 triples), and I am using the following function to insert them over HTTP through the endpoint:

import rdflib
import requests
from rdflib import URIRef


def upload_data_from_disk_multi_graph(url, filename='data/pokemon.trig', format="trig", batch_size=3000):
    ds = rdflib.Dataset()
    ds.parse(filename, format=format)

    # Iterate over each graph in the dataset
    for graph in ds.graphs():
        graph_uri = graph.identifier
        batches = []
        batch = []

        # Prepare batches of triples
        for s, p, o in graph:
            batch.append((s, p, o))
            if len(batch) >= batch_size:
                batches.append(batch)
                batch = []
        if batch:
            batches.append(batch)

        # Execute batch inserts for each batch
        for batch in batches:
            insert_query = f"INSERT DATA {{ GRAPH <{graph_uri}> {{ "
            for s, p, o in batch:
                # Ensure proper serialization of objects into N3 format
                s_n3 = s.n3() if isinstance(s, URIRef) else f"<{s}>"
                p_n3 = p.n3() if isinstance(p, URIRef) else f"<{p}>"
                o_n3 = o.n3()
                insert_query += f"{s_n3} {p_n3} {o_n3} . "
            insert_query += "}}"
            print(f"Executing batch insert for graph {graph_uri}: {len(insert_query)} characters.")

            response = requests.post(url, data={'update': insert_query}, headers={'Accept': 'application/ld+json'})
            print("Response Status:", response.status_code)

Turns out, this is extremely slow. :(

I am also not sure I am even using the API the right way. Do you know what I am doing wrong, or is this performance normal for RDFLib?

Thanks for reading. Gerbert

vemonet commented 1 month ago

Hi @Gerbert-Kaandorp, we are just executing the provided update query using:

parsed_update = prepareUpdate(update_query, initNs=graph_ns)
self.graph.update(parsed_update, "sparql")

So I guess this is just RDFLib not being very fast at inserting data through update queries.

In general, I don't think INSERT DATA is a fast way to load a lot of data into any triplestore. Usually they provide a separate call specifically for bulk loading Turtle/XML files, which we could also add relatively easily here: a call that takes an RDF file and parses it into the graph used by the endpoint. INSERT DATA is aimed more at making small changes on the fly from an application (adding a few dozen or a few hundred triples).
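
For the small on-the-fly changes mentioned above, a minimal sketch of building one compact INSERT DATA request could look like the following. The helper name `build_insert_data` and the example IRIs are hypothetical, and it assumes every term has already been serialized to N3 (for example with RDFLib's `.n3()`).

```python
from typing import Iterable, Tuple

# Each triple is (subject, predicate, object), already serialized as N3 strings.
Triple = Tuple[str, str, str]


def build_insert_data(graph_uri: str, triples: Iterable[Triple]) -> str:
    """Build a single INSERT DATA update targeting one named graph.

    Hypothetical helper for illustration; it does no escaping or validation,
    so the terms must already be valid N3.
    """
    # Join all statements in one pass instead of growing a string with +=
    body = " ".join(f"{s} {p} {o} ." for s, p, o in triples)
    return f"INSERT DATA {{ GRAPH <{graph_uri}> {{ {body} }} }}"


query = build_insert_data(
    "http://example.org/graph",  # assumed graph IRI
    [("<http://example.org/pikachu>", "<http://example.org/hp>", '"35"')],
)
```

The resulting string can be sent as the `update` form field, as in the function from the question. This keeps updates short, which is the use case INSERT DATA seems suited to.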

If you have control over the server where you deploy the endpoint, the recommended way is simply to parse the file you want to load with RDFLib, then use that graph when instantiating the SparqlEndpoint.
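
A minimal sketch of that approach, assuming the file path from the question and that `SparqlEndpoint` accepts the loaded dataset through its `graph` keyword as in the project README (check this against your installed version). The function name `build_app` is hypothetical.

```python
from pathlib import Path


def build_app(rdf_path: str = "data/pokemon.trig", rdf_format: str = "trig"):
    """Parse the RDF file once at startup and serve it through the endpoint."""
    # Fail early with a clear error if the file is missing
    if not Path(rdf_path).is_file():
        raise FileNotFoundError(f"RDF file not found: {rdf_path}")

    # Third-party imports are kept local so the module can still be imported
    # where rdflib / rdflib-endpoint are not installed (an illustrative choice).
    from rdflib import Dataset
    from rdflib_endpoint import SparqlEndpoint

    ds = Dataset()
    # Bulk load through the parser; no SPARQL update queries are involved
    ds.parse(rdf_path, format=rdf_format)

    # Assumption: `graph=` is the keyword used to pass in the pre-loaded data
    return SparqlEndpoint(graph=ds)
```

With `app = build_app()` at module level in a file such as `main.py` (an assumed name), the endpoint could then be started with `uvicorn main:app`.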