We previously checked if a curie prefix was in our namespace via (I wrote this):
if prefix not in self.namespace_manager.namespaces():
Which in rdflib called
def namespaces(self):
for prefix, namespace in self.store.namespaces():
namespace = URIRef(namespace)
yield prefix, namespace
This check was applied for every triple added
The issue here being:
This never evaluates as true
This check is creating 3x len(namespace) URIRef objects every time we call addTriple, and then the perf hit of running graph.bind, in which the default is to override the namespace
In rdflib5 creating URIRef objects are slightly more expensive, which was exaggerated by this line.
version 5 is still slower than 4, the panther ingest on v5.0 takes 2hr 44min, and on 4.2 it's 1 hr 55 min, but it probably makes sense to stay up to date with the latest rdflib to get the turtle curie improvements, @TomConlin your thoughts?
Upgrades rdflib back to version 5.0.0
We previously checked if a curie prefix was in our namespace via (I wrote this):
if prefix not in self.namespace_manager.namespaces():
Which in rdflib called
This check was applied for every triple added
The issue here being:
In rdflib5 creating URIRef objects are slightly more expensive, which was exaggerated by this line.
I removed this in https://github.com/monarch-initiative/dipper/pull/746/commits/8b00ab38d23b9ad0b02381765ef5f145199cedae and retesting rdflib 5 is now more comparable to rdflib 4.2.2
version 5 is still slower than 4, the panther ingest on v5.0 takes 2hr 44min, and on 4.2 it's 1 hr 55 min, but it probably makes sense to stay up to date with the latest rdflib to get the turtle curie improvements, @TomConlin your thoughts?