monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
56 stars 26 forks source link

Update rdflib to version 5 #953

Closed kshefchek closed 3 years ago

kshefchek commented 3 years ago

Upgrades rdflib back to version 5.0.0

We previously checked if a curie prefix was in our namespace via (I wrote this): if prefix not in self.namespace_manager.namespaces():

Which in rdflib called

    def namespaces(self):
        for prefix, namespace in self.store.namespaces():
            namespace = URIRef(namespace)
            yield prefix, namespace

This check was applied for every triple added

The issue here being:

  1. This never evaluates as true
  2. This check is creating 3x len(namespace) URIRef objects every time we call addTriple, and then the perf hit of running graph.bind, in which the default is to override the namespace

In rdflib5 creating URIRef objects are slightly more expensive, which was exaggerated by this line.

I removed this in https://github.com/monarch-initiative/dipper/pull/746/commits/8b00ab38d23b9ad0b02381765ef5f145199cedae and retesting rdflib 5 is now more comparable to rdflib 4.2.2

version 5 is still slower than 4, the panther ingest on v5.0 takes 2hr 44min, and on 4.2 it's 1 hr 55 min, but it probably makes sense to stay up to date with the latest rdflib to get the turtle curie improvements, @TomConlin your thoughts?

TomConlin commented 3 years ago

I do prefer to avoiding lagging on key dependencies. Be prepared for on Tartini to be a fair bit slower.