The build should be deterministic: the same inputs should always generate the same output.
Nico wrote:
Is there any way we can get the serialisation to be deterministic? It would be so cool if omim/data/omim_new.ttl would only change when running the pipeline if, and only if, there is a change, but when I ran it now, I got a few thousand changes like this:
It looks like the issue is that the BNode IDs are random by design. Here's what runs when a BNode is initialized:
def _serial_number_generator():
    """
    Generates UUID4-based but ncname-compliant identifiers.
    """
    from uuid import uuid4
    def _generator():
        return uuid4().hex
    return _generator
I think I'll need to make a new class inheriting from the BNode class in the rdflib.term module. What I'm thinking is that, since these BNodes appear to be tied to something concrete (OMIM codes, I think), I can instead override _serial_number_generator() to use a hashing algorithm, e.g. md5 with the OMIM code as input.
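A minimal sketch of the hashing idea, using only the standard library. The helper name `deterministic_bnode_id` and the `prefix` parameter are hypothetical and not part of rdflib or the OMIM pipeline. It assumes each blank node can be keyed by a single OMIM code string. If several blank nodes hang off the same code, an extra discriminator (for example the predicate) would have to be mixed into the hash to avoid collisions.

```python
import hashlib


def deterministic_bnode_id(omim_code: str, prefix: str = "b") -> str:
    """Derive a stable blank-node identifier from an OMIM code.

    Hypothetical helper: the same code always yields the same ID,
    unlike the uuid4-based default shown above.
    """
    digest = hashlib.md5(omim_code.encode("utf-8")).hexdigest()
    # Lead with a letter so the identifier never starts with a digit.
    return f"{prefix}{digest}"


if __name__ == "__main__":
    first = deterministic_bnode_id("100100")
    second = deterministic_bnode_id("100100")
    print(first == second)  # identical input gives an identical ID
    # With rdflib installed, the ID could be supplied as the node's value:
    #     from rdflib import BNode
    #     node = BNode(first)
```

Since `BNode` accepts an explicit identifier as its first argument, passing the hashed value directly may be enough, and subclassing `BNode` might not be needed at all. Whether that alone makes the serialised file byte-for-byte stable also depends on the order in which triples are written, which this sketch does not address.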
Originally posted by @matentzn in https://github.com/monarch-initiative/omim/issues/13#issuecomment-925642681