monarch-initiative / omim

Data ingest pipeline for OMIM.

Build should be deterministic #15

Closed joeflack4 closed 2 years ago

joeflack4 commented 2 years ago

Description

Build should be deterministic; same inputs should always generate same output.

Nico wrote:

Is there any way we can get the serialisation to be deterministic? It would be so cool if omim/data/omim_new.ttl would only change when running the pipeline if, and only if, there is a change, but when I ran it now, I got a few thousand changes like this:

image

Originally posted by @matentzn in https://github.com/monarch-initiative/omim/issues/13#issuecomment-925642681

Possible solutions

It looks like the issue is that BNode IDs are random by design. Here's what generates the ID when a BNode is created without an explicit value:

def _serial_number_generator():
    """
    Generates UUID4-based but ncname-compliant identifiers.
    """
    from uuid import uuid4

    def _generator():
        return uuid4().hex

    return _generator

I think I'll need to make a new class that inherits from the BNode class in the rdflib.term module. These BNodes appear to be tied to something concrete (OMIM codes, I think), so instead of a random UUID I could override _serial_number_generator() to use a hashing algorithm, e.g. MD5 with the OMIM code as input.
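A minimal sketch of the hashing idea, using only the standard library. The function name deterministic_bnode_id, the "restriction" role label, and the example codes are hypothetical, not taken from the pipeline. The leading "N" is an assumption that mirrors the prefix rdflib adds to generated IDs so the result never starts with a digit.

```python
import hashlib


def deterministic_bnode_id(*parts: str) -> str:
    """Return a stable ID derived from the given parts (hypothetical helper).

    The same parts always produce the same ID, unlike uuid4().hex.
    """
    # Join with a separator that should not occur in the parts, so that
    # ("12", "3") and ("1", "23") do not produce the same key.
    key = "\x1f".join(parts).encode("utf-8")
    # MD5 serves only as a stable fingerprint here, not for security.
    return "N" + hashlib.md5(key).hexdigest()


# Example codes and role label are illustrative assumptions.
first = deterministic_bnode_id("100100", "restriction")
second = deterministic_bnode_id("100100", "restriction")
other = deterministic_bnode_id("100200", "restriction")

print(first == second)  # same input, same ID on every run
print(first == other)   # different OMIM code, different ID
```

Since BNode accepts an explicit value as its first argument, the resulting string could probably be passed straight in as BNode(deterministic_bnode_id(...)), which may make a subclass unnecessary. If one OMIM code yields several BNodes, each would need an extra distinguishing part (such as the role label above) to avoid collisions.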