monarch-initiative / monarch-ingest

Data ingest application for Monarch Initiative knowledge graph using Koza
https://monarchinitiative.org
14 stars 1 forks source link

Update biolink model associations we use to have classes for subject & object #233

Closed kevinschaper closed 2 years ago

kevinschaper commented 2 years ago

The changes for this ticket will go into the biolink model repo, but I want to write the details down here.

At the root level of Association, subject & object are defined as being just strings, but in some of the more specific associations either the subject or object is defined as pointing to a class or mixin. For example, right now GeneToDiseaseAssociation has the subject defined as a GeneOrGeneProduct, but the object is still defined as a string.

When we move to linkml pydanticgen, we're not going to be able to use strings as the subject or object when the slot is defined with a class, which means that right now creating GeneToDiseaseAssociation will look like:

GeneToDiseaseAssociation(id="...", 
                         subject=Gene(id="MGI:123"), 
                         predicate="...", 
                         object="MONDO:123")

Ideally, as we move to this format, we should update the biolink model to use classes on both sides of all of our associations, giving us:

GeneToDiseaseAssociation(id="...", 
                         subject=Gene(id="MGI:123"),  
                         predicate="...", 
                         object=Disease(id="MONDO:123"))

This is the right approach for the model & the code generation, but it's also good for us, because it's an opportunity to validate the subject & object as more than a string.

These are actually defined properly using a mixing to set the subject, but pydanticgen isn't setting the slot correctly

kevinschaper commented 2 years ago

The GO annotations are turning out to be more challenging.

functional annotation uses the gene ontology class mixin to set the subject, which is nicely nested as a class when I'm using https://github.com/linkml/linkml/pull/714

  functional association:
...
      object:
        range: gene ontology class

However, for the subject, the range is defined as molecular activity

  macromolecular machine to molecular activity association:
    slot_usage:
      object:
        range: molecular activity

which is a class, and I'm still getting a string in my generated pydantic classes

@dataclass(config=PydanticConfig)
class MacromolecularMachineToMolecularActivityAssociation(MacromolecularMachineToEntityAssociationMixin, FunctionalAssociation):
    subject: MacromolecularMachineMixin = Field(None)
    predicate: str = Field(None)
    object: str = Field(None)

That's controlled by this bit of logic in pydanticgen, which I'm not totally sure about.

                if s.range in sv.all_classes():
                    range_cls = sv.get_class(s.range)
                    #pyrange = f'"{camelcase(s.range)}"'
                    if range_cls.class_uri == "linkml:Any":
                        pyrange = 'Any'
                    elif s.inlined or sv.get_identifier_slot(range_cls.name) is None:
                        pyrange = f'{camelcase(s.range)}'
                        if sv.get_identifier_slot(range_cls.name) is not None and not s.inlined_as_list:
                            #collection_type = sv.get_identifier_slot(range_cls.name).range
                            collection_type = 'str'
                    else:
                        pyrange = 'str'

In this case, the molecular activity range is not inlined, and does have an identifier slot, the range is a string, rather than the class name. It seems like s.inlined here is reversed (inlined means use a string, I think?), but also, this seems to be shooting for always using a string when the range is a class with an ID, and I think we don't want that?

putmantime commented 2 years ago

@kevinschaper Please review where linkml and biolink PRs are for this.

kevinschaper commented 2 years ago

It turns out that with the linkml pedanticgen we're going to be just fine with using IDs for subject & object and we won't need to inline them - so I'm going to close this.