monarch-initiative / omim

Data ingest pipeline for OMIM.
6 stars 2 forks source link

Considerations from OMIM from Ada #104

Open nicolevasilevsky opened 11 months ago

nicolevasilevsky commented 11 months ago

Overview

From Ada: There are two big caveats for OMIM: Be concerned about (and likely exclude) null entries (those without a #, +, or % in front of them. There are some entries in OMIM that should be organized into PS, but have not been yet.

Sub-task list

Sub-task details

1. Unit tests for prefix types

Be concerned about (and likely exclude) null entries (those without a #, +, or % in front of them Chris recommends writing unit tests in ingest and collect one example of each of these. This will be a comparison.

joeflack4 commented 11 months ago

Thanks @nicolevasilevsky for relaying this.

I just did a quick investigation to see how we're currently treating these types.

The types, defined by their prefix characters (*, etc) are declared here in this short file omim_type.py.

There is some branching logic that uses these types here. Some things I noticed which perhaps are in conflict with what Ada said:

Potential conflict 1: * and +

Maybe this isn't a conflict. I just didn't see the comment from Ada mention *. So I don't know if she would think they should be handled differently or the same, but currently they are are handled the same (at least in the file linked above):

    elif omim_type == OmimType.GENE.value or omim_type == OmimType.HAS_AFFECTED_FEATURE.value:  # * or +
        omim_type = OmimType.GENE.value
        graph.add((omim_uri, RDFS.label, Literal(abbrev)))
        graph.add((omim_uri, RDFS.subClassOf, SO['0000704']))
        graph.add((omim_uri, BIOLINK['category'], BIOLINK['Gene']))

Potential conflict 2: ^ and NULL

  1. These are both treated the same. I also noticed that the comment from Ada doesn't mention ^, but I don't know if that's a mistake or intentional.
  2. We don't excluded them. We do ingest them, but we don't add any extra triples, e.g. we don't categorize them as genes, diseases, etc. From what I can see at a cursory glance, we just add their labels:
    else:  # ^ or NULL (no prefix character)
        graph.add((omim_uri, RDFS.label, Literal(cleanup_label(label))))

I'm not sure how much of a concern this is; I marked as low urgency.

matentzn commented 11 months ago

This needs to be raised at a tech call and has moderate urgency

nicolevasilevsky commented 10 months ago

Chris recommends writing unit tests in ingest and collect one example of each of these. This will be a comparison.

All ingest code should have unit tests with examples.

joeflack4 commented 10 months ago

@nicolevasilevsky Crap, I realized now that we didn't discuss this one at the last meeting I don't think.

  1. There are some entries in OMIM that should be organized into PS, but have not been yet.

I don't know what the issue is here. Do you? Or do you think we need to discuss again at a meeting at some point?

nicolevasilevsky commented 1 month ago

I think this is something that should be discussed on a tech call. I don't remember all of the context.