monarch-initiative / gard

GARD ingest.

`gard-mondo.sssom.tsv` etc #10

Closed joeflack4 closed 1 year ago

joeflack4 commented 1 year ago

Updates

joeflack4 commented 1 year ago

@matentzn New release here with the new gard-sssom.mondo.tsv, also uploaded to the spreadsheet. I have some questions and comments (e.g. analysis results) that I'll bring to our Friday meeting.

matentzn commented 1 year ago

This can be merged when it also fixes https://github.com/monarch-initiative/gard/issues/11

joeflack4 commented 1 year ago

@matentzn No time to explain today. Will explain tomorrow. I made a notebook, etc. But if you are curious and really want to spend part of your weekend looking at this, take a look at these two files. (1) was derived from (2) using a tentative algorithm. Note that I was wrong about 178 things unmapped! There are actually only 3 unmapped GARD terms!

matentzn commented 1 year ago

This is great news, fantastic @joeflack4 !

matentzn commented 1 year ago

No duplicates?

With this data, I could finally do:

https://github.com/monarch-initiative/mondo/pull/6347

joeflack4 commented 1 year ago

@matentzn That's good news! However, I would hold off until you have looked at gard-mondo_curation.sssom.tsv.zip.

Concerning duplicates, there shouldn't be any in gard.sssom.tsv.zip because it is 100% in my algorithmic control. gard-mondo_curation.sssom.tsv.zip is the first step and has about 6,000 duplicate rows (i.e. ~3,000 cases). Then, I use the following algorithm to remove the duplicates:

  1. By GARD term, collect all of the proxy mappings found, grouped by mapping predicate

    proxy_mappings: Dict[CURIE, Dict[MAPPING_PREDICATE, List[Dict]]] = {}
    # Example:
    # 'GARD:123': {
    #    'skos:exactMatch': [...]  # list of rows of GARD->OMIM/ODO->Mondo proxy matches
    #    'skos:narrowMatch': [...]  # list of rows of GARD->OMIM/ODO->Mondo proxy matches
    #     ...
    # }
    preds = set(mappings_by_pred.keys())
  2. Simple algorithm to pick mapping predicate: basically, if any of the proxy mappings is a skos:exactMatch, we consider the GARD and Mondo terms to be mapped as skos:exactMatch. Else, if skos:narrowMatch exists, use that instead, and so on. You can see that this algorithm is very naïve. @matentzn This is where I need help. I think this algorithm needs to be looked at by you and possibly replaced with a better one. Or, we need to go through those ~3,000 in gard-mondo_curation.sssom.tsv.zip and manually curate some/all of them.

        pred = 'skos:exactMatch' if 'skos:exactMatch' in preds \
            else 'skos:narrowMatch' if 'skos:narrowMatch' in preds \
            else 'skos:broadMatch' if 'skos:broadMatch' in preds \
            else 'skos:relatedMatch'

I also set up a placeholder for a more granular algorithm, but didn't implement any logic yet:

        # if preds == {'skos:narrowMatch', 'skos:exactMatch', 'skos:broadMatch'}:
        #     pass
        # elif preds == {'skos:narrowMatch', 'skos:broadMatch'}:
        #     pass
        # elif preds == {'skos:narrowMatch', 'skos:exactMatch'}:
        #     pass
        # elif preds == {'skos:exactMatch', 'skos:broadMatch'}:
        #     pass
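For illustration, steps 1 and 2 above can be sketched as a self-contained script. This is only a sketch: it assumes each proxy mapping row is a dict with SSSOM-style `subject_id`, `predicate_id`, and `object_id` keys, and the helper names and toy IDs are hypothetical, not taken from the repository.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Precedence used by the naive algorithm: the first predicate present wins.
PREDICATE_PRECEDENCE = [
    'skos:exactMatch',
    'skos:narrowMatch',
    'skos:broadMatch',
    'skos:relatedMatch',
]


def group_by_predicate(rows: List[Dict[str, str]]) -> Dict[str, Dict[str, List[Dict[str, str]]]]:
    """Step 1: group proxy mapping rows by GARD term, then by mapping predicate."""
    grouped: Dict[str, Dict[str, List[Dict[str, str]]]] = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[row['subject_id']][row['predicate_id']].append(row)
    return grouped


def pick_predicate(preds: Iterable[str]) -> str:
    """Step 2: return the highest-precedence predicate present.

    Falls back to skos:relatedMatch, like the chained conditional above.
    """
    present = set(preds)
    for pred in PREDICATE_PRECEDENCE:
        if pred in present:
            return pred
    return 'skos:relatedMatch'


# Toy rows with hypothetical IDs, for illustration only.
rows = [
    {'subject_id': 'GARD:1', 'predicate_id': 'skos:broadMatch', 'object_id': 'MONDO:10'},
    {'subject_id': 'GARD:1', 'predicate_id': 'skos:exactMatch', 'object_id': 'MONDO:11'},
    {'subject_id': 'GARD:2', 'predicate_id': 'skos:narrowMatch', 'object_id': 'MONDO:20'},
    {'subject_id': 'GARD:2', 'predicate_id': 'skos:relatedMatch', 'object_id': 'MONDO:21'},
]

chosen = {
    gard_id: pick_predicate(by_pred.keys())
    for gard_id, by_pred in group_by_predicate(rows).items()
}
print(chosen)
```

Keeping the precedence in a list makes it easy to swap in a different ordering, or to replace `pick_predicate` with the more granular set-based logic once it is decided.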

I was going to discuss in more detail, but maybe this is sufficient for now. Take a look as well at this notebook: https://github.com/monarch-initiative/gard/blob/gard-mondo-sssom/analysis.ipynb

matentzn commented 1 year ago

We can review your algorithm in more detail a bit later this week. For now, I only care about this case:

Given only this rule:

CONDITION1: MONDO:123-exact-ORDO|OMIM:321 (from mondo.sssom.tsv)
CONDITION2: GARD:123-exact-ORDO|OMIM:321 (from gard.sssom.tsv)
RULE---> MONDO:123-exact-GARD:123

Can I assume, with only that rule, that

i. all GARD ids in gard.sssom.tsv get a corresponding MONDO id
ii. no duplicate GARD ids result in the resulting mapping
iii. no duplicate MONDO ids result in the resulting mapping

?
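The rule and the three questions can be sketched in code as follows. This is a minimal sketch under assumptions: rows are dicts with SSSOM-style `subject_id`, `predicate_id`, and `object_id` keys, the Mondo rows have the Mondo term as subject and the ORDO/OMIM term as object, and all IDs and function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

EXACT = 'skos:exactMatch'


def join_exact(gard_rows: List[Dict[str, str]],
               mondo_rows: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (MONDO id, GARD id) pairs linked via a shared ORDO/OMIM id,
    keeping only cases where both sides are skos:exactMatch."""
    # Index Mondo ids by the proxy (ORDO/OMIM) id they exactly match.
    mondo_by_proxy: Dict[str, Set[str]] = {}
    for row in mondo_rows:
        if row['predicate_id'] == EXACT:
            mondo_by_proxy.setdefault(row['object_id'], set()).add(row['subject_id'])
    pairs = set()
    for row in gard_rows:
        if row['predicate_id'] == EXACT:
            for mondo_id in mondo_by_proxy.get(row['object_id'], ()):
                pairs.add((mondo_id, row['subject_id']))
    return sorted(pairs)


def check(pairs: List[Tuple[str, str]], gard_rows: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Report on questions i-iii: unmapped GARD ids, duplicate GARD ids, duplicate MONDO ids."""
    all_gard = {row['subject_id'] for row in gard_rows}
    gard_counts = Counter(gard_id for _, gard_id in pairs)
    mondo_counts = Counter(mondo_id for mondo_id, _ in pairs)
    return {
        'unmapped_gard': sorted(all_gard - set(gard_counts)),
        'duplicate_gard': sorted(g for g, n in gard_counts.items() if n > 1),
        'duplicate_mondo': sorted(m for m, n in mondo_counts.items() if n > 1),
    }


# Toy data with hypothetical IDs, for illustration only.
mondo_rows = [
    {'subject_id': 'MONDO:1', 'predicate_id': EXACT, 'object_id': 'OMIM:100'},
    {'subject_id': 'MONDO:2', 'predicate_id': EXACT, 'object_id': 'ORDO:200'},
    {'subject_id': 'MONDO:3', 'predicate_id': 'skos:narrowMatch', 'object_id': 'OMIM:300'},
]
gard_rows = [
    {'subject_id': 'GARD:1', 'predicate_id': EXACT, 'object_id': 'OMIM:100'},
    {'subject_id': 'GARD:2', 'predicate_id': EXACT, 'object_id': 'ORDO:200'},
    {'subject_id': 'GARD:3', 'predicate_id': EXACT, 'object_id': 'OMIM:300'},
]

pairs = join_exact(gard_rows, mondo_rows)
report = check(pairs, gard_rows)
print(pairs)
print(report)
```

In this toy data GARD:3 ends up unmapped because the Mondo side of its proxy is not exact, which shows that (i) is not guaranteed by the rule alone.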

joeflack4 commented 1 year ago

Mondo mapping predicates currently ignored

@matentzn Actually, this could be an important point. I don't remember if this was my decision or something you suggested, but the predicate in mondo.sssom.tsv is ignored. So, if GARD:123-exact-ORDO|OMIM:321, and OMIM:321-something-MONDO:123, then result is GARD:123-exact-MONDO:123.

Question answers

But I can still answer your questions:

i. yes
ii. yes, but this is only because my algorithm picks just 1 of the mappings when there are more than 1, and I don't know if what it's doing is correct in all cases.
iii. Do you mean 'mapping' or 'mappings'? One mapping is just 1 row, so there can only be one object_id / MONDO ID. But, it's certainly the case that the same MONDO ID can appear in multiple rows/mappings within the gard.sssom.tsv.

matentzn commented 1 year ago

> the predicate in mondo.sssom.tsv is ignored. So, if GARD:123-exact-ORDO|OMIM:321, and OMIM:321-something-MONDO:123, then result is GARD:123-exact-MONDO:123.

Very important, the predicate_id should not be ignored! It must be "exact"!

> ii. yes, but this is only because my algorithm picks just 1 of the mappings when there are more than 1, and I don't know if what it's doing is correct in all cases.

For now, I only care about the cases where both mappings are "exact". Everything else we can deal with at a later point. So it's cool that gard.sssom.tsv contains non-exact mappings, but for the mondo-gard.sssom.tsv, I only care about exact mappings connected using the algorithm I mention above (exact/exact).

> But, it's certainly the case that the same MONDO ID can appear in multiple rows/mappings within the gard.sssom.tsv.

This should not be the case - if only exact mappings are taken into account.
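A small illustration of why the Mondo-side predicate matters for duplicates. The triples and function names below are hypothetical toy data, and the result says nothing about the real files, where exact-only duplicates would still need to be checked.

```python
from collections import Counter
from typing import List, Tuple

EXACT = 'skos:exactMatch'

# Hypothetical (subject, predicate, object) triples, for illustration only.
gard_to_proxy = [
    ('GARD:1', EXACT, 'OMIM:100'),
    ('GARD:2', EXACT, 'OMIM:200'),
]
proxy_to_mondo = [
    ('OMIM:100', EXACT, 'MONDO:1'),
    ('OMIM:200', 'skos:broadMatch', 'MONDO:1'),
]


def derive(require_exact_mondo: bool) -> List[Tuple[str, str]]:
    """Derive (GARD id, MONDO id) pairs through the proxy term.

    The GARD-side predicate must always be exact; the Mondo-side predicate
    is only checked when require_exact_mondo is True.
    """
    pairs = []
    for gard_id, gard_pred, proxy in gard_to_proxy:
        if gard_pred != EXACT:
            continue
        for proxy_id, mondo_pred, mondo_id in proxy_to_mondo:
            if proxy_id == proxy and (mondo_pred == EXACT or not require_exact_mondo):
                pairs.append((gard_id, mondo_id))
    return pairs


def duplicated_mondo_ids(pairs: List[Tuple[str, str]]) -> List[str]:
    """Return MONDO ids that appear in more than one derived mapping."""
    counts = Counter(mondo_id for _, mondo_id in pairs)
    return sorted(m for m, n in counts.items() if n > 1)


ignored = derive(require_exact_mondo=False)  # Mondo-side predicate ignored
strict = derive(require_exact_mondo=True)    # exact/exact only

print(duplicated_mondo_ids(ignored))
print(duplicated_mondo_ids(strict))
```

With the Mondo-side predicate ignored, both GARD terms land on the same MONDO ID in this toy data; requiring exact on both sides leaves a single mapping.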

joeflack4 commented 1 year ago

@matentzn Excellent answers. I'll have an update for you soon.

edit 2023/06/29: Just created an issue to possibly include non-exacts later:

joeflack4 commented 1 year ago

@matentzn I'm merging now. There are a few minor things I might do later and also I need to close #11, though I think it's done.

If you can look at my Jupyter notebook that'd be great. Mainly just observe that there are no duplicates per above discussion:

You can also look at the output files (look for the ones with -exact in the filename): https://github.com/monarch-initiative/gard/tree/main/output/analysis

I'm going to upload these into the spreadsheet now. The ones with -exact (keeps only skos:exactMatch) I uploaded and put on as the leftmost tabs. The ones without the filtering are towards the right.