Closed joeflack4 closed 1 year ago
@matentzn New release here with the new gard-sssom.mondo.tsv, also uploaded to the spreadsheet. I have some questions and comments (e.g. analysis results) that I'll bring to our Friday meeting.
This can be merged when it also fixes https://github.com/monarch-initiative/gard/issues/11
@matentzn No time to explain today. Will explain tomorrow. I made a notebook, etc. But if you are curious and really want to spend part of your weekend looking at this, take a look at these two files. (1) was derived from (2) using a tentative algorithm. Note that I was wrong about 178 things unmapped! There are actually only 3 unmapped GARD terms!
This is great news, fantastic @joeflack4 !
No duplicates?
With this data, I could finally do:
@matentzn That's good news! However, I would hold off until you have looked at gard-mondo_curation.sssom.tsv.zip.
Concerning duplicates, there shouldn't be any in gard.sssom.tsv.zip because it is 100% in my algorithmic control. gard-mondo_curation.sssom.tsv.zip is the first step and has about 6,000 duplicate rows (i.e. ~3,000 cases). Then, I use the following algorithm to remove the duplicates:
By GARD term, collect all of the proxy mappings found, grouped by mapping predicate
proxy_mappings: Dict[CURIE, Dict[MAPPING_PREDICATE, List[Dict]]] = {}
# Example:
# 'GARD:123': {
#     'skos:exactMatch': [...],   # list of rows of GARD->OMIM/ORDO->Mondo proxy matches
#     'skos:narrowMatch': [...],  # list of rows of GARD->OMIM/ORDO->Mondo proxy matches
#     ...
# }
preds = set(mappings_by_pred.keys())
Simple algorithm to pick mapping predicate
Basically, if any of the proxy mappings is a skos:exactMatch, we consider the GARD and Mondo terms to be mapped as skos:exactMatch. Else, if skos:narrowMatch exists, use that instead, and so on.
You can see that this algorithm is very naïve. @matentzn This is where I need help. I think this algorithm needs to be looked at by you and possibly replaced with a better one. Or, we need to go through those ~3,000 in gard-mondo_curation.sssom.tsv.zip and manually curate some/all of them.
pred = 'skos:exactMatch' if 'skos:exactMatch' in preds \
else 'skos:narrowMatch' if 'skos:narrowMatch' in preds \
else 'skos:broadMatch' if 'skos:broadMatch' in preds \
else 'skos:relatedMatch'
I also set up a placeholder for a more granular algorithm, but didn't implement any logic yet
# if preds == {'skos:narrowMatch', 'skos:exactMatch', 'skos:broadMatch'}:
# pass
# elif preds == {'skos:narrowMatch', 'skos:broadMatch'}:
# pass
# elif preds == {'skos:narrowMatch', 'skos:exactMatch'}:
# pass
# elif preds == {'skos:exactMatch', 'skos:broadMatch'}:
# pass
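For reference, the precedence rule above can also be written as a loop over an ordered list. This is only a sketch: `PREDICATE_PRECEDENCE`, `pick_predicate`, and the example rows are hypothetical names and data, not what is in the repo. It should behave the same as the chained conditional, including the fallback to skos:relatedMatch.

```python
from typing import Dict, List

# Precedence order taken from the chained conditional above:
# exact > narrow > broad > related.
PREDICATE_PRECEDENCE = [
    'skos:exactMatch',
    'skos:narrowMatch',
    'skos:broadMatch',
    'skos:relatedMatch',
]


def pick_predicate(mappings_by_pred: Dict[str, List[dict]]) -> str:
    """Return the highest-precedence predicate present, else skos:relatedMatch."""
    preds = set(mappings_by_pred.keys())
    for pred in PREDICATE_PRECEDENCE:
        if pred in preds:
            return pred
    # Same fallback as the original expression when nothing else matches.
    return 'skos:relatedMatch'


# Hypothetical proxy mappings for a single GARD term (illustrative only).
example = {
    'skos:narrowMatch': [{'subject_id': 'GARD:123', 'object_id': 'MONDO:456'}],
    'skos:exactMatch': [{'subject_id': 'GARD:123', 'object_id': 'MONDO:123'}],
}
chosen = pick_predicate(example)
print(chosen)  # skos:exactMatch
```

Keeping the order in one list would make it easier to change the precedence later, but it does not address the open question of whether this precedence is correct for the ~3,000 cases.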
I was going to discuss in more detail, but maybe this is sufficient for now. Take a look as well at this notebook: https://github.com/monarch-initiative/gard/blob/gard-mondo-sssom/analysis.ipynb
We can review your algorithm in more detail a bit later this week. For now, I only care about this case:
Given only this rule:
CONDITION1: MONDO:123-exact-ORDO|OMIM:321 (from mondo.sssom.tsv)
CONDITION2: GARD:123-exact-ORDO|OMIM:321 (from gard.sssom.tsv)
RULE---> MONDO:123-exact-GARD:123
Can I assume, with only that rule, that
i. all GARD ids in gard.sssom.tsv get a corresponding MONDO id
ii. no duplicate GARD ids result in the resulting mapping
iii. no duplicate MONDO ids result in the resulting mapping
?
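One way to test the rule and the three questions above, as a rough sketch. The rows are made up, the column names (`subject_id`, `predicate_id`, `object_id`) are assumed to follow the usual SSSOM layout, and `join_exact` / `check` are hypothetical helpers rather than code from the repo.

```python
from collections import Counter

EXACT = 'skos:exactMatch'

# Hypothetical rows; real data would come from mondo.sssom.tsv and gard.sssom.tsv.
mondo_rows = [
    {'subject_id': 'MONDO:123', 'predicate_id': EXACT, 'object_id': 'OMIM:321'},
    {'subject_id': 'MONDO:456', 'predicate_id': 'skos:broadMatch', 'object_id': 'ORDO:654'},
]
gard_rows = [
    {'subject_id': 'GARD:123', 'predicate_id': EXACT, 'object_id': 'OMIM:321'},
    {'subject_id': 'GARD:456', 'predicate_id': EXACT, 'object_id': 'ORDO:654'},
]


def join_exact(mondo_rows, gard_rows):
    """MONDO-exact-proxy plus GARD-exact-proxy gives MONDO-exact-GARD."""
    mondo_by_proxy = {}
    for row in mondo_rows:
        if row['predicate_id'] == EXACT:
            mondo_by_proxy.setdefault(row['object_id'], set()).add(row['subject_id'])
    result = set()
    for row in gard_rows:
        if row['predicate_id'] != EXACT:
            continue
        for mondo_id in mondo_by_proxy.get(row['object_id'], ()):
            result.add((mondo_id, EXACT, row['subject_id']))
    return sorted(result)


def check(result, gard_rows):
    """Report (i) unmapped GARD ids, (ii) repeated GARD ids, (iii) repeated MONDO ids."""
    gard_ids = {row['subject_id'] for row in gard_rows}
    gard_counts = Counter(gard_id for _, _, gard_id in result)
    mondo_counts = Counter(mondo_id for mondo_id, _, _ in result)
    return {
        'unmapped_gard': sorted(gard_ids - set(gard_counts)),
        'duplicate_gard': sorted(g for g, n in gard_counts.items() if n > 1),
        'duplicate_mondo': sorted(m for m, n in mondo_counts.items() if n > 1),
    }


mappings = join_exact(mondo_rows, gard_rows)
report = check(mappings, gard_rows)
print(mappings)
print(report)
```

In this toy data GARD:456 ends up unmapped because its only proxy is linked to Mondo with a non-exact predicate, so on real data the answer to (i) depends on how many GARD terms have an exact/exact path.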
@matentzn Actually, this could be an important point. I don't remember if this was my decision or something you suggested, but the predicate in mondo.sssom.tsv is ignored. So, if GARD:123-exact-ORDO|OMIM:321, and OMIM:321-something-MONDO:123, then the result is GARD:123-exact-MONDO:123.
But I can still answer your questions:
i. yes
ii. yes, but this is only because my algorithm picks just 1 of the mappings when there are more than 1, and I don't know if what it's doing is correct in all cases.
iii. Do you mean 'mapping' or 'mappings'? One mapping is just 1 row, so there can only be one object_id / MONDO ID. But, it's certainly the case that the same MONDO ID can appear in multiple rows/mappings within the gard.sssom.tsv.
the predicate in mondo.sssom.tsv is ignored. So, if GARD:123-exact-ORDO|OMIM:321, and OMIM:321-something-MONDO:123, then result is GARD:123-exact-MONDO:123.
Very important, the predicate_id should not be ignored! It must be "exact"!
ii. yes, but this is only because my algorithm picks just 1 of the mappings when there are more than 1, and I don't know if what it's doing is correct in all cases.
For now, I only care about the cases where both mappings are "exact". Everything else we can deal with at a later point. So it's cool that gard.sssom.tsv contains non-exact mappings, but for the mondo-gard.sssom.tsv, I only care about exact mappings connected using the algorithm I mention above (exact/exact).
But, it's certainly the case that the same MONDO ID can appear in multiple rows/mappings within the gard.sssom.tsv.
This should not be the case - if only exact mappings are taken into account.
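A quick way to check this on a mapping file, sketched with the standard library only. The sample content is invented, `read_rows` and `repeated_values` are hypothetical helpers, and the sketch assumes the file is tab-separated with an optional block of `#`-prefixed metadata lines at the top.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt; a real run would open gard.sssom.tsv instead.
SAMPLE_TSV = (
    "# mapping_set_id: example\n"
    "subject_id\tpredicate_id\tobject_id\n"
    "GARD:1\tskos:exactMatch\tMONDO:10\n"
    "GARD:2\tskos:narrowMatch\tMONDO:10\n"
    "GARD:3\tskos:exactMatch\tMONDO:30\n"
)


def read_rows(handle):
    """Parse tab-separated rows, skipping '#' metadata lines."""
    data_lines = (line for line in handle if not line.startswith('#'))
    return list(csv.DictReader(data_lines, delimiter='\t'))


def repeated_values(rows, column):
    """Return the values of `column` that occur in more than one row."""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


rows = read_rows(io.StringIO(SAMPLE_TSV))
exact_rows = [row for row in rows if row['predicate_id'] == 'skos:exactMatch']

all_dupes = repeated_values(rows, 'object_id')
exact_dupes = repeated_values(exact_rows, 'object_id')
print(all_dupes)    # ['MONDO:10']
print(exact_dupes)  # []
```

The toy data only illustrates the expectation; whether the exact-only subset of the real file is free of repeated MONDO IDs still has to be confirmed by running a check like this against it.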
@matentzn Excellent answers. I'll have an update for you soon.
edit 2023/06/29: Just created an issue to possibly include non-exacts later:
@matentzn I'm merging now. There are a few minor things I might do later and also I need to close #11, though I think it's done.
If you can look at my Jupyter notebook that'd be great. Mainly just observe that there are no duplicates per above discussion:
You can also look at the output files (look for the ones with -exact in the filename):
https://github.com/monarch-initiative/gard/tree/main/output/analysis
I'm going to upload these into the spreadsheet now. The ones with -exact (keeps only skos:exactMatch) I uploaded and placed as the leftmost tabs. The ones without the filtering are towards the right.
Updates