monarch-initiative / mondo-ingest

Coordinating the mondo-ingest with external sources
https://monarch-initiative.github.io/mondo-ingest/

Exclusions: Some terms excluded more than once #128

Open joeflack4 opened 1 year ago

joeflack4 commented 1 year ago

Overview

I noticed that when I was generating the expanded exclusion files (`reports/*_exclusion_reasons.robot.template.tsv`, `reports/*_term_exclusions.txt`), some terms appeared in them more than once.

Because I'm a little tight on time this week, I didn't check every ontology.

Sources where this does happen

Each zip file contains 2 files: (i) an XLSX of the robot template, which is the expanded list of excluded terms and their reasons, along with counts of how many times each term appears in the list, and (ii) the unexpanded `config/*_exclusions.tsv`, which is used to generate that robot template.

Sources where this doesn't happen

Sources I haven't checked

Example

For OMIM, I found that a lot of terms are being excluded twice: once for being 'gene', and again for being 'nonDisease'.
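To make the kind of check described here concrete, below is a minimal sketch of how one might list terms that carry more than one distinct exclusion reason. It assumes the expanded list can be read as (term ID, exclusion reason) pairs; the rows, the term IDs, and the function names are hypothetical and are not taken from the actual report files.

```python
from collections import defaultdict

# Hypothetical (term_id, exclusion_reason) rows; the IDs are made up for illustration.
rows = [
    ("OMIM:000001", "gene"),
    ("OMIM:000001", "nonDisease"),
    ("OMIM:000002", "gene"),
]


def reasons_by_term(rows):
    """Group distinct exclusion reasons by term ID, keeping first-seen order."""
    grouped = defaultdict(list)
    for term_id, reason in rows:
        if reason not in grouped[term_id]:
            grouped[term_id].append(reason)
    return dict(grouped)


def multiply_excluded(rows):
    """Return only the terms that have more than one distinct exclusion reason."""
    return {term: reasons for term, reasons in reasons_by_term(rows).items() if len(reasons) > 1}


print(multiply_excluded(rows))
```

Note that this sketch treats repeated rows with the same reason as a single entry, so it only surfaces terms excluded for different reasons.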

Discussion

@matentzn @sabrinatoro Nico asked me to make an issue and tag Sabrina. I am aware that 'nonDisease' would be a parent of 'gene' in the exclusion hierarchy, and I guess we'd just want to include 1 entry, for 'gene'. But for cases where a term is excluded for two reasons, and neither exclusion reason is a descendant of the other, I think it makes sense for there to be two exclusions, no? Perhaps this is something that really doesn't happen in practice; I'm not sure.

I have very limited time on Mondo this week and next, but let me know if I should continue this analysis for the 3 sources I haven't checked.

sabrinatoro commented 1 year ago

In the case of the OMIM terms having exclusion reasons 'nonDisease' and 'gene', you are correct: 'gene' is sufficient because it is the more specific exclusion reason. Therefore only one ('gene') needs to be kept.

Generally, it is possible that a term can be excluded for different reasons, and therefore excluded terms could have more than one exclusion code.
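The rule discussed in this thread (keep the most specific reason when one reason is an ancestor of another, and keep both when the reasons are unrelated) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `PARENT` map, the function names, and the 'otherReason' label are hypothetical, and the only hierarchy fact taken from the thread is that 'nonDisease' is a parent of 'gene'.

```python
# Hypothetical child -> parent map of exclusion reasons.
# Only the 'gene' -> 'nonDisease' link comes from the discussion above.
PARENT = {"gene": "nonDisease"}


def ancestors(reason, parent=PARENT):
    """Collect every ancestor of a reason by walking the parent links upward."""
    found = set()
    while reason in parent:
        reason = parent[reason]
        if reason in found:  # stop if the map accidentally contains a cycle
            break
        found.add(reason)
    return found


def most_specific(reasons, parent=PARENT):
    """Drop any reason that is an ancestor of another reason given for the same term."""
    redundant = set()
    for reason in reasons:
        redundant |= ancestors(reason, parent)
    # dict.fromkeys removes exact duplicates while preserving order.
    return [r for r in dict.fromkeys(reasons) if r not in redundant]


print(most_specific(["gene", "nonDisease"]))   # 'nonDisease' is dropped
print(most_specific(["gene", "otherReason"]))  # unrelated reasons are both kept
```

With a rule like this, a term excluded for two unrelated reasons would still keep both exclusion codes, which matches the general case described above.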