monarch-initiative / mondo-ingest

Coordinating the mondo-ingest with external sources
https://monarch-initiative.github.io/mondo-ingest/

alignment files including parents that are obsolete in Mondo #448

Open sabrinatoro opened 4 months ago

sabrinatoro commented 4 months ago

For example, in the slurp/ordo.tsv file, the parents reported for new terms include obsolete terms and terms that are too high-level.

These obsolete or too high-level terms should not be brought in as parents. Example from slurp/ordo.tsv:

MONDO:0958100 | autoinflammatory syndrome with acne and/or hidradenitis suppurativa | Orphanet:653434 | MONDO:equivalentTo | Autoinflammatory syndrome with acne and/or hidradenitis suppurativa |   |   | MONDO:8000033|MONDO:0017954|MONDO:0017370

Note: I understand that it might be problematic if only terms that have parents in Mondo are brought in. If that is the case, these terms can still be brought in, but the unwanted parents should not be included in the spreadsheet.
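To make the request concrete, here is a minimal sketch of the kind of filtering meant here. It assumes the parents are stored as a pipe-separated string of Mondo IDs, as in the row above. The function name `filter_parents` and the two exclusion sets are hypothetical and not part of mondo-ingest. The only "too high level" ID taken from this thread is MONDO:0000001 (disease); the obsolete ID is a made-up placeholder, since the real list would have to come from Mondo itself.

```python
from typing import Set


def filter_parents(parents_field: str, excluded: Set[str], sep: str = "|") -> str:
    """Drop excluded IDs from a pipe-separated parent field, keeping the original order."""
    kept = []
    for curie in parents_field.split(sep):
        curie = curie.strip()
        # Skip empty entries, excluded IDs, and duplicates.
        if curie and curie not in excluded and curie not in kept:
            kept.append(curie)
    return sep.join(kept)


# "disease", the too-high-level parent named in this thread.
TOO_HIGH_LEVEL = {"MONDO:0000001"}
# Hypothetical placeholder; the real set of obsolete IDs must be looked up in Mondo.
OBSOLETE = {"MONDO:9999999"}

row_parents = "MONDO:0000001|MONDO:9999999|MONDO:0017370"
print(filter_parents(row_parents, TOO_HIGH_LEVEL | OBSOLETE))
```

If every suggested parent is excluded, the function returns an empty string, so the term itself is still reported and only the parent cell is left blank.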

matentzn commented 4 months ago

@twhetzel this should probably be prioritised right after ICD11 and MedGen. I have lost track a bit of Joe's priorities by now, so I will leave it to you to fold this into the schedule.

sabrinatoro commented 4 months ago

> Can you provide us with a simple exclude list for "too high level" parents?

Currently, the "too high level" term that is consistently reported is MONDO:0000001 (disease), the highest term in the ontology (parent of "human disease" and "non-human animal disease").

> Can you provide a short statement on why "too high parents" are confusing?

"disease" as a parent is not specific enough to be useful. Every term should at least be in either the "human disease" branch or the "non-human animal disease" branch. From a curation perspective, if a term has the parent "disease" or even "human disease", we will have to review this term and find a more specific parent, minimally one of the "high-level classification" terms for human diseases (see list below).

Mondo ID term name
MONDO:0002409 auditory system disorder
MONDO:0002657 breast disorder
MONDO:0045024 cancer or benign tumor
MONDO:0004995 cardiovascular disorder
MONDO:0019040 chromosomal disorder
MONDO:0003900 connective tissue disorder
MONDO:0004335 digestive system disorder
MONDO:0021147 disorder of development or morphogenesis
MONDO:0002022 disorder of orbital region
MONDO:0024458 disorder of visual system
MONDO:0005151 endocrine system disorder
MONDO:0005570 hematologic disorder
MONDO:0003847 hereditary disease
MONDO:0043543 iatrogenic disease
MONDO:0700007 idiopathic disease
MONDO:0005046 immune system disorder
MONDO:0005550 infectious disease
MONDO:0021166 inflammatory disease
MONDO:0002051 integumentary system disorder
MONDO:0005066 metabolic disease
MONDO:0044970 mitochondrial disease
MONDO:0006858 mouth disorder
MONDO:0002081 musculoskeletal system disorder
MONDO:0005071 nervous system disorder
MONDO:0005137 nutritional disorder
MONDO:0700003 obstetric disorder
MONDO:0100366 occupational disorder
MONDO:0024623 otorhinolaryngologic disease
MONDO:0100086 perinatal disease
MONDO:0029000 poisoning
MONDO:0021669 post-infectious disorder
MONDO:0002025 psychiatric disorder
MONDO:0043459 radiation-induced disorder
MONDO:0005039 reproductive system disorder
MONDO:0005087 respiratory system disorder
MONDO:0002254 syndromic disease
MONDO:0043839 ulcer disease
MONDO:0044991 upper digestive tract disorder
MONDO:0002118 urinary system disorder

matentzn commented 4 months ago

> From a curation perspective, if a term has the parent "disease" or even "human disease", we will have to review this term and find a more specific parent, minimally one of the "high-level classification" terms for human diseases (see list below)

I think our SOP should really include a moment of pause here (this is exactly why I was asking). I personally hoped the "parent" was merely a suggestion that is always carefully reviewed during migration. This is why I was not originally worried about including very high-level parents: I knew someone would be looking at them anyway and would throw them out.

sabrinatoro commented 4 months ago

> I think our SOP should really include a moment of pause here (this is exactly why I was asking). I personally hoped the "parent" was merely a suggestion that is always carefully reviewed during migration. This is why I was not originally worried about including very high-level parents: I knew someone would be looking at them anyway and would throw them out.

I see where you are coming from, and I want to reassure you that a curator reviews the list of suggested parents before creating the new terms. In some cases, 5 parents are suggested (IDs separated by a pipe), so it is a lot of copy/paste and manual removal. It is easy to recognize MONDO:0000001 and remove it from the parents, so it is not a big issue. But since we will never add it as a parent, it is not useful to have it reported (though, again, it might not be worth the technical work to exclude it from the parent list).

matentzn commented 4 months ago

This makes sense now, thank you @sabrinatoro :)