mapping-commons / sssom

Simple Standard for Sharing Ontology Mappings
https://mapping-commons.github.io/sssom/
BSD 3-Clause "New" or "Revised" License

Consider interoperation/ingest with Ontologymaps #222

Open mellybelly opened 2 years ago

mellybelly commented 2 years ago

Especially this entry, which we really need, focusing on ICD-O: https://progenetix.org/service-collection/ontologymaps/ @mbaudis

mcourtot commented 2 years ago

leaving a comment to watch this issue

mbaudis commented 2 years ago

See older discussions https://github.com/monarch-initiative/mondo/issues/1148#issuecomment-1013554769

Briefly: For progenetix.org we have done "a bit" of mapping between NCIt and ICD-O Morphology+Topography combinations. This mostly has been done to utilize the NCIt hierarchies in contrast to the unwieldy dual-arm ICD-O (which otherwise is rather well suited to capture cancer diagnoses ...) for the Progenetix cancer genomics resource (which contains >100k individual samples from literature etc.).

However, our work partially was based on older NCIt cancer core codes & more refinement has been done there. Also, this covers only some hundred codes & needs a systematic extension.

matentzn commented 2 years ago

@mbaudis thanks for your input! I am looking at the ontology maps API, and wondering if we should pull and republish your mappings in a suitable mapping commons in SSSOM format? I am wondering right now about mapping precision. The API returns a tuple with three ids, and I wonder how they are related. Are they all supposed to be mutually exact? Does the order in this tuple matter?

[
    {
        "id": "NCIT:C9383",
        "label": "Rectal Adenocarcinoma"
    },
    {
        "id": "pgx:icdom-81403",
        "label": "Adenocarcinoma, NOS"
    },
    {
        "id": "pgx:icdot-C20.9",
        "label": "Rectum, NOS"
    }
]

mbaudis commented 2 years ago

@matentzn Feel free - as a first step ... I'm really from a different area & haven't found time to work on mapping formalities etc. But there has been a veeeerrrryyy long need for these mappings & this now seems like a good opportunity to pick it up again.

Order: doesn't matter. It is basically (icdom+icdot) <=> NCIT.

Exclusivity: No, since we miss many mappings. Some entities do not exist in NCIT or not at the best corresponding level; for others, we just haven't done the best assessment w/ the newest NCIT. Relevant reviews had started just before COVID & stalled there (though the mapping service then came later).
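
Since order inside a term group carries no meaning, a consumer has to sort the three terms into roles by ID prefix. A minimal sketch of that, assuming the prefixes seen in the example above (`NCIT:`, `pgx:icdom-`, `pgx:icdot-`); the names `IcdoNcitGroup` and `split_term_group` are hypothetical and not part of the Progenetix service:

```python
from typing import Dict, List, NamedTuple


class IcdoNcitGroup(NamedTuple):
    """One (icdom + icdot) <=> NCIT correspondence with named roles."""
    ncit: str
    icdom: str
    icdot: str


# Prefixes as they appear in the example output quoted in this thread.
PREFIX_TO_FIELD = {"NCIT:": "ncit", "pgx:icdom-": "icdom", "pgx:icdot-": "icdot"}


def split_term_group(group: List[Dict[str, str]]) -> IcdoNcitGroup:
    """Assign each term of a group to its role by ID prefix, ignoring order."""
    slots: Dict[str, str] = {}
    for term in group:
        for prefix, field in PREFIX_TO_FIELD.items():
            if term["id"].startswith(prefix):
                if field in slots:
                    raise ValueError(f"more than one {field} term in group")
                slots[field] = term["id"]
                break
        else:
            # No known prefix matched this term.
            raise ValueError(f"unexpected ID prefix: {term['id']}")
    missing = set(PREFIX_TO_FIELD.values()) - set(slots)
    if missing:
        raise ValueError(f"group lacks {sorted(missing)}")
    return IcdoNcitGroup(**slots)


# The rectal adenocarcinoma group from above, deliberately shuffled.
shuffled = [
    {"id": "pgx:icdot-C20.9", "label": "Rectum, NOS"},
    {"id": "NCIT:C9383", "label": "Rectal Adenocarcinoma"},
    {"id": "pgx:icdom-81403", "label": "Adenocarcinoma, NOS"},
]
parsed = split_term_group(shuffled)
```

The function raises on anything that is not exactly one term per role, which is a deliberate choice for prototyping: incomplete groups surface immediately instead of being silently dropped.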

IMO it would be worth a real project to do this systematically - happy to help! And to learn how best to express such mappings in a formally correct way ¯\_(ツ)_/¯.

mbaudis commented 2 years ago

Also seems like a great opportunity to do this in conjunction with ICGC ARGO metadata work @mcourtot ?!

mbaudis commented 2 years ago

@matentzn ... and FYI all term groups for NCIT / ICD-O we have are in response.results.termGroups through https://progenetix.org/services/ontologymaps/?filters=NCIT,pgx:icdo&filterPrecision=start

(2022-08-26: fixed wrong icdo partial)
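
For anyone who wants to pull these term groups programmatically, here is a rough sketch using only the Python standard library. The URL and query parameters are the ones given above. The exact JSON layout is an assumption based on the path `response.results.termGroups`: the sketch accepts `results` either as a single object or as a list of objects, since the thread does not say which. The function names are hypothetical.

```python
import json
from typing import Any, Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://progenetix.org/services/ontologymaps/"


def build_url(filters: List[str], precision: str = "start") -> str:
    """Build the query URL; ',' and ':' are kept unescaped as in the thread."""
    query = urlencode(
        {"filters": ",".join(filters), "filterPrecision": precision},
        safe=",:",
    )
    return f"{BASE_URL}?{query}"


def extract_term_groups(payload: Dict[str, Any]) -> List[Any]:
    """Collect every termGroups entry found under response.results."""
    results = payload["response"]["results"]
    if isinstance(results, dict):  # assumption: may be one object or a list
        results = [results]
    groups: List[Any] = []
    for item in results:
        groups.extend(item.get("termGroups", []))
    return groups


def fetch_term_groups(filters: List[str]) -> List[Any]:
    """Query the service over HTTP and return the term groups."""
    with urlopen(build_url(filters)) as resp:
        return extract_term_groups(json.load(resp))


url = build_url(["NCIT", "pgx:icdo"])
```

Calling `fetch_term_groups(["NCIT", "pgx:icdo"])` would perform the actual request; it is left out of the sketch so the block has no network side effects.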

matentzn commented 2 years ago

@mbaudis sorry to be daft, could you elaborate what this endpoint provides? What is a term group?

mbaudis commented 2 years ago

@matentzn Not daft at all - this is just an ad hoc way to express equivalency of terms from different classification systems, w/o assuming a 1:1. I.e. for NCIt <=> ICD-O you will have two terms from the different ICD-O arms corresponding to a single NCIt term:

            {
              "id": "NCIT:C4017",
              "label": "Ductal Breast Carcinoma"
            },
            {
              "id": "pgx:icdom-85003",
              "label": "Infiltrating duct carcinoma, NOS"
            },
            {
              "id": "pgx:icdot-C50.4",
              "label": "Upper-outer quadrant of breast"
            }

Alternative mappings are expressed as separate groups, e.g. here (with a not-so-granular topography):

            {
              "id": "NCIT:C4017",
              "label": "Ductal Breast Carcinoma"
            },
            {
              "id": "pgx:icdom-85003",
              "label": "Infiltrating duct carcinoma, NOS"
            },
            {
              "id": "pgx:icdot-C50.9",
              "label": "Breast, NOS"
            }

For ICD-O T <=> UBERON there would just be 1:1 groups.

This is obviously an "internal format" and could be expressed much more systematically ...
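
To make the "alternative mappings as separate groups" idea concrete, the following sketch collects all ICD-O M+T pairs that share one NCIt term. It assumes the ID prefixes shown in the examples above and skips any group that is not an NCIT/icdom/icdot triple (for instance the 1:1 ICD-O T <=> UBERON groups); the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def alternatives_by_ncit(
    groups: Iterable[List[dict]],
) -> Dict[str, Set[Tuple[str, str]]]:
    """Map each NCIt ID to the set of (icdom, icdot) pairs grouped with it."""
    alternatives: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for group in groups:
        ids = [term["id"] for term in group]
        ncit = [i for i in ids if i.startswith("NCIT:")]
        icdom = [i for i in ids if i.startswith("pgx:icdom-")]
        icdot = [i for i in ids if i.startswith("pgx:icdot-")]
        # Only complete triples count; other group shapes are ignored here.
        if len(ncit) == len(icdom) == len(icdot) == 1:
            alternatives[ncit[0]].add((icdom[0], icdot[0]))
    return dict(alternatives)


# The two Ductal Breast Carcinoma groups quoted above.
groups = [
    [
        {"id": "NCIT:C4017", "label": "Ductal Breast Carcinoma"},
        {"id": "pgx:icdom-85003", "label": "Infiltrating duct carcinoma, NOS"},
        {"id": "pgx:icdot-C50.4", "label": "Upper-outer quadrant of breast"},
    ],
    [
        {"id": "NCIT:C4017", "label": "Ductal Breast Carcinoma"},
        {"id": "pgx:icdom-85003", "label": "Infiltrating duct carcinoma, NOS"},
        {"id": "pgx:icdot-C50.9", "label": "Breast, NOS"},
    ],
]
by_ncit = alternatives_by_ncit(groups)
```

Grouping this way makes the one-to-many nature visible: a single NCIt term can end up with several topography variants of the same morphology.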

CAVE: There is a lot of noise here. Some earlier systematic work on cleaning up mappings has been blurred by (A) new samples w/ diagnoses that were sometimes not properly adjusted, & (B) great advances in the NCIt cancer codes since we last did a bit of systematic work, in early 2020 ... Therefore this is mostly for prototyping - e.g. how to ingest this conceptually - and needs cleanup & extension.

Another point: There are many 1:1 mappings between single NCIt and ICD-O M(orphology) terms where then the ICD-O M+T doublets would have to list all topography options (e.g. "Adenocarcinoma").

matentzn commented 2 years ago

Awesome thanks, got it. One problem I see with simply converting your mappings is that the term groups do not capture semantic precision, without which we cannot guess the appropriate semantic mapping relation (a prerequisite for SSSOM). For example, NCIT:C4017 ("Ductal Breast Carcinoma") seems to be a broad match of Infiltrating duct carcinoma, NOS. Maybe I am mistaken. Would you be confident to assign all mappings the skos:exactMatch mapping relation, which means that both concepts mean the exact same thing?

Secondly, I think while icdot->Uberon is definitely sssom material, icdom->icdot is not really. We are getting into the realm of knowledge graphs there. But just to think this issue through to the end: what is the relationship between icdom and icdot terms that co-occur in the same term group?
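
As an illustration of why the predicate has to be decided per mapping, here is a sketch of how a simple 1:1 ICD-O T to UBERON mapping could be written as SSSOM TSV rows. The column names are SSSOM slots; the example pairing, its predicate and its justification are illustrative placeholders rather than values taken from any curated file, and the metadata header (curie map, license etc.) that a complete SSSOM file carries is omitted. The helper names are hypothetical.

```python
import csv
import io
from typing import Dict, List

SSSOM_COLUMNS = [
    "subject_id",
    "subject_label",
    "predicate_id",
    "object_id",
    "object_label",
    "mapping_justification",
]


def make_row(subject_id: str, subject_label: str, predicate_id: str,
             object_id: str, object_label: str,
             justification: str = "semapv:ManualMappingCuration") -> Dict[str, str]:
    """Build one mapping row; the predicate must be supplied explicitly."""
    return dict(zip(SSSOM_COLUMNS, [
        subject_id, subject_label, predicate_id,
        object_id, object_label, justification,
    ]))


def to_sssom_tsv(rows: List[Dict[str, str]]) -> str:
    """Serialize rows as a tab-separated table with an SSSOM column header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=SSSOM_COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Illustrative placeholder pairing; the predicate is exactly the open question.
row = make_row("pgx:icdot-C50.9", "Breast, NOS", "skos:exactMatch",
               "UBERON:0000310", "breast")
tsv = to_sssom_tsv([row])
```

Note that `predicate_id` has no default on purpose: a term group alone does not say whether the relation is exact, broad or narrow, so the converter cannot pick one.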

cmungall commented 2 years ago

The formal way to represent the ICDO tuples is OWL expressions of the form pgx:icdom-8500 and has-location some pgx:icdot-C50.9

We have a general ticket on post-composition of concepts in #108.

One approach we could take here is to create an OWL file that materializes these expressions. They could have IDs that are essentially concatenations. We would publish a simple 3 column DOSDP TSV. Users would need to join to get the relationship between NCIT and each ICDO axis. We could also materialize the join as SSSOM using predicates such as anatomic_aspect_has_exact_match, morphological_aspect_has_exact_match
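
A rough sketch of what this could look like in practice. The expression form and the two predicate names come from the comment above; everything else is an assumption made for illustration: the `~` separator for concatenated IDs, the column order of the 3-column table, and all function names.

```python
from typing import List, NamedTuple, Tuple


class Composite(NamedTuple):
    """A post-composed ICD-O concept: concatenated ID plus its two axes."""
    composite_id: str
    morphology: str
    topography: str


def compose(icdom: str, icdot: str, sep: str = "~") -> Composite:
    """Concatenate the morphology CURIE with the local part of the topography."""
    _, local_topography = icdot.split(":", 1)
    return Composite(f"{icdom}{sep}{local_topography}", icdom, icdot)


def owl_expression(c: Composite) -> str:
    """Render the class expression in the form given in the comment above."""
    return f"{c.morphology} and has-location some {c.topography}"


def dosdp_row(c: Composite) -> str:
    """One line of a 3-column TSV: composite ID, morphology, topography."""
    return "\t".join(c)


def axis_mappings(ncit: str, c: Composite) -> List[Tuple[str, str, str]]:
    """Materialize the join: one mapping from the NCIt term to each ICD-O axis."""
    return [
        (ncit, "morphological_aspect_has_exact_match", c.morphology),
        (ncit, "anatomic_aspect_has_exact_match", c.topography),
    ]


composite = compose("pgx:icdom-85003", "pgx:icdot-C50.9")
expression = owl_expression(composite)
joined = axis_mappings("NCIT:C4017", composite)
```

Keeping the composite table separate from the per-axis mappings mirrors the two options in the comment: users can either join against the 3-column table themselves or consume the already materialized per-axis rows.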

mbaudis commented 2 years ago

That's what I thought but wanted some confirmation ...

> One approach we could take here is to create an OWL file that materializes these expressions. They could have IDs that are essentially concatenations:

... but a question for me would be if something like an "Adenocarcinoma" w/o addtl. topographic information (NCIT:C2852 - Adenocarcinoma) which corresponds to ICD-O 8140/3 (pgx:icdom-81403) should be represented just as icdom-81403 or as icdom-81403~icdot-C80.9 (combining adenocarcinoma w/ the code for unknown primary site)[^1]?

I, preferably, would do a "complete" representation of ICD-O 3 that would both include all unique M & T codes as well as all sane pairs. I.e. all primary codes and all post-compositions.

But this is more of a question about how this should be done (from a non-ontologist). Precedents?

Also: Similarly expressed here...

[^1]: IMO that is different since it has information that the site isn't known...

mbaudis commented 2 years ago

@matentzn Regarding UBERON <-> ICD topographies: This had been done by @qingyao and is documented at https://github.com/progenetix/icdot2uberon. OBO file & score etc. available - so this should be usable...

mbaudis commented 2 years ago

I have created a map with concatenated codes which uses:

The file is hosted in our working byconeer repo which is a bit of a "procedurally maintained" place; please consider the table as test for further procedural discussions, not as a final product.

matentzn commented 2 years ago

@mbaudis As a representation like this is currently beyond the scope of SSSOM, we will need to circle back to this after https://github.com/mapping-commons/sssom/issues/108 and #36 are addressed in some way.

There are quite a few things to consider when folding composed expressions into any mapping vocabulary. Technically it's not hard (as evidenced by your use of ~ in your mappings), but socio-technically it is not at all straightforward, because we need to ensure the sssom extension is general enough to cover all future cases of complex mappings. This is tough, because no one can foresee all possible variations, but see #108 for an idea using template expressions.