Document interpretation of mappings in data analysis

sbello commented 1 year ago

In manual mappings we often make broad/narrow/close mappings. It would be good to document how these mappings could/should be interpreted.

Interpretation is dependent on the direction the analysis is being made from (subject or object as the input). Assuming the input is the object (HPO term in most/all the files), I would propose that an analysis pipeline should use:

broadMatch (HP broader than MP): treat the same as exact, since all MP terms will be relevant to the HP term
closeMatch (HP and MP are very similar but not exactly the same): give less weight than exact but more than narrow or related
narrowMatch (MP broader than HP): give less weight, since some or many of the MP terms will not be as relevant to the HP term
relatedMatch: give these the least weight, or allow the user to include or exclude them on demand
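A minimal sketch of how a pipeline might encode the weighting proposed above. The numeric weights, the function name `weight_mappings`, and the `EX:` identifiers are illustrative assumptions; only the relative ordering comes from the proposal. The column names `subject_id`, `predicate_id` and `object_id` follow SSSOM, with the MP term as subject and the HP term as object.

```python
# Illustrative weights only: the numbers are assumptions. The proposal fixes
# just the ordering: broad == exact > close > narrow > related.
PREDICATE_WEIGHTS = {
    "skos:exactMatch": 1.0,
    "skos:broadMatch": 1.0,    # HP broader than MP: treated the same as exact
    "skos:closeMatch": 0.8,
    "skos:narrowMatch": 0.5,   # MP broader than HP: down-weighted
    "skos:relatedMatch": 0.2,  # least weight, optionally excluded
}


def weight_mappings(mappings, include_related=True):
    """Return (subject_id, object_id, weight) tuples for known predicates."""
    weighted = []
    for m in mappings:
        predicate = m["predicate_id"]
        if predicate == "skos:relatedMatch" and not include_related:
            continue  # user chose to exclude related matches
        if predicate not in PREDICATE_WEIGHTS:
            continue  # skip predicates this sketch does not cover
        weighted.append(
            (m["subject_id"], m["object_id"], PREDICATE_WEIGHTS[predicate])
        )
    return weighted


example = [
    # Broad match taken from this issue: athymia vs. Aplasia/Hypoplasia of the thymus.
    {"subject_id": "MP:0000705", "predicate_id": "skos:broadMatch", "object_id": "HP:0010515"},
    # Hypothetical placeholder rows, not real mappings.
    {"subject_id": "EX:0000001", "predicate_id": "skos:closeMatch", "object_id": "EX:0000002"},
    {"subject_id": "EX:0000003", "predicate_id": "skos:relatedMatch", "object_id": "EX:0000004"},
]

for row in weight_mappings(example, include_related=False):
    print(row)
```

Keeping the weights in a single table makes it easy for a downstream user to substitute their own values.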

We should include some examples of each type of mapping in the documentation.

We should also give a general guideline of how "broad" broad is. Some examples of broad matches:

same anatomical entity or process but less specific phenotype:
  Aplasia/Hypoplasia of the thymus (HP:0010515) and athymia (MP:0000705)
  Abnormal intramembranous ossification (HP:0012790) and delayed intramembranous bone ossification (MP:0003420)
  Abnormal ethmoid bone morphology (HP:0430005) and small ethmoid bone (MP:0030303)
more specific anatomical entity or process:
  Abnormal reflex (HP:0031826) and limb grasping (MP:0001513)
  Decreased adipose tissue (HP:0040063) and decreased white adipose tissue amount (MP:0001783)
  Morphological abnormality of the inner ear (HP:0011390) and abnormal otic capsule morphology (MP:0000039)

The question I struggle with most is how far up the tree to make a broad/narrow match. For example, should 'Highly arched eyebrow' (HP:0002553) be mapped to anything in the MP? The closest term I can come up with is abnormal coat/hair morphology (MP:0000367), which seems to me too distant to be of use in analysis. Similarly, the closest match to Cranial nerve paralysis (HP:0006824) in the MP would be abnormal nervous system physiology (MP:0003633). Again, this seems too distant to be useful.

matentzn commented 1 year ago

@sbello I moved this issue to the SSSOM repo because it is of universal relevance in my opinion.

We can't immediately do something about this issue, I think, but we should add our own thoughts and insights as we stumble across them!

graybeal commented 1 year ago

I think this is universally relevant but is only meaningful very narrowly. The interpretation depends not only on the ontologies in question (in this example), but on the intention of the mapper, and most importantly the application/end user that is using the mapping.

Crude example: I say "woman" hasBroader "human". If my query is "find studies with women", should the response include studies that have "human" but not "woman"? I think it's the end user who knows the answer to that, not the person who made the mapping. (And it wouldn't be any different if I said "human" hasNarrower "woman".)
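The point that the two phrasings carry the same information can be sketched in code. This assumes "hasBroader" and "hasNarrower" correspond to `skos:broadMatch` and `skos:narrowMatch`, which SKOS defines as inverses of each other; the `EX:` identifiers and the `invert` helper are hypothetical.

```python
# SKOS declares broadMatch and narrowMatch as inverses; exactMatch, closeMatch
# and relatedMatch are symmetric, so they map to themselves.
INVERSE_PREDICATE = {
    "skos:broadMatch": "skos:narrowMatch",
    "skos:narrowMatch": "skos:broadMatch",
    "skos:exactMatch": "skos:exactMatch",
    "skos:closeMatch": "skos:closeMatch",
    "skos:relatedMatch": "skos:relatedMatch",
}


def invert(mapping):
    """Swap subject and object and replace the predicate with its inverse."""
    return {
        "subject_id": mapping["object_id"],
        "predicate_id": INVERSE_PREDICATE[mapping["predicate_id"]],
        "object_id": mapping["subject_id"],
    }


# "woman" hasBroader "human", written with hypothetical identifiers.
woman_to_human = {
    "subject_id": "EX:woman",
    "predicate_id": "skos:broadMatch",
    "object_id": "EX:human",
}

# The same statement seen from the other side: "human" hasNarrower "woman".
human_to_woman = invert(woman_to_human)
print(human_to_woman)
```

Because inversion loses nothing, whether a query for "woman" should also return "human" still has to be decided by whoever consumes the mapping.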

That's only one use case; there are many use cases, and the 'correct' answer is a function of the use case.

CloseMatch and relatedMatch are entirely subjective from the start, and suffer from the same "it depends on the user and the use case" impact on the result. I like the idea of more weight/less weight here, but in some cases I want any information that can be provided (so give me broad/narrow relations all the way up the tree), in other cases I want something I can be confident in (so I don't even want to use closeMatch, let alone any of the others).
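One way to leave that choice with the end user is to expose named profiles that select which predicates to use. The sketch below is an assumption-laden illustration: the profile names, the `select_mappings` helper and the `EX:` rows are all hypothetical.

```python
# Hypothetical profiles: each names the set of predicates a user will accept.
PROFILES = {
    # "Give me any information that can be provided."
    "exploratory": {
        "skos:exactMatch",
        "skos:closeMatch",
        "skos:broadMatch",
        "skos:narrowMatch",
        "skos:relatedMatch",
    },
    # "Give me something I can be confident in": not even closeMatch.
    "conservative": {"skos:exactMatch"},
}


def select_mappings(mappings, profile):
    """Keep only the mappings whose predicate is allowed by the profile."""
    allowed = PROFILES[profile]
    return [m for m in mappings if m["predicate_id"] in allowed]


# Placeholder rows, not real mappings.
rows = [
    {"subject_id": "EX:1", "predicate_id": "skos:exactMatch", "object_id": "EX:2"},
    {"subject_id": "EX:3", "predicate_id": "skos:closeMatch", "object_id": "EX:4"},
    {"subject_id": "EX:5", "predicate_id": "skos:broadMatch", "object_id": "EX:6"},
]

print(len(select_mappings(rows, "exploratory")))
print(len(select_mappings(rows, "conservative")))
```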

matentzn commented 1 year ago

@graybeal good to hear from you again! :)

> Crude example: I say "woman" hasBroader "human". If my query is "find studies with women", should the response include studies that have "human" but not "woman"? I think it's the end user who knows the answer to that, not the person who made the mapping. (And it wouldn't be any different if I said "human" hasNarrower "woman".)

I agree with you that which predicates should be applied, and how, depends on the use case. I think that is one of the things we want to document, in a more "tutorial"-like form rather than as SSSOM reference-level documentation. So maybe we should rephrase this issue a bit towards building up a guide for thinking about "Use-case specific application of mappings for data scientists".

So @sbello's problem would be just one scenario of many, that we should characterise.

@sbello re-reading your issue, it is clear that you are also struggling with a mix of concerns:

> The question I struggle with most is how far up the tree to make a broad/narrow match.

This is mixing the "use case" problem (how should the mapping be applied in a data analysis setting?) with the representation issue (which mappings should I include in the mapping set, and which not?).

So for you, I think the first step here is to characterise the target use case for your mapping set clearly. This will give you a clue as to "how far up to go for a broad match". Some use cases require you to go all the way up (faceted browsing on a website), while others do not tolerate going up more than a tiny bit ("give me the most similar term in the other ontology").
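As a rough illustration of "how far up to go", a mapper could cap the number of hops walked up the target hierarchy according to the use case. The toy hierarchy, the `EX:` identifiers and the `ancestors_within` helper below are hypothetical, and the hop limits are assumptions rather than recommended values.

```python
def ancestors_within(term, parents, max_hops=None):
    """Walk up the hierarchy breadth-first and return {ancestor: hops}.

    max_hops=None means no limit (go all the way to the root).
    """
    found = {}
    frontier = [term]
    hops = 0
    while frontier and (max_hops is None or hops < max_hops):
        hops += 1
        next_frontier = []
        for current in frontier:
            for parent in parents.get(current, ()):
                if parent not in found:
                    found[parent] = hops
                    next_frontier.append(parent)
        frontier = next_frontier
    return found


# Toy hierarchy with placeholder identifiers: leaf -> mid -> top -> root.
toy_parents = {
    "EX:leaf": ["EX:mid"],
    "EX:mid": ["EX:top"],
    "EX:top": ["EX:root"],
}

# Faceted browsing: every ancestor is a candidate for a broad match.
print(ancestors_within("EX:leaf", toy_parents))

# "Most similar term": only the immediate parent is considered.
print(ancestors_within("EX:leaf", toy_parents, max_hops=1))
```

A hop count is only a crude proxy for semantic distance, since one step can span very different amounts of meaning in different parts of an ontology.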

sbello commented 1 year ago

@matentzn The use case we have in mind is finding similar terms, so we don't want to go all the way up, but there is still a question, for me at least, of how far up is useful. But that could just be me overthinking things :)

matentzn commented 1 year ago

So in this case, we want to approximate a biological concept, "phenotypic similarity", which is not fully formalised, as you know. My sense is that if it's about similarity, not data grouping, then if the E is not "part of the same homology cluster" I would probably be cautious

mapping-commons / sssom

Document interpretation of mappings in data analysis #310