phenopackets / phenopacket-schema

Repository for the GA4GH phenopacket schema
https://phenopacket-schema.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
78 stars 30 forks source link

Discuss separation of positive from negative phentoypes #47

Closed drseb closed 5 years ago

drseb commented 5 years ago

@cmungall we had discussions on that in the context of GO and HPO. We decided that it makes sense to clearly separate positive from negative annotations, because it is otherwise likely that users will just iterate the list of phenotypes and forget to check if the phenotype is actually negated. This is why we have separate annotation files. It seems to me that this danger exists in the current phenopacket schema as well. https://phenopackets-schema.readthedocs.io/en/latest/phenotype.html#phenotype-element I.e. users will iterate through the list of phenotypes and would have to check for each phenotype if negated is true. I would favour a solution where users have to iterate the positive and the negative phenotypes separately. Opinions? Is this doable?

pnrobinson commented 5 years ago

@drseb I added some Java code that shows how to easily separate positive and negative annotations using the current structure (to the read the docs, see the bottom of the spherocytosis example in the PR). I would guess that for everybody who forgets to check whether the phenotypes are negative there would be people who forget to check that the phenopacket has two lists. I do not feel strongly about which way to go and would leave it to @cmungall and @julesjacobsen and you, but I do think that we need to massively improve the quality of our documentation (which was not that great for the initial years of the HPO project and still could be much improved :-0 ).

cmungall commented 5 years ago

one way to make it idiotproof is to remove the negated flag and have two lists here:

https://github.com/phenopackets/phenopacket-schema/blob/master/src/main/proto/org/phenopackets/schema/v1/core/base.proto#L160

But given that we have a severity modifier (which seems to allow 3 different ontologies which seems a problem) then we have an odd discontinuation between "very very mild" and "negated". Also we need to state how negation composes with severity. If I combine negation with "moderate" then logically I am stating that a positive severe phenotype is a possibility. The safest thing here may be a PresentPhenotype and AbsentPhenotype class, with severity only a field in the former, but this feels ugly, esp without polymorphism.

drseb commented 5 years ago

@pnrobinson:

I would guess that for everybody who forgets to check whether the phenotypes are negative there would be people who forget to check that the phenopacket has two lists.

With the difference, that it has much worse consequences to assume a negated phenotype as a positive phenotype than to forget the negated list.

I do not feel strongly about which way to go

I somehow do have strong feelings about this. I think it is really a non-optimal design decision.

@cmungall:

we have an odd discontinuation between "very very mild" and "negated".

Wait. "Negated" belongs to the frequency-modifier (for annotating diseases). "mild" belongs to the severity domain (for patients and also diseases). They are not really 100% part of one spectrum

one way to make it idiotproof is to remove the negated flag and have two lists here:

this is exactly what I am suggesting since several weeks.

pnrobinson commented 5 years ago

One general problem with frequency modifiers is that we can write a positive annotation with the frequency 0/17, meaning that a feature was not observed in 17 patients, or we can write a negative annotation with NOT (which has less information). This is important for the HPO annotation files, but not as much for phenopackets which are intended to be for individual patients and thus either you have something or you don't.

julesjacobsen commented 5 years ago

I think a single list of Phenotype is more succinct and makes validation a little easier - all the phenotpyes are in a single place at least. It could be easy to add the same phenotype to both present and absent lists. The problem with two lists is that people could equally incorrectly add everything to one list or another. The scope for silly errors is endless so I'm not completly convinced by the safety argument.

I think we could use clearer wording perhaps - 'negated' in the case of a patient phenotype means 'absent' which is easier to imply the meaning. This might not work in a more general sense, but then this is specifically for phenotype, so might not be an issue.

drseb commented 5 years ago

I think it is a suboptimal design and I am looking forward to how a solution to this will be found or if the decision is now to keep it with this current approach.

pnrobinson commented 5 years ago

I would tend to agree with Jules that the scope for silly errors is endless. On the other hand, I think Seb did see at least one group, maybe more, not notice negative annotations. When was this Seb? What kind of groups were they? It does seem that we are now moving up the sophistication ladder (from TSV to json/PB) and that it will simply require a little more thought to use phenopackets, meaning that nobody can just slap together a perl script like we did on the good old days of the 2000s. I think this is good because it will force people to RTFM. I would say the best way to prevent errors is to really improve our documentation and show many examples.

drseb commented 5 years ago

On the other hand, I think Seb did see at least one group, maybe more, not notice negative annotations. When was this Seb?

I think I have enough experience and have seen enough other people not paying attention to a negation flag. Be reminded (if it isn't enough that I think this separation of annotations is necessary) that we learned this from GO, where @cmungall suggested to me to separate out the NOT annotations for HPOA. I find it really strange that the concerns about the consequences of interchanging a positive with a negative remain unheard, but I tried my best.

Thanks for your time and consideration.

mellybelly commented 5 years ago

Guys, I really think there should be two lists. Having them together will absolutely result in incorrect usage; I don't see the small advantage of having a single file, and it is easy enough to combine after the fact. If something can go wrong, it will, and it is critical that we avoid this.

julesjacobsen commented 5 years ago

A counter example is that FHIR represents negated phenotypes in the same list as the observed, just like us. One pro point is that any given phenotype is expressly marked as negated so can't be misinterpreted when taken out of context. Putting them in different lists makes the phenotype contextual, which could lead to the same misinterpretation error.

drseb commented 5 years ago

And FHIR is never wrong!?

cmungall commented 5 years ago

I have a lot of experience working with developers of varying degrees of expertise consuming data from a variety of formats. I can tell that the error of ignoring fields is very common among all abilities. Many people will not read the specification. They will look at a handful of examples of json or code and make assumptions.

Yes, we cannot guard against all possible errors, but anecdotally I can tell you this particular type of error is likely to be common. And as Seb says this class of error can have serious consequences. However, I agree there are other errors we can't fully guard against, especially when it comes to ontologies.

I completely agree with Jules that guarding against this class of error has negative consequences of other parts of the design, and will complicate some code.

I can still go either way, but I just want to make sure that as a group we are fully aware of the consequences of whatever choice we make. If we keep negation as a field then we should include very prominent warnings in the RTDs and elsewhere to help guard against.