monarch-initiative / SEPIO-ontology

Ontology for representing scientific evidence and provenance information

ClinGen cmap: assessment strength, outcome, confidence clarification #7

Open larrybabb opened 7 years ago

larrybabb commented 7 years ago

@mbrush Can we dig into the strength, confidence and outcome associations that have been drafted on the 3-15-17 ClinGen cmap at our next Tuesday meeting?

I'd like to sort this out. 1) Is confidence a generalizable association? If so, should it hang off of evidence lines, assertions or data, or some combination? 2) The outcome and strength associations were created by ClinGen originally as a single "outcome" concept to capture the assessor's evaluation of whether a particular ACMG rule/criterion was met/unmet (insufficient)/refuting and, if met, the strength and direction (e.g. strong path, very strong path, supporting benign, moderate benign, etc.) of that particular rule and associated data based on that assessor's objectivity.

We discussed putting the strength on the evidence which is the product of the criterion assessment, but wouldn't that be yet another assertion and would it be done independently of the criterion assessment?

I suppose we can separate the strength assessment from the criterion/data evaluation/assessment (i.e. outcome) but I'm not sure I want to add "specialized" strength codes to evidence lines. It seems more logical to capture this with the criterion assessment since that ends up being a unique form of data that can be used as evidence when making an interpretation.

Maybe the "confidence" (which seems more generic) should be the concept which is associated to evidence. It seems that this is something that all users of evidence would want to be able to make on their own, not relying on a third party to assess. Unless of course someone wanted to use someone else's evidence line as supporting data for their own evidence. Even so, the agent that defines the evidence line should (IMO) be the one that owns the confidence call of that evidence line.

I'll defer to your judgement to help sort this out and settle this for our near term goals.

We do need another pass at this to get it in a draft state that we can start documenting (IMO).

mbrush commented 7 years ago

These questions touch directly or indirectly on four aspects of CriterionAssessments (CAs) and the evidence they provide for target VariantInterpretations (VIs), which we have identified as important in our previous discussions. Summarizing below my recollection of the conclusions we came to for each of these.

1 - CA Outcome: First and foremost, we want to capture explicitly whether the 'assessor' who made the CA believed the assessed variant met the relevant ACMG Criterion. This information would hang as a cg:outcome of the CA. The working value set for this attribute is something like:

2 - Assertion Confidence: Second, we may want to enable description of the confidence an agent has that an assertion such as a CA is valid. At present such confidence information is not captured in the ClinGen data and there is no enumerated value set defined here. It was agreed that this 'confidence' should hang as an attribute of the CA itself, as it is a quality of the CA that is independent of how the assertion is used as evidence.

Notably, the hedging of VariantInterpretation calls as "likely" represents a confidence assessment of sorts - so we should consider alignment of how this is modeled when/if we model assertion confidence more generally.

3 - Evidence Strength: Third, we want to capture the strength of the evidence that some agent believed the CA provided for a target VI. The value set for these levels of strength maps to the following terms as defined by the ACMG guidelines:

This "strength" is not an intrinsic property of the CA itself - but rather a property of the use of the CA as evidence. This 'use as evidence' is exactly what is captured by the reified EvidenceLine object in our model, and thus we hang the 'evidence strength' assessment from an EvidenceLine.

4 - Evidence Direction: Finally, and as discussed in #8 and #9, we want to capture the 'direction' of evidence a particular CA provides w.r.t. its target assertion - i.e. whether it is supporting, refuting, or insufficient. We concurred that this too is a property of the EvidenceLine. Open questions here are: (1) how an 'insufficient' strength relates to 'inconclusive' direction - i.e. whether these are always paired; and (2) if we should encode direction as an attribute of an EvidenceLine (e.g. <:EvLn001 has_direction 'Supporting'>), or in the relation used to link the target assertion to its evidence line (e.g. <:VarInterp001 has_supporting_evidence :EvLn001>, <:VarInterp001 has_refuting_evidence :EvLn002>)
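A minimal Python sketch of where the four attributes above would hang. The class and field names (`CriterionAssessment`, `EvidenceLine`, `outcome`, `confidence`, `strength`, `direction`) are hypothetical and not taken from the SEPIO OWL file or the ClinGen model, and the string values are placeholders because the value sets are still under discussion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CriterionAssessment:
    """An assertion about whether a variant met an ACMG criterion."""
    criterion: str                    # illustrative criterion code, e.g. "PS1"
    outcome: str                      # 1 - CA Outcome (placeholder value)
    confidence: Optional[str] = None  # 2 - Assertion Confidence (no value set yet)


@dataclass
class EvidenceLine:
    """Reifies the use of a CA as evidence for a target interpretation."""
    supporting_information: CriterionAssessment
    strength: Optional[str] = None    # 3 - Evidence Strength
    direction: Optional[str] = None   # 4 - Evidence Direction


# Outcome and confidence live on the CA; strength and direction on the line.
ca = CriterionAssessment(criterion="PS1", outcome="met")
line = EvidenceLine(supporting_information=ca, strength="strong",
                    direction="supporting")
```

Under this layout the same CA could be reused by a second `EvidenceLine` with a different strength or direction without touching the CA itself.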


@larrybabb - the characterization above should serve to answer your question about the determination of the evidence strength as being a separate assertion that is done independently of the CriterionAssessment.

"We discussed putting the strength on the evidence which is the product of the criterion assessment, but wouldn't that be yet another assertion and would it be done independently of the criterion assessment?"

This is indeed the case - the strength assessment is a meta-assertion of sorts -- being about the strength of evidence provided by the CA for the VI, and falling outside of the critical path of our model. This meta-level assertion is one we agreed we did not want to explicitly represent as an assertion object in our model - that it was enough to represent the outcome of this assertion by hanging a strength attribute from the EvidenceLine (and optionally attributing the agent who made it).


One final clarification to make, prompted by your statement above that you see CriterionAssessments as

"a unique form of data that can be used as evidence when making an interpretation".

I would say that we are actively trying to move away from this idea that a CA is some "unique form of data". In fitting CriterionAssessments into the SEPIO model - we want CAs to align with a domain-independent and generalizable notion of an Assertion. SEPIO would treat CAs as a subtype of Assertion - specifically one whose creation is guided by an ACMG criterion.

In keeping with this alignment, we had agreed to separate the information conflated in the overloaded CA objects (i.e. an assertion as to whether the criterion was met, plus an assessment of the strength of evidence this provides). We would separate this information as outlined above, such that the former hangs from an Assertion, and the latter hangs from an EvidenceLine which captures how this Assertion gets used as evidence.
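The separation described above could be sketched as a small function that splits one overloaded record into its two parts. The record layout and key names (`id`, `criterion`, `outcome`, `strength`) are assumptions made for illustration, not the actual ClinGen serialization.

```python
from typing import Any, Dict, Tuple


def split_overloaded_ca(record: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a conflated CA record into an assertion and an evidence line."""
    # Whether the criterion was met stays with the assertion.
    assertion = {
        "id": record["id"],
        "criterion": record["criterion"],
        "outcome": record["outcome"],
    }
    # The strength moves to the evidence line that uses the assertion.
    evidence_line = {
        "supporting_information": record["id"],
        "strength": record.get("strength"),
    }
    return assertion, evidence_line


assertion, evidence_line = split_overloaded_ca(
    {"id": "CA001", "criterion": "PS1", "outcome": "met", "strength": "strong"}
)
```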

Hoping this helps, and that my recollection of our past discussions is accurate.

bpow commented 7 years ago

That's a very helpful summary, @mbrush. I just wanted to add a few clarifying points (or maybe highlight some things that may be counter-intuitive to someone reading the model for the first time).

The first is that the evidence strength would generally be captured by the analyst making the criterion assessment-- this is a bit strange since we (at least to some extent) consider EvidenceLine to be a reification of the linkage between an assertion and supporting data/assertions. So as we've discussed, it seems a bit strange that an EvidenceLine can exist (and have a strength of evidence) when there is not yet an overlying assertion (e.g., VariantInterpretation) that it can be associated with. We decided that there is some apparent implicit VariantInterpretation there, and basically that we won't worry about this as a practical matter, but it may be something to call out in documentation somewhere.

The second comment is maybe a way of continuing the discussion about "has_evidence", "has_information" and the related terms giving directions (vs. just using these parent terms and having direction indicated in a property of the target entity).

I understand where you are coming from in wanting to be able to query a graph/tuple store for these terms without having to traverse the graph to the target entity to find direction. On the other hand, those of us in the relational database-ish world balk a bit at what amounts to adding additional properties to the Assertion and EvidenceLine types-- where only one or two of the properties would be non-null for any given relationship (we could use both "has_evidence" and "has_supporting_evidence" to describe a relationship, but we would not use both of these and also "has_refuting_evidence"...). So, sure, this is an implementation issue, but one that happens early in the implementation process.

We would also have to be careful about how we specify how those terms relate between evidence and information... For assertion A, evidence line B, information C. I would think that we would represent something being refuted as:

--- A has_refuting_evidence B
--- B has_refuting_information C

However, someone could also argue that we should represent this as:

--- A has_refuting_evidence B
--- B has_supporting_information C

Since C supports B in refuting A. I don't think the latter is what you intend, but either way it should be made crystal clear in the specification.

Sorry if this is hijacking an issue that would otherwise be closed, feel free to open this up as another if you think this is not really related to @larrybabb 's question (or we can discuss and address on a subsequent call).

cbizon commented 7 years ago

A couple of minor points:

1) In the VCI (the only current implementation of this), the Interpretation will never be implicit. Users begin by creating the interpretation and then go about adding evidence to it. So adding an evidence line will not require even the fiction of an implicit interpretation (though future implementations might).

2) The fact that both the property from assertion to evidence line and evidence line to information have direction indicates that this direction is really a property of the evidence line only, especially if they must always point in the same direction. That may be inconvenient for other reasons, of course, but feels the most natural to me.

mbrush commented 7 years ago

@bpow re: the idea of capturing evidence direction using three properties (has_supporting_evidence, has_refuting_evidence, has_inconclusive_evidence)

"On the other hand, those of us in the relational database-ish world balk a bit at what amounts to adding additional properties to the Assertion and EvidenceLine types-- where only one or two of the properties would be non-null for any given relationship"

If I understand correctly, any problems that "additional properties" pose for a relational schema really depend on the model the schema implements. If the schema implements these properties as separate columns/elements in a table, then it is indeed true that undesirable bloat would be introduced - as only one of these three would be non-null for a given Assertion-EvidenceLine pair. If however the schema defines a single column/element that is used to capture which of the three possible 'direction' properties applies to a given Assertion-EvidenceLine pair - then no null values or data bloat would be created. Caveat to my argument here is that I am not versed at all in relational database design - so please tell me if this is nonsensical.
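As a rough illustration of the single-column option, here is a sketch using Python's built-in `sqlite3` module. The table and column names (`assertion_evidence`, `direction`) and the lowercase direction values are assumptions for this example; only the identifiers `VarInterp001`, `EvLn001` and `EvLn002` come from the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per Assertion-EvidenceLine pair; a single 'direction' column
# records which of the three properties applies, so nothing is left null.
conn.executescript("""
CREATE TABLE assertion_evidence (
    assertion_id     TEXT NOT NULL,
    evidence_line_id TEXT NOT NULL,
    direction        TEXT NOT NULL
        CHECK (direction IN ('supporting', 'refuting', 'inconclusive')),
    PRIMARY KEY (assertion_id, evidence_line_id)
);
""")

conn.executemany(
    "INSERT INTO assertion_evidence VALUES (?, ?, ?)",
    [("VarInterp001", "EvLn001", "supporting"),
     ("VarInterp001", "EvLn002", "refuting")],
)

# Look up the refuting evidence lines for one interpretation.
refuting = conn.execute(
    "SELECT evidence_line_id FROM assertion_evidence "
    "WHERE assertion_id = ? AND direction = ?",
    ("VarInterp001", "refuting"),
).fetchall()
```

The three-column alternative would instead carry two empty cells per row, which is the bloat described above.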

mbrush commented 7 years ago

@bpow regarding:

"We would also have to be careful about how we specify how those terms relate between evidence and information... For assertion A, evidence line B, information C. I would think that we would represent something being refuted as:

--- A has_refuting_evidence B
--- B has_refuting_information C

However, someone could also argue that we should represent this as:

--- A has_refuting_evidence B
--- B has_supporting_information C"

The second pattern you describe above is actually what SEPIO prescribes. SEPIO provides only one property (has_supporting_information) for linking an EvidenceLine to the information that 'comprises' or 'supports' it. This information captured under an EvidenceLine is always considered 'supporting' with respect to the EvidenceLine, even if it is refuting with respect to the Assertion that the EvidenceLine supports. In such a case, we would use the second pattern you lay out above to represent things (i.e. A has_refuting_evidence B, and B has_supporting_information C).

In plain language, we would describe this scenario something like "Information C supports/comprises EvidenceLine B, where EvidenceLine B refutes Assertion A". By contrast, the first pattern above says that Information C is refuting for EvidenceLine B, where EvidenceLine B is refuting for Assertion A - this introduces a double negative that implies the wrong thing (i.e. that information C would actually support Assertion A, which is not the case).

@cbizon I think we are in agreement on the second point you make in your comment above - i.e. we only need to specify 'direction' on the EvidenceLine. In short, we would have three potential properties for linking Assertions to EvidenceLines (has_supporting_evidence, has_refuting_evidence, has_inconclusive_evidence), and only one property for linking an EvidenceLine to its underlying information (has_supporting_information).
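The pattern described above can be sketched with plain Python tuples standing in for triples. The property names are the ones used in this thread; the helper `information_direction` and the lowercase direction labels are hypothetical, introduced only for illustration.

```python
# Triples for the refuting case: direction is stated once, on the
# Assertion-to-EvidenceLine link; the EvidenceLine-to-Information link
# is always has_supporting_information.
triples = {
    ("A", "has_refuting_evidence", "B"),
    ("B", "has_supporting_information", "C"),
}

DIRECTION_BY_PROPERTY = {
    "has_supporting_evidence": "supporting",
    "has_refuting_evidence": "refuting",
    "has_inconclusive_evidence": "inconclusive",
}


def information_direction(triples, assertion, information):
    """Return the direction of `information` w.r.t. `assertion`, or None."""
    for subject, prop, line in triples:
        if subject == assertion and prop in DIRECTION_BY_PROPERTY:
            # Follow the single EvidenceLine-to-Information property.
            if (line, "has_supporting_information", information) in triples:
                return DIRECTION_BY_PROPERTY[prop]
    return None


direction_of_c = information_direction(triples, "A", "C")
```

Because the second hop carries no direction of its own, the direction of C with respect to A is read off the first hop alone, which avoids the double negative.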

larrybabb commented 7 years ago

@cbizon @mbrush @bpow I'm in the process of cleaning up the examples for ClinGen. I am focused on filling in all the assertion outcomes and subsequent evidence strengths.

For assertion outcomes I have 3 controlled terms: MET, REFUTE and INSUFFICIENT.

So, what (if anything) do I put as "evidence strength" when the underlying assertion outcome is REFUTE or INSUFFICIENT?
Is this always blank or do I make a REFUTE and INCONCLUSIVE/INSUFFICIENT strength term to describe the evidence that is built on Refuting and Insufficient assertions, respectively?

cbizon commented 7 years ago

When the underlying assertion outcome is REFUTE or INSUFFICIENT, that assertion is not taken into account in the calculation of pathogenicity if the algorithm in the ACMG paper is being followed (e.g. if you are using the pathogenicity calculator).

We clearly need to represent that case, so we can be explicit or implicit about it. Implicit would be leaving evidence strength blank; explicit would be making a new term like "No support" and making the evidence strength that term.

I prefer the explicit approach.
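The two options can be compared in a short Python sketch. The function name `evidence_strength` and its parameters are hypothetical; "No support" is only the term floated above, not an established code, and the outcome terms are the ones used in this thread.

```python
from typing import Optional

NO_SUPPORT = "No support"  # proposed term from this thread, not an agreed code


def evidence_strength(outcome: str, met_strength: Optional[str],
                      explicit: bool = True) -> Optional[str]:
    """Pick the strength to record on an evidence line for a CA outcome."""
    if outcome == "MET":
        return met_strength
    if outcome in ("REFUTE", "INSUFFICIENT"):
        # Explicit: record a dedicated term. Implicit: leave it blank (None).
        return NO_SUPPORT if explicit else None
    raise ValueError(f"unknown outcome: {outcome!r}")


explicit_value = evidence_strength("INSUFFICIENT", None)
implicit_value = evidence_strength("INSUFFICIENT", None, explicit=False)
```

With the explicit option a consumer can tell "assessed, contributes nothing" apart from "strength never recorded", which the blank value cannot express.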

larrybabb commented 7 years ago

We need to discuss this in our next meeting. Heidi, Tristan, Marina, Steven and I all discussed the practicality of capturing Insufficient and Refuting versus simply calling it all “not met” in the meeting this morning. I think we may need to step back and re-evaluate our approach to simplify and possibly remove these finer types of outcomes. More to come.

mbrush commented 7 years ago

I think I agree with Chris' point that from the ClinGen/ACMG perspective, it is not critical to assign strength in these cases. The ACMG guidelines capture strength of evidence only in cases where the criterion is 'met' and the 'class' of the criteria (P vs B) is the same as the call made in the variant interpretation. The ACMG framework doesn’t consider the 'strength' of evidence that is 'inconclusive' or 'refuting'.

From SEPIO's perspective, we don't typically capture assertion of evidence strength, so we have no strong feelings on if or how ClinGen implements their rules here. In cases where a criterion outcome is 'insufficient', the evidence provided has no direction (i.e. is not supporting or refuting). So here, a strength value of 'no support' or 'n/a' as Chris suggested would seem reasonable.

However, in cases where the criterion is 'refuted' there is a direction to the evidence that the CA provides, and thus strength becomes relevant. It wouldn’t be right to say 'no support' in this case. But assigning a particular strength value in these cases is a very tricky proposition, and one I would not advocate worrying about in this round of modeling. It sounds like you may be hedging away from distinguishing 'unmet-insufficient data' from 'unmet-refuting' anyway, so the point may be moot.

If the issue does become relevant, I started a matrix to describe all combinations here, which may help us to consider how to assign direction and strength for the trickier cases.

bpow commented 7 years ago

Working through the backlog of emails, I would like to make a few comments as I am thinking of them.

  1. Regarding the relational storage of has_supporting_evidence, has_refuting_evidence, etc-- @mbrush's suggestion would of course work, but does require a more "active" processing of the data as it goes into relational storage (i.e., a more generic "take this json and store it by a fixed mapping" wouldn't work). That's not horrible, but something to consider.
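To make the "active" processing step concrete, here is a small Python sketch that turns a JSON document keyed by the three direction properties into rows for a single-direction-column table. The JSON shape (an `id` field plus one list per property) and the function name `to_rows` are assumptions for illustration, not the ClinGen interchange format.

```python
import json

DIRECTION_PROPERTIES = {
    "has_supporting_evidence": "supporting",
    "has_refuting_evidence": "refuting",
    "has_inconclusive_evidence": "inconclusive",
}


def to_rows(assertion_json: str):
    """Map direction-bearing properties to (assertion, line, direction) rows."""
    doc = json.loads(assertion_json)
    rows = []
    for prop, direction in DIRECTION_PROPERTIES.items():
        # The property name is translated into a column value, which is
        # the step a fixed key-to-column mapping could not perform.
        for line_id in doc.get(prop, []):
            rows.append((doc["id"], line_id, direction))
    return rows


rows = to_rows(
    '{"id": "VarInterp001",'
    ' "has_supporting_evidence": ["EvLn001"],'
    ' "has_refuting_evidence": ["EvLn002"]}'
)
```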

And a few semantic points:

  1. The clarification that there is only one allowable connection for evidence line to information (née data) is helpful. Would you consider (as long as you are changing "data" to "information") renaming this to has_information or has_relevant_information or something like that instead of has_supporting_information? I think the use of "supporting" here has the potential to cause confusion since it is used in the analogous link from Assertion to EvidenceLine to represent direction, while the link from EvidenceLine to Information does not have direction.

  2. For consistency's sake, we should re-think the enumeration REFUTE, INSUFFICIENT, and MET. REFUTE is the infinitive or present-tense form of a verb, while INSUFFICIENT is an adjective and MET is the past-tense or past-participle form of a verb (but probably appropriate to use as an adjective). I think it makes the most sense to stick with an adjective or past-participle here.

mbrush commented 7 years ago

Just posing a devil's advocate perspective for posterity here, w.r.t. the suggestion that:

Would you consider (as long as you are changing "data" to "information") renaming this to has_information or has_relevant_information or something like that instead of has_supporting_information? I think the use of "supporting" here has the potential to cause confusion since it is used in the analogous link from Assertion to EvidenceLine to represent direction, while the link from EvidenceLine to Information does not have direction.

One could argue that use of 'supporting' in both these cases is analogous - in that a has_supporting_information link from an EvidenceLine to an item of Information does convey the fact that the information supports the assessment of the EvidenceLine's direction (as opposed to disputing it). It is conceivable that one would want to cite 'disputing' information from an EvidenceLine - e.g. if a study produces several data items where all but one support the argument made by the EvidenceLine, but perhaps one result or statistical calculation does not (e.g. a p-value supports a significant difference between control and test subjects, but a z-score does not). One could still record this dissenting data item as 'disputing_information'.

bpow commented 7 years ago

I don't think that the point you make is as much a "devil's advocate perspective"-- I think it may be more consistent with the point I was (perhaps not so well) trying to make.

Looking at the current owl file, I don't see a 'disputing_information' property-- the only way to link from EvidenceLine to Information is using 'has_supporting_information'. There is no neutral or negating/disputing term to relate evidence lines to information in the current ontology.

In the ClinGen data modelling working group, we decided to use the neutral term ('has_evidence') in relating Assertions to EvidenceLines, and would probably use a neutral term in relating EvidenceLines to Information if it were available.