Recording variant status of genotypes (natural or engineered)

ValWood commented 4 years ago

For pathogen and host genotypes, many natural variants occur, and for these we want to capture the differences from some nominally WT variant. We need to be able to distinguish, for any specific allele, whether a variant is naturally occurring or 'engineered'.

So, we would like an additional field in the genotype pop-up to be able to select either i) Natural variant (NV) or ii) Engineered variant (EV).

We also thought that later, if researchers took a natural variant and then engineered a different residue, we would be able to combine these as multi-allele phenotypes. I am not sure about this, though: it might imply that 2 copies of the gene are present, and I have forgotten how this is specified. If not, maybe this is something we can look into. (This is more future-proofing. The NV/EV distinction is required now, but I don't think we have examples of editing to a natural variant right now; @CuzickA can confirm.)

jseager7 commented 3 years ago

I'm not sure which idea would work best here, but I thought it was worth suggesting this alternative idea to try and move us away from needing to put too much emphasis on a WT sequence or function. This opens the can of worms about reference genomes, non-reference genomes and pan-genomes.

I don't know what the best answer is, but I suspect it would benefit community curators if we could simplify or reduce the data we need to curate (not to mention the benefit of not having to revisit every curation session). The fact that this issue has been so difficult to understand during its discussion makes me think that it may not be easy for community curators to reason about either.

Now that we have the ability to link metagenotypes to their controls, I'm not sure why it's important to continue to stress this distinction between reference strains and other strains. There are all kinds of problems with the reference strain distinction:

we already know that the coverage of reference proteomes is incomplete;
I suspect PHI-base will be covering a lot of pathogens that are less studied and probably less sequenced;
based on previous discussion, there seem to be pretty arbitrary rules about deciding which strain is the reference strain;
there are cases where the reference strain isn't even the most relevant strain for experimental study (which happened with Triticum aestivum, if I remember correctly); and so on.

It sounds like the distinction between mutations arising from natural variation (NV) and mutations caused by experiments (EV) could be useful (and it feels more straightforward), but I don't have the expertise to say how useful it is. Assuming the information is usually present in publications, it might at least be easier to curate.

jseager7 commented 3 years ago

Following the meeting today, we've decided on a simpler solution that mostly follows Alayne's suggestion.

@ValWood I'd appreciate your feedback on these suggestions, particularly points 3 and 4, because these may be difficult to change if we later decide to take another approach. Of particular importance is whether we should treat the origin of the variation as a property of the allele or the genotype, especially in cases where a multi-allele genotype contains alleles of natural origin and alleles engineered by the experiment.

1. We will focus on curating natural variation for mutant genotypes, where relevant. For example, the curator will only have to tick a box to indicate when the allele (i.e. single-allele genotype) was caused by natural variation.
2. Variation will not be specified for control metagenotypes, because it's too difficult to specify what the variation is relative to (leading us back into the WT-reference / WT-other problem). Mutant metagenotypes should not have this problem because the variation is understood to be relative to the control metagenotype.
3. If an allele is not specified to be caused by natural variation, then it is assumed to be caused by engineered variation. Currently I think the plan is to omit the tag in the case of an engineered variation, but we could default to engineered variation on new alleles if we think that would be clearer. I'm a bit concerned about extending this 'engineered by default' assumption to control metagenotypes, because in that case we truly don't know (or don't care) what the origin of the variation is.
4. Multi-allele genotypes containing at least one engineered allele will be classified as engineered genotypes. While we considered that users might want to see the origin of the variation for each allele in a genotype, we ultimately decided that the most notable case was when a phenotype arose exclusively from natural variation, and that it would be simpler (from a user interface perspective) to keep the variation at the level of the genotype, and merely classify genotypes including any form of engineered variation as engineered (or at least, non-natural).
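
To make the proposed rules easier to check, here is a minimal Python sketch of how a genotype's variant status could be derived from a per-allele "natural variant" tick box. It is only an illustration: the names `Allele`, `VariantStatus`, `genotype_variant_status` and the `is_control` argument are hypothetical and do not reflect Canto's actual code or database schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class VariantStatus(Enum):
    NATURAL = "natural"
    ENGINEERED = "engineered"


@dataclass(frozen=True)
class Allele:
    name: str
    # Hypothetical tick box: set by the curator when the allele arose from
    # natural variation. Left unticked, the allele is assumed to be engineered.
    natural_variant: bool = False


def genotype_variant_status(
    alleles: Sequence[Allele], is_control: bool = False
) -> Optional[VariantStatus]:
    """Derive a genotype-level status from its alleles (illustrative only)."""
    # No status is recorded for controls, since it is unclear what the
    # variation would be relative to. An empty genotype is treated the same
    # way (an assumption of this sketch).
    if is_control or not alleles:
        return None
    # A genotype is natural only when every allele is natural; a single
    # engineered allele makes the whole genotype engineered.
    if all(allele.natural_variant for allele in alleles):
        return VariantStatus.NATURAL
    return VariantStatus.ENGINEERED
```

Returning `None` for controls, rather than falling back to "engineered", keeps the 'engineered by default' assumption from leaking into control metagenotypes.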

ValWood commented 3 years ago

It seems that it should apply to the allele. I haven't yet annotated any multi-allele genotypes for PHI-base, but I guess there will be cases later where people have a natural variant, AND engineer another gene in the same species?

kimrutherford commented 3 years ago

Hi Val. We discussed that on the call. The consensus was that, to keep things simple, we'd attach the engineered vs natural flag to the genotypes. And if the user combines an engineered single-allele genotype and a natural one in the interface, the resulting multi-allele genotype should have the engineered flag.

So we have a plan, but I think we should have another chat about this on Skype (including you this time) before starting the implementation. It involves changes to how things are stored in the database, so it would be good to be sure we've got it right.
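
For reference, the combination rule from the call could look something like the following when the flag lives on the genotype rather than on each allele. This is a hedged sketch in Python, not the planned implementation: `Genotype`, its `engineered` field and `combine_genotypes` are made-up names, and the real storage format is still to be agreed.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Genotype:
    # Allele identifiers only; in this model the alleles carry no origin flag.
    alleles: Tuple[str, ...]
    # Hypothetical genotype-level flag: True = engineered, False = natural.
    engineered: bool


def combine_genotypes(first: Genotype, second: Genotype) -> Genotype:
    """Merge two genotypes; the result is engineered if either input is."""
    return Genotype(
        alleles=first.alleles + second.alleles,
        engineered=first.engineered or second.engineered,
    )
```

One consequence of this model is that, after combining, the stored data no longer says which of the alleles was the natural one. That is the trade-off against attaching the status to the allele.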

jseager7 commented 3 years ago

Another factor that could affect this decision is how the variant status will be displayed in the user interface, depending on whether it's linked to each allele or the combined genotype.

Linking to alleles

Linking the variant status to the allele would be unambiguous in the annotation table rows:

and also in the drop-down menu when editing annotations:

Linking to genotypes

Linking the variant status to the genotype would mean we'd have to visually delimit the variant status from the individual alleles. For the annotation table rows, we could put the variant status on its own line:

but the display for the drop-down menu wouldn't be so simple. It seems the only sensible place for the variant status is after the final allele, delimited with extra white space:

but this display could be confused with the variant status only applying to the final allele in the list (TRI5+ in the example above).

I also thought about placing the variant status after the species information, but I thought this would make it seem like the variant status related to the species or strain, instead of the genotype:
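
As a rough text-only stand-in for the display options discussed above, the following Python sketch builds the three kinds of label: status per allele, status on its own line, and status after the final allele. The function names, the separator choices and the allele name `abc1delta` are assumptions made for illustration; only `TRI5+` comes from this thread, and the exact layout in Canto may differ.

```python
from typing import Sequence, Tuple


def label_per_allele(alleles: Sequence[Tuple[str, str]]) -> str:
    """Allele-linked display: every allele is followed by its own status."""
    return ", ".join(f"{name} ({status})" for name, status in alleles)


def label_status_own_line(allele_names: Sequence[str], status: str) -> str:
    """Genotype-linked display for table rows: status on a separate line."""
    return ", ".join(allele_names) + "\n" + status


def label_status_trailing(
    allele_names: Sequence[str], status: str, gap: str = "   "
) -> str:
    """Genotype-linked display for drop-downs: status after the last allele,
    set apart only by extra white space."""
    return ", ".join(allele_names) + gap + status


if __name__ == "__main__":
    names = ["abc1delta", "TRI5+"]  # "abc1delta" is a hypothetical allele
    print(label_per_allele([("abc1delta", "engineered"), ("TRI5+", "natural")]))
    print(label_status_own_line(names, "engineered"))
    print(label_status_trailing(names, "engineered"))
```

Printing the three labels side by side shows the ambiguity: in the trailing form, nothing but white space separates the genotype's status from the final allele.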

pombase / canto

Recording variant status of genotypes (natural or engineered) #2346
