Recording variant status of genotypes (natural or engineered)

ValWood commented 3 years ago

For pathogens and host genotypes, many natural variants occur and for these, we want to capture the differences to some nominally WT variant. We need to be able to distinguish when a variant is naturally occurring or 'engineered' for any specific allele.

So, we would like an additional field in the genotype pop-up to be able to select one of either i) Natural variant (NV) or ii) Engineered variant (EV).

We also thought that later, if researchers took a natural variant and then engineered a different residue, we would be able to combine these as multi-allele phenotypes. I am not sure about this, though: it might imply that 2 copies of the gene are present. I have forgotten how this is specified. If not, maybe this is something we can look into. (This is more future-proofing: the NV/EV distinction is required now, but I don't think we have examples of editing to a natural variant right now. @CuzickA can confirm.)

jseager7 commented 3 years ago

This is almost certainly going to require database changes, and it sounds like it would make the most sense to record this as part of the genotype, probably on the same modal (pop-up) as where the strain is specified:

@kimrutherford Since this property seems to be specific to PHI-base, would it make sense to add a data column to the genotype table (following the convention used in the annotation table) to store all the miscellaneous data about genotypes? That way it will at least be easier to extend in future.
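A minimal sketch of the "data column" idea, assuming the column holds JSON-encoded text. The table layout, the `variant_status` key and the helper function names are all hypothetical and chosen for illustration only; they are not taken from the Canto schema.

```python
import json
import sqlite3

# Hypothetical, simplified genotype table: only the columns needed here.
# The `data` column is assumed to hold a JSON object of miscellaneous properties.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE genotype (
        genotype_id INTEGER PRIMARY KEY,
        identifier  TEXT NOT NULL UNIQUE,
        data        TEXT NOT NULL DEFAULT '{}'
    )
    """
)


def set_genotype_property(conn, identifier, key, value):
    """Merge one key/value pair into the genotype's JSON data column."""
    row = conn.execute(
        "SELECT data FROM genotype WHERE identifier = ?", (identifier,)
    ).fetchone()
    data = json.loads(row[0])
    data[key] = value
    conn.execute(
        "UPDATE genotype SET data = ? WHERE identifier = ?",
        (json.dumps(data), identifier),
    )


def get_genotype_property(conn, identifier, key, default=None):
    """Read one property back, returning `default` if it was never set."""
    row = conn.execute(
        "SELECT data FROM genotype WHERE identifier = ?", (identifier,)
    ).fetchone()
    return json.loads(row[0]).get(key, default)


conn.execute("INSERT INTO genotype (identifier) VALUES (?)", ("example-genotype-1",))
set_genotype_property(conn, "example-genotype-1", "variant_status", "NV")
```

With this layout, adding another PHI-base-specific property later would only need a new key in the JSON object, not another schema change.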

jseager7 commented 3 years ago

We also need to decide whether or not we want to include the variant type in the genotype display name, as we already do with the genotype background and allele type.

ValWood commented 3 years ago

We thought that NV or EV could be included (maybe mouse over to see the full expanded abbreviation)

jseager7 commented 3 years ago

Here's an example of what the inline genotype display name currently looks like:

TRI5+[WT product] (bkg: background) (strain: PH-1)

(note that the example is in pathogen-host mode; in single organism mode you wouldn't see the strain information)

Where would you like to include the NV or EV abbreviation in this display name? Should we introduce new brackets for it, or include it in one of the existing sets of brackets?

CuzickA commented 3 years ago

Here's an example of what the inline genotype display name currently looks like:

TRI5+[WT product] (bkg: background) (strain: PH-1)

(note that the example is in pathogen-host mode; in single organism mode you wouldn't see the strain information)

Where would you like to include the NV or EV abbreviation in this display name? Should we introduce new brackets for it, or include it in one of the existing sets of brackets?

Maybe after [WT product] before (bkg: background)?

CuzickA commented 3 years ago

We also discussed in the case of 'nv' that it would be useful to have an option for 'nv-unknown' and 'nv-known' where the sequence variation could be recorded.

ValWood commented 3 years ago

Yes, after the genotype, before the background.

Note we said this would always be required, but for WT it would not be relevant (although I guess it would not harm to add NV). We really only need NV for the known differences (i.e. amino acid change) relative to some canonical form.

ValWood commented 3 years ago

We also discussed in the case of 'nv' that it would be useful to have an option for 'nv-unknown' and 'nv-known' where the sequence variation could be recorded.

I don't remember/understand this. It would always be NV unless a change was made by an experimenter, and then it would be EV? Am I missing something?

CuzickA commented 3 years ago

We also discussed in the case of 'nv' that it would be useful to have an option for 'nv-unknown' and 'nv-known' where the sequence variation could be recorded.

I don't remember/understand this. It would always be NV unless a change was made by an experimenter, and then it would be EV? Am I missing something?

We were discussing e.g. avrSEN1, where the nv resulted in an early STOP, vs the atr1/RPP1 paper, where there were 5 different pathogen strains which would be nv but the actual changes in sequence were not recorded. In the avrSEN1 example, would we just have genotype WT, nv and strain name? Or do we also want to capture the detail of the truncation reported in the paper?

ValWood commented 3 years ago

The NV/EV can be associated with any allele type? (In fact it would not usually be WT.) This was proposed as an additional field? The allele variant would be curated (e.g. A123L) and NV could be associated with this. We decided that we wouldn't curate natural variation as WT?

CuzickA commented 3 years ago

The NV/EV can be associated with any allele type? (In fact it would not usually be WT.) This was proposed as an additional field? The allele variant would be curated (e.g. A123L) and NV could be associated with this. We decided that we wouldn't curate natural variation as WT?

Thanks @ValWood that makes sense. So for the atr1/RPP1 example it probably would be 'WT nv strain' but for the avrSen1 example we can capture the known allele variant in the genotype and label nv.

ValWood commented 3 years ago

or unknown NV. It seems inconsistent to use WT for ones we don't know.... I haven't really thought about this though.

CuzickA commented 3 years ago

Maybe we need 'ev', 'nv' for the known variant alleles and 'nvu' for those captured as WT?

jseager7 commented 3 years ago

Edit: updated to reflect that we actually agreed on two variant options.

Based on discussion in the last call, we've decided to add a new field to the genotype creation modal that will capture the variant status. My current plan is to allow two options:

Experimental variant (EV)
Natural variant (NV)

Once the variant status is selected, it will be shown in abbreviated form in the genotype display name (as EV or NV).

The field could be implemented either as radio buttons or a drop-down menu:

If we use radio buttons, there would be two mutually exclusive options for EV or NV.
If we use a drop-down menu, we would have two options in the menu (as described above), and the menu would default to a placeholder.

@ValWood @CuzickA Are you happy with the names chosen for these options? Are you happy with using a drop-down menu for the field, or would you prefer radio buttons?

@kimrutherford Once this is all decided, I'm going to need your help making schema changes to the genotype table to allow this variant information to be recorded.

jseager7 commented 3 years ago

Note also that once this is implemented, the variant status will presumably have to be retroactively applied to every existing genotype in PHI-Canto. Based on querying the JSON export, we have approximately 300 genotypes (as of 14 September 2020). It would help if we had a sensible default for the cases where the variant status hasn't been curated; presumably 'Natural variant unknown' won't be suitable if some of these old genotypes are experimental variants.

CuzickA commented 3 years ago

Hi @jseager7 thanks for writing up the above meeting notes.

I may have misunderstood but I thought that we were planning on having just the two options:

Engineered variant (EV)
Natural variant (NV)

and then in the allele type dropdown list:

wild type (reference)
wild type (other)

jseager7 commented 3 years ago

I may have misunderstood but I thought that we were planning on having just the two options

@CuzickA Thanks for the reminder. But what should we do about the old genotypes that have not been curated with a variant status? Will we need a placeholder option for these cases, or should we just leave the status empty?

CuzickA commented 3 years ago

@CuzickA Thanks for the reminder. But what should we do about the old genotypes that have not been curated with a variant status? Will we need a placeholder option for these cases, or should we just leave the status empty?

I'm not sure. What do you think will be the easiest way for me to make the appropriate edits to the genotypes? I guess if the status is empty I would know it needed updating. It would be good to have a current JSON export of all the sessions in case I need to refer back to the 'old genotypes'.

I'm also happy to have a drop down menu for the new EV NV option.

jseager7 commented 3 years ago

I'm also happy to have a drop down menu for the new EV NV option.

I think as long as we only have two options, then radio buttons would be a better choice. (If we had three or four options, then either field type would be okay.)

What do you think will be the easiest way for me to make the appropriate edits to the genotypes? I guess if the status is empty I would know it needed updating.

It might be difficult to discern whether the variant status is empty if it's only shown embedded in the display name. Compare the following two examples:

TRI5+[WT product] (bkg: background) (strain: PH-1)
TRI5+[WT product][NV] (bkg: background) (strain: PH-1)

...and that's just for a simple genotype. However, if we add a column showing the variant type to the genotype tables (on the Genotype Management page), it would be much easier to check.

Note that allowing the field to be blank for old genotypes implies that the data is optional (and more or less requires it be optional at the database level). My question is: do we want to require a variant status for all new genotypes, or can it be optional there as well?

CuzickA commented 3 years ago

...and that's just for a simple genotype. However, if we add a column showing the variant type to the genotype tables (on the Genotype Management page), it would be much easier to check.

This seems like a good idea.

Note that allowing the field to be blank for old genotypes implies that the data is optional (and more or less requires it be optional at the database level). My question is: do we want to require a variant status for all new genotypes, or can it be optional there as well?

I think we want to require a variant status for all new genotypes. Once we start trialling these new options out we should be able to flag up any genotypes that don't fit into this schema.

ValWood commented 3 years ago

note that [WT product] refers to the expression level.

Since NV/EV refers to the ~genotype~ allele, it should be

WT[NV] [WT product]

ValWood commented 3 years ago

I think we want to require a variant status for all new genotypes.

I agree, the assumption is that it will always be one or the other.
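The rule agreed here (variant status required for new genotypes, but possibly empty on older ones until they are re-curated) could be sketched roughly as follows. The `Genotype` class, its field names and both helper functions are hypothetical and only illustrate the rule; they do not describe Canto's data model.

```python
from dataclasses import dataclass
from typing import List, Optional

# The two agreed options, in abbreviated form.
VALID_VARIANT_STATUSES = {"NV", "EV"}


@dataclass
class Genotype:
    identifier: str
    # None means the status has not been curated yet (an older genotype).
    variant_status: Optional[str] = None


def validate_new_genotype(genotype: Genotype) -> None:
    """Reject a newly created genotype that has no valid variant status."""
    if genotype.variant_status not in VALID_VARIANT_STATUSES:
        raise ValueError(
            f"{genotype.identifier}: variant status must be one of "
            f"{sorted(VALID_VARIANT_STATUSES)}"
        )


def genotypes_needing_curation(genotypes: List[Genotype]) -> List[str]:
    """List identifiers of existing genotypes whose status is still empty."""
    return [g.identifier for g in genotypes if g.variant_status is None]
```

Keeping the field nullable in storage while enforcing it at creation time is one way to reconcile the two requirements; an empty status then doubles as the "needs updating" flag for older genotypes.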

CuzickA commented 3 years ago

note that [WT product] refers to the expression level.

Since NV/EV refers to the ~genotype~ allele, it should be

WT[NV] [WT product]

Yes, I agree with this.

jseager7 commented 3 years ago

Since NV/EV refers to the allele, it should be WT[NV] [WT product]

@ValWood So for the example given above (sgo1+), I'm assuming the following display name would be correct?

sgo1+[NV] [WT product]

Or did you mean the following:

sgo1+ WT[NV] [WT product]

jseager7 commented 3 years ago

Here's a mock-up of how the picker for the allele variant might look:

Note that the inputs don't have to be on the same line as shown above. We can put them on individual lines, which is more consistent with the appearance of the expression level picker:

CuzickA commented 3 years ago

Here's a mock-up of how the picker for the allele variant might look:

Note that the inputs don't have to be on the same line as shown above. We can put them on individual lines, which is more consistent with the appearance of the expression level picker:

I like the second option if we are not restricted on vertical space. Did we want the 'allele variant' to be above the 'allele type' or below it (but above the allele expression)?

CuzickA commented 3 years ago

@ValWood So for the example given above (sgo1+), I'm assuming the following display name would be correct?

sgo1+[NV] [WT product]

I think it's this option, as the '+' already indicates WT.

jseager7 commented 3 years ago

Did we want the 'allele variant' to be above the 'allele type' or below it (but above the allele expression)?

I don't think we made a decision, but if the allele variant ever constrains the allele type (meaning we won't want to show some allele types for natural variants or engineered variants), then we should show it above the Allele type field. Otherwise, it doesn't matter where it goes.

ValWood commented 3 years ago

I think it's this option, as the '+' already indicates WT.

Yes.
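As a rough illustration of the display name format agreed in this part of the thread (variant abbreviation directly after the allele, before the expression level), here is a small sketch. The function name, its parameters and the lookup table are hypothetical; only the output strings follow the examples given in the discussion.

```python
# Full names as they might appear in the picker, mapped to display abbreviations.
VARIANT_ABBREVIATIONS = {
    "natural variant": "NV",
    "engineered variant": "EV",
}


def allele_display_name(allele_name, expression, variant_status=None):
    """Build the inline allele display name.

    allele_name    -- e.g. "sgo1+" (the '+' already indicates wild type)
    expression     -- expression level label, e.g. "WT product"
    variant_status -- full variant name, or None if it has not been curated
    """
    if variant_status is None:
        # Current format, without any variant information.
        return f"{allele_name}[{expression}]"
    abbreviation = VARIANT_ABBREVIATIONS[variant_status.lower()]
    return f"{allele_name}[{abbreviation}] [{expression}]"


print(allele_display_name("sgo1+", "WT product", "Natural variant"))
print(allele_display_name("TRI5+", "WT product"))
```

Background and strain information, such as `(bkg: background) (strain: PH-1)`, would still be appended after this part of the name.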

CuzickA commented 3 years ago

If the display looks like this:

sgo1+[NV] [WT product]

with allele type 'wild type' represented by '+', how will it change with the proposed new allele types?

wild type-reference
wild type-other

CuzickA commented 3 years ago

After today's meeting we decided to compile a list of reference proteome strains and add '-ref' or similar to these strain names as a tag. We are using the reference proteome selected by UniProt. Where there is more than one strain we will select the first published genome. (Is there a ticket for this?)

We are still finding it tricky to make a decision on capturing 'variant status nv/ev' and 'wild type -ref/other'. Here is a list of possible terms that we could include in the allele type dropdown menu

wild type-ref (Query: is this for when the sequence is same as the reference proteome, rather than indicating wild type function which is captured in AE infective ability in the gene-for-gene flow??)
wt-nv-deletion (assume wt=wt-other and not wt-ref?)
wt-nv-disruption
wt-nv-unknown
wt-nv-amino acid insertion
wt-nv-amino acid substitution(s)
wt-nv-amino acid insertion and substitution
wt-nv-amino acid insertion and deletion
wt-nv-partial deletion and amino acid change
wt-nv-partial deletion, amino acid
wt-nv-nucleotide insertion
wt-nv-nucleotide substitution(s)
wt-nv-partial deletion, nucleotide
wt-nv-nonsense mutation
wt-nv-other

ev-wild type (Query: is this needed for overexpression studies?)
ev-deletion
ev-disruption
ev-unknown
ev-amino acid insertion
ev-amino acid substitution(s)
ev-amino acid insertion and substitution
ev-amino acid insertion and deletion
ev-partial deletion and amino acid change
ev-partial deletion, amino acid
ev-nucleotide insertion
ev-nucleotide substitution(s)
ev-partial deletion, nucleotide
ev-nonsense mutation
ev-transformant
ev-other

What do you think @ValWood ?

I'm still a bit confused about the capturing of 'wild type'. There seem to be several meanings:

- reference strain
- other natural strain
- function of allele (KHK)

ValWood commented 3 years ago

I'm still a bit confused about the capturing of 'wild type'.

I am still confused by the meaning of this. It should really mean one thing, or if it means 2 things we need 2 separate data-types....

ev-wild type (Query: is this needed for overexpression studies?)

I don't think so, because the expression is captured separately from the allele type.

ValWood commented 3 years ago

Another issue: if these are precomposed in a single pulldown, how will the user know what NV and EV refer to in these labels?

Also do you want wt in front of all of the natural variations? Some will not be wt?

CuzickA commented 3 years ago

I'm still a bit confused about the capturing of 'wild type'.

I am still confused by the meaning of this. It should really mean one thing, or if it means 2 things we need 2 separate data-types....

Yes, this is why we devised

wild type-reference
wild type-other

although in yesterday's discussion it sounded as if the team wanted to move away from this idea.

CuzickA commented 3 years ago

Another issue: if these are precomposed in a single pulldown, how will the user know what NV and EV refer to in these labels?

Also do you want wt in front of all of the natural variations? Some will not be wt?

I guess the NV and EV would have to be spelt out in full. (The mocked up example above with the NV, EV options would be clearer).

I wasn't sure about the 'wt' prefix. It depends on our definition of wild type. Are we defining WT as only the reference strain? or as any naturally occurring collected strain? If it is the latter, all the nv options would be WT-other.

From memory only ~50% of the species in the PHI4.8 data release had reference strain proteomes in UniProt. @jseager7 is repeating this search with the latest PHI4.10 dataset.

ValWood commented 3 years ago

I was more concerned that the WT in these cases is really an inference from WT-phenotype in another experiment. This seems a bit strange and possibly problematic, but I can see that you require the information that this allele behaves like a wt allele.

It might be OK as long as we are clear that what we are saying here is that this is a WT genotype, AND it does mean that we would not be able to capture any of the natural variation in the WT even if it is known. Although this may not matter, because for the experimental outcome it isn't important. Also, the information should be accessible, as we have the locus and strain information recorded, which is the important part. So I think it is probably fine; I am probably worrying about this unnecessarily.

However, I find it odd that the natural variant changes have wt in front of them, because the reason we are recording these changes is that they don't behave like the reference WT, so here we would be using the wt meaning differently. Shouldn't these just omit the wt and be called nv, retaining the WT designation for "any genotype that behaves like the WT reference strain"?

You should also spell out nv and ev in full in the dropdown (although these can be abbreviated in the genotype view and in the annotations).

I think users will find it confusing that if you have a non-reference strain with a 'wt'-acting allele where you did not know the sequence, your option would be wt-nv-unknown. It seems that this option should be wt-non-reference (because if it is 'unknown' sequence you won't know if there is any natural variation or not). The nv options are for where you know the sequence of a naturally occurring variant and you can record differences from the canonical WT.

Does this make sense?

ValWood commented 3 years ago

Also, we might be able to remove some infrequently used allele types and use 'other'. This selection is largely based on PomBase observed types, but you don't need to keep them all.

ValWood commented 3 years ago

I guess the NV and EV would have to be spelt out in full. (The mocked up example above with the NV, EV options would be clearer).

I wasn't sure about the 'wt' prefix. It depends on our definition of wild type. Are we defining WT as only the reference strain? or as any naturally occurring collected strain? If it is the latter, all the nv options would be WT-other.

OK, I see I am repeating myself! I don't know if that is good or bad.

ValWood commented 3 years ago

wild type-ref (Query: is this for when the sequence is same as the reference proteome, rather than indicating wild type function which is captured in AE infective ability in the gene-for-gene flow??)

Yesterday, my understanding was that you wanted to use wt for any allele which behaved like the identified wt (which might be reference or non-reference), which is fine. I keep wondering why you therefore need to distinguish between a reference-sequence and a non-reference-sequence wt. I think somebody explained this yesterday, but could we confirm the reason?

I can only think of the scenarios:

behaves like WT
or some natural variation
or some engineered variation

So we need to be clear about why we need to specify the 'reference' information in the workflow. Particularly since the information about the reference strain for the species will be recorded going forward with the strain information.

My question is: why do we need to say whether the wt is reference sequence or not? (It seems that these are really quite different things anyway; a WT allele for any given locus may or may not be WT in the reference strain.)

Am I oversimplifying by saying that all we really want to record is:

a) this locus behaves like WT for this allele (in which case we are not recording any variation info, but this would be available from the locus/strain data)
b) any natural variation that does not behave like WT
c) any engineered variation
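One possible reading of this minimal set, written out as a three-way classification. The enum, its member names and the two boolean inputs are hypothetical, and treating every engineered allele as category c) regardless of how it behaves is an assumption made for the sketch.

```python
from enum import Enum


class AlleleRecord(Enum):
    WT_ACTING = "a) behaves like WT; no variation details recorded"
    NATURAL_VARIANT = "b) natural variation that does not behave like WT"
    ENGINEERED_VARIANT = "c) engineered variation"


def classify_allele(behaves_like_wt: bool, engineered: bool) -> AlleleRecord:
    """Map an allele onto the minimal set of categories a), b) and c)."""
    if engineered:
        # Assumption: engineered changes are always recorded as c).
        return AlleleRecord.ENGINEERED_VARIANT
    if behaves_like_wt:
        return AlleleRecord.WT_ACTING
    return AlleleRecord.NATURAL_VARIANT
```

Under this reading, a natural allele whose sequence is unknown but which behaves like WT falls into a), with the locus and strain data carrying whatever sequence information is available.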

Does this minimal set of information make any of what you want to get out of the gene-for-gene annotations impossible?

CuzickA commented 3 years ago

Thanks @ValWood

So it sounds like we are moving towards defining 'WT' as the allele having 'wild type function', rather than exactly matching the sequence of a reference or other strain.

How do we conclude what the 'wild type function' of the gene is? For pathogen effectors I guess this would be to cause disease, for host resistance genes this would be to recognise at least one effector and trigger resistance. We can capture this information in the gene-for-gene AE which is good.

jseager7 commented 3 years ago

So it sounds like we are moving towards defining 'WT' as the allele having 'wild type function', rather than exactly matching the sequence of a reference or other strain.

It seems like the common (dictionary) definition of wild-type is some gene or allele that is most prevalent in a natural population. Is it likely to cause confusion if we use a definition based on gene function? Are these definitions even compatible?

ValWood commented 3 years ago

Are these definitions even compatible?

I don't know. I think the way Kim-H-K wants to use the WT in gene-for-gene might not be compatible with the use in non-gene-for-gene annotation.

I don't know how we could know which allele is most prevalent. So, as Kim said, the community usually decides on a gene-by-gene basis what is WT, based on observation.

We often have a 'known WT acting allele' and either a) we don't know the precise genotype or b) the sequence may not be exactly the same as the designated WT allele. My understanding is that in these cases we want to be able to say this is a WT-acting allele and we don't need to record any sequence detail. We are getting hung up on the semantics of how to name and define this.

We don't just want to say that it is some unknown natural variant, because in the gene-for-gene outcomes it is important to know whether the disease is a result of variation in the pathogen or host allele. In this case if the WT pathogen normally causes disease and there is no disease in the host we know this must be due to some change in the host.

Is this what we are trying to say?

WT allele: An allele that contains the designated WT sequence for a given locus, or any naturally occurring variant in a related strain which exhibits the same phenotype.

It seems that a precise definition of how PHI-base interprets a WT designation would be a good starting point. If we have this it should be easier to move to a solution.

Whatever the definition ends up being, it needs to be true across the board for gene-for-gene and for non-gene-for-gene. We can't have 2 different uses of WT, so if something different is meant we need a different label.

CuzickA commented 3 years ago

[Images: mock-ups of the past idea and of a new idea (??) based on WT allele function, followed by the AE]

In the new idea mock-up, both ATR1(emoy2) and ATR1(cala2) would be 'WT', as they can function to cause disease on the correct host strain (Nd (not shown) and Ws, respectively). ATR1(emoy2) is recognised by the host R gene in strain Ws, which blocks disease formation. If we are comparing 'cala2' to the reference strain 'emoy2' sequence we can say nv-unknown, but do we still say WT allele function? The key information here is that there is natural variation between strains emoy2 and cala2 which determines whether RPP1 from host strain Ws can recognise it to trigger defence.

CuzickA commented 3 years ago

WT allele: An allele that contains the designated WT sequence for a given locus, or any naturally occurring variant in a related strain which exhibits the same phenotype?

The tricky part with the gene-for-gene interactions is the combination of both pathogen and host strains. In the example above, both ATR1emoy2 and ATR1cala2 have the 'WT effector function' of causing disease when they are on the correct host strain to enable this.

ValWood commented 3 years ago

WT allele: An allele that contains the designated WT sequence for a given locus, or any naturally occurring variant in a related strain which exhibits the same phenotype. In pathogen-host interactions this includes an allele which is shown to induce disease on at least one susceptible host.

Note that I am not necessarily suggesting we use this as the WT definition. I'm only trying to figure out the usage scope.

CuzickA commented 3 years ago

It would be good to move forward and make a decision about this NV/EV tag. All the annotated sessions will need to be updated with NV/EV prior to making training materials and screenshots for the PHI-Canto publication.

I think the minimal information we are trying to disambiguate is this: when a genotype records an alteration, is it due to natural sequence variation between this strain and another strain, or due to engineered variation? One of the difficulties here is not linking to the exact sequence for a given strain.

Can we make a statement in our documentation that:

1) Different strain names may or may not indicate different allele sequences or function.
2) When natural variation is responsible for a known alteration within the allele this is recorded with the genotype using a tag NV to indicate the alteration occurred in the wild rather than being engineered in the lab.
3) When the NV tag is NOT used within an altered genotype, it can be assumed that the genotype was engineered in a lab.

I'm not sure if this helps the discussion but it seems to all be getting a bit complicated and I wanted to reduce it down to address the initial issue.

Let me know what you think :-)

jseager7 commented 3 years ago

Different strain names may or may not indicate different allele sequences or function.

Is this referring to the strain name on the genotype? I'm a bit concerned about the implication that the strain name doesn't indicate different allele sequences, because that begs the question: why indicate the strain at all? The only way I can see this not mattering is if the strain only contains sequence differences outside of the genes / alleles of interest (by 'alleles of interest' I mean the alleles curated in the session), but I would've expected most authors will be using a strain precisely because it contains existing variations to some allele of interest that they want to study. Is that true?

When natural variation is responsible for a known alteration within the allele this is recorded with the genotype using a tag NV to indicate the alteration occurred in the wild rather than being engineered in the lab.

This sounds fine, although we might want to define exactly what the scope of 'natural variation' is – would controlled breeding programmes count as natural variation? – and maybe include some examples of the reference point for natural variation. Maybe one example of natural variation is when a subset of a wheat population expresses greater resistance to some pathogen because of a mutation that was not experimentally (deliberately) induced.

When the NV tag is NOT used within an altered genotype, it can be assumed that the genotype was engineered in a lab.

I don't think this is necessary, because the current plan is to require a variant status for all genotypes: see https://github.com/pombase/canto/issues/2346#issuecomment-692032467. The user would be forced to pick NV or EV, so there's no need to make assumptions.

jseager7 commented 3 years ago

I also don't think we've resolved the following points from @ValWood, at least not in this issue:

It seems that a precise definition of how PHI-base interprets a WT designation would be a good starting point. If we have this it should be easier to move to a solution.

Whatever the definition ends up being it needs to be true across the board for gene-for-gene and for non-gene for gene. We can't have 2 different uses of WT so if something different is meant we need a different label.

CuzickA commented 3 years ago

Different strain names may or may not indicate different allele sequences or function.

Is this referring to the strain name on the genotype? I'm a bit concerned about the implication that the strain name doesn't indicate different allele sequences, because that begs the question: why indicate the strain at all? The only way I can see this not mattering is if the strain only contains sequence differences outside of the genes / alleles of interest (by 'alleles of interest' I mean the alleles curated in the session), but I would've expected most authors will be using a strain precisely because it contains existing variations to some allele of interest that they want to study. Is that true?

Yes, I was referring to the strain name on the genotype. And yes, the rest of your comment follows my thinking here. In most cases there will be variation in the strain alleles being studied, but I thought it would be better to keep the option open in case there is no variation within the studied gene and the strain variation is elsewhere in the genome. Some studies may collect a variety of e.g. pathogen strains from the field and test on host for phenotype. We may want to curate this information, but the authors themselves may not know whether the allele sequences are the same or not unless they sequence, and this is not always done. Again, it comes down to the difficulty of not knowing the allele sequence from the strain in many of the cases.

When the NV tag is NOT used within an altered genotype, it can be assumed that the genotype was engineered in a lab.

I don't think this is necessary, because the current plan is to require a variant status for all genotypes: see #2346 (comment). The user would be forced to pick NV or EV, so there's no need to make assumptions.

I couldn't quite decide here on whether it would be better to force a choice of NV or EV for all genotypes, or just to add NV in the examples where we have a WT strain that has a known alteration that is captured in the genotype. In these cases the allele type would not be wild type; it would be amino acid substitution or similar. I thought the NV with a clear definition would help explain these known natural variation genotypes. In cases where the strain sequence was unknown, the genotype would have the strain name and be wild type. In cases where the genotype was EV, the alleles would usually not be wild type, and if they were they would have altered expression. If we did decide to add NV/EV to all genotypes, then all of the control genotypes would need the NV tag, and then we have the issue mentioned above about specifying which strains are WT-reference or WT-other.

I'm not sure which idea would work best here, but I thought it was worth suggesting this alternative idea to try and move us away from needing to put too much emphasis on a WT sequence or function. This opens the can of worms about reference genomes, non-reference genomes and pan-genomes.

jseager7 commented 3 years ago

Some studies may collect a variety of e.g. pathogen strains from the field and test on host for phenotype. We may want to curate this information, but the authors themselves may not know whether the allele sequences are the same or not unless they sequence, and this is not always done.

Maybe this is a silly question, but if the authors don't know if the allele sequences are the same – presumably because they didn't perform any sequencing – how do they know what the strains are?
