uclahs-cds / package-moPepGen

Multi-Omics Peptide Generator
https://uclahs-cds.github.io/package-moPepGen/
GNU General Public License v2.0
5 stars 1 forks source link

How to handle variants at edges of genes/transcripts/exons #84

Closed zhuchcn closed 2 years ago

zhuchcn commented 2 years ago

74 and #81 are similar cases, that the genomics variant (usually deletion) includes the end of a gene or transcript. Although we haven't seen it yet, but I think it'll also be an issue when the exon end is included.

1. Transcript

This is the case when the deletion overlaps with the transcript end position. A transcript annotated by GENCODE/ENSEMBL may or may not have the mRNA_end_NF tag, which indicates that the end of the transcription is not found or not clear. A transcript also may or may not have a cds_end_NF tag, which indicates that the end of the protein coding region (i.e., the stop codon) is not confirmed.

Here is a contingency table of mRNA_end_NF and cds_end_NF from GENCODE v38. Seems like if a transcript is not mRNA_end_NF, it's most likely not cds_end_NF, with small number of exceptions. And if a transcript is mRNA_end_NF, it may or may not be cds_end_NF.

cds_end_NF
mRNA_end_NF
False True
False 233084 2
True 2894 22165

mRNA_end_NF -

This is the case the transcription end is known. Because the transcription termination process is complex and still largely unclear, so it is difficult at this point to determine the effect of a deletion that overlaps with the end of the transcript. It may or may not alter the transcript sequence, it may cause the transcription termination to be latter or even earlier. So unless we are able to fully understand the transcription termination process, we shall ignore those deletions of transcripts that the transcript end is known.

Although most transcripts without the mRNA_end_NF tag should also not be cds_end_NF, there still can be exceptions. And in those cases, the deletion will be treated as truncation, and the last artificially truncated peptide will not be reported.

mRNA_end_NF +

If the transcript has the mRNA_end_NF tag, the real transcription end is not confirmed. But because there is evidence that the sequence before the annotated transcription stop site is indeed transcribed, we then consider the deletion as a truncation of the transcript. This applies to both cds_end_NF + and -.

2. Gene

A the end of a gene must be the end of at least one transcript (a gene can have multiple transcripts). If a deletion overlaps with the gene end, we just treat it as how it overlaps the transcript end. So same as the section above.

zhuchcn commented 2 years ago

@lydiayliu I looked at the GENCODE v34. A gene can have multiple transcripts. And for each gene, it has at least one transcript, that the transcription end is the same as the gene end. So the gene end (annotated by gencode) must be the end of at least one transcript.

I think there are then 2 situations, i.e. transcripts with or without mRNA_end_NF. I think we have a conclusion for the former (with mRNA_end_NF). And for the later one, that we now where the transcript end really is. I think the main question is, how does such a mutation affect the transcription termination. With a scamming on the literature, it seems like the transcription terminates when RNA Pol sees a terminator or transcription stop site sequence. Transcription termination is a pretty complex process, and the mechanism isn't fully clear. I would imagine that the terminator/transcription stop site sequence must have some kind of tolerance to mutations. So my feeling is, we are probably not sure what kind of mutations will cause the terminator to become unfunctional.

lydiayliu commented 2 years ago

So the gene end (annotated by gencode) must be the end of at least one transcript.

Makes sense!

The example will result a mutated sequence of [100,200), [300,398)

Is the example a case where the variant will be ignored because of tag mRNA_end_NF?

So my feeling is, we are probably not sure what kind of mutations will cause the terminator to become unfunctional.

I agree, transcription termination sounds like a very hairy and complex process. There is no real way for us to distinguish cases where the transcription termination will stop at the known end or somewhere upstream / downstream.

Come to think about it though, for transcript frameshifts, I ignored this end problem and essentially continued extending the transcript (beyond the last exon) until I find a stop codon in the genomic sequence. I think whatever we do about the transcript end mutations should be consistent with what is currently being done with frameshifts?

zhuchcn commented 2 years ago

Is the example a case where the variant will be ignored because of tag mRNA_end_NF?

Actually the variant is not ignored, but is just treated as a truncation of the transcript.

I ignored this end problem and essentially continued extending the transcript (beyond the last exon) until I find a stop codon in the genomic sequence. I think whatever we do about the transcript end mutations should be consistent with what is currently being done with frameshifts?

Good point. In moPepGen, if a mutation happens at the stop codon (which means the stop codon is known) and results a stop-lost mutation, it keeps looking for new stop codon, until the end of the transcript.

lydiayliu commented 2 years ago

Do you think it's ok to treat all cases of stop lost / end of transcript mutations until the known end of the transcript after which the protein is just truncated?

zhuchcn commented 2 years ago

Any reason you think this is not appropriate?

lydiayliu commented 2 years ago

Not at all! I think consistency is key here, we can explain away any design choices anyways XD

zhuchcn commented 2 years ago

I do concern about the very last peptide (after cleavage). This is for any protein/transcript that the stop codon is not known. If we want, we could choose not to report the very last one. But all the other cleaved peptides should be OK?

lydiayliu commented 2 years ago

I was thinking about the same thing. The artifically truncated peptide should not really exist in real life. I'm ok with not reporting anything that involves the very last peptide (unless there is a stop codon, for example in the 3' UTR?)

zhuchcn commented 2 years ago

I think we are clear with gene and transcripts. I updated the top comment, as the conclusion of our discussion. Now what about exons? How should we treat the deletion that overlaps with the end (both ends) of a exon? I guess this will affect the splicing.

lydiayliu commented 2 years ago

I want to avoid getting into the problem of splice variant consequence prediction. If the variant really altered splicing, we should be able to see it in one of the novel junction prediction tools.

The case of insertion is more striaghtforward (still use the known exon-exon junction), for deletion technically the canonical splice site could be gone altogether. I think these cases should be rare, what does VEP currently annotate them as?

zhuchcn commented 2 years ago

That totally makes sense. I think VEP also ignores those, because in your filtered VEP results (*.gencode.aa.txt), I don't see any variants like this, but not sure if VEP has any special annotation (no access to the cluster for now)

lydiayliu commented 2 years ago

Cluster cleared! Time for investigation? XD

lydiayliu commented 2 years ago

Found some examples!

VEP entry, note no amino acid changes noted and ? in coordinates:

chr1_939399_CCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA/-  chr1:939399-939446      -       ENSG00000187634.12      ENST00000341065.8       Transcript      splice_donor_variant,coding_sequence_variant,intron_variant     416-?   417-?   139-?   -       -       -       IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34

GTF for exon:

chr1    HAVANA  exon    939275  939412  .       +       .       gene_id "ENSG00000187634.12"; transcript_id "ENST00000341065.8"; gene_type "protein_coding"; gene_name "SAMD11"; transcript_type "protein_coding"; transcript_name "SAMD11-201"; exon_number 5; exon_id "ENSE00001708361.1"; level 2; protein_id "ENSP00000349216.4"; transcript_support_level "5"; hgnc_id "HGNC:28706"; tag "mRNA_start_NF"; tag "cds_start_NF"; havana_gene "OTTHUMG00000040719.1"; havana_transcript "OTTHUMT00000097860.1";

On the acceptor end

rs139855605     chr1:9244920-9244927    -       ENSG00000049239.13      ENST00000377403.7       Transcript      splice_acceptor_variant,5_prime_UTR_variant,intron_variant      ?-296   -       -       -       -       -       IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34
rs139855605     chr1:9244920-9244927    -       ENSG00000049239.13      ENST00000602477.1       Transcript      splice_acceptor_variant,coding_sequence_variant,intron_variant  ?-178   ?-26    ?-9     -       -       -       IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34

I guess the commonality is that these are annotated as both coding_sequence_variant and intron_variant, HIGH impact and with ? in coordinates

zhuchcn commented 2 years ago

Cool! And this is from the unfiltered vep result? Is it also in the filtered result (gencode.aa.txt)?

zhuchcn commented 2 years ago

I think we want to keep intron variant, so maybe we can for now ignore any splice_acceptor_variant and splice_donor_variant?

lydiayliu commented 2 years ago

Cool! And this is from the unfiltered vep result? Is it also in the filtered result (gencode.aa.txt)?

Not in gencode.aa.txt since I filter using the Amino acids column which is - in these entries.

I think we want to keep intron variant, so maybe we can for now ignore any splice_acceptor_variant and splice_donor_variant?

How about using the ? in positions as the filtering criteria? I feel like splice_acceptor_variant can be fully intron_variant as well, this should be a special case where the mutation spans both the coding sequence and the intron?

Example: `chr1153886523-/T chr1:153886522-153886523 T ENSG00000143614.10 ENST00000634791.1 Transcript splice_acceptor_variant - - -

zhuchcn commented 2 years ago
chr1    HAVANA  exon    153886424       153886521       .       -       .       gene_id "ENSG00000143614.10"; transcript_id "ENST00000634791.1"; gene_type "protein_coding"; gene_name "GATAD2B"; transcript_type

Seems like this variant chr1_153886523_-/T actually alters the second nucleotide of the splicing site dinucleotides? Splicing site should be 153886522-153886523

lydiayliu commented 2 years ago

Lol I know nothing about splicing XD It should be no different than an intron_variant though when parsed?

zhuchcn commented 2 years ago

Here is a crush curse on splicing RNA Splicing: Introns, Exons and Spliceosome (see section How Splicing Occurs). Should we also avoid those kind of variants? This variant, altering the consensus dinucleotide splicing site, will inhibit the splicing event here. So if we want to consider this, we should then retain the intron?

lydiayliu commented 2 years ago

Thanks!

I think the job of predicting / detecting retained introns should be left to rMATS XD I'm leaning towards just treating this as any other intron_variant. I don't think there will be many opportunities for it to become coding though, unless it's in like a retained intron or ciRNA?

zhuchcn commented 2 years ago

I'm not sure whether rMATS is able to call for retained introns caused by splicing site altering mutations. My concern is this type of variants might be different from other intron_variant, it is indeed a "splicing site altering variant". But it doesn't seem to affect rMATS or circRNA prediction, so I'm OK keeping it as intron_variant.

Back to the question, I think we are agreed on ignoring the variants that spans over the splicing site. I'm a little concerning about using the question mark. I'm thinking using the annotations might be more explicit. How about we filter out all variants that has (splice_donor_variant or splice_donor_variant) and coding_sequence_variant? Explicitly saying that we don't want the variant if it alters the splicing site, and it also spans to at least one exonic nucleotide.

pboutros commented 2 years ago

I'm not sure whether rMATS is able to call for retained introns caused by splicing site altering mutations. My concern is this type of variants might be different from other intron_variant, it is indeed a "splicing site altering variant". But it doesn't seem to affect rMATS or circRNA prediction, so I'm OK keeping it as intron_variant.

Hey guys, I want to remind you of a general philosophy for this tool (which you both know already!). You will very much want it to be as tool-agnostic as humanly possible. If somebody swaps out rMATS for DEXSeq, or any other change, things should mostly go okay.

So that means the question above ("whether rMATS is able to call for retained introns caused by splice site altering mutations" is in some way irrelevant. As long as any existing or potential tool would do it, you want to code around it (within reason).

I know you know that, but a reminder since specific tools got brought up!

lydiayliu commented 2 years ago

rMATS is faster to type than alternative splicing calling tool XD Though I can see the confusion over lingo for readers of this thread, sorry!

OK to summarize, seems like splice_acceptor_variant and I suppose splice_donor_variant can exist on their own completely within the intron. We will treat these as intron_variants.

I like the idea of using both coding_sequence_variant and splice_XXX as a filter. There are also the equivalent non_coding_transcript_exon_variant in the noncoding transcripts, for example: chr1_48055191_AGCAGGGTGAGTA/- chr1:48055191-48055203 - ENSG00000226133.6 ENST00000439795.1 Transcript splice_donor_variant,non_coding_transcript_exon_variant,intron_variant 180-? - - - - - IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34

The counter example for the noncoding case is below, where it should be treated as an exon_variant: rs113565775 chr1:3062191-3062192 G ENSG00000177133.11 ENST00000420957.1 Transcript splice_region_variant,non_coding_transcript_exon_variant 202-203 - - - - - IMPACT=LOW;STRAND=-1;SOURCE=GENCODEv34

I think I finally understand the differences between these:

splice_acceptor_variant  A splice variant that changes the 2 base region at the 3' end of an intron
splice_donor_variant    A splice variant that changes the 2 base region at the 5' end of an intron
splice_region_variant   A sequence variant in which a change has occurred within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron

Essentially the clearest cases we are ignoring are like the examples below:

splice_donor_variant,coding_sequence_variant,intron_variant
splice_acceptor_variant,coding_sequence_variant,intron_variant
splice_donor_variant,non_coding_transcript_exon_variant,intron_variant
splice_acceptor_variant,non_coding_transcript_exon_variant,intron_variant

rs143953991 chr11:77184726-77184747 - ENSG00000137474.22 ENST00000620575.4 Transcript splice_acceptor_variant,coding_sequence_variant,intron_variant ?-3779 ?-3513 ?-1171 - - - IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34

rs371278550 chr10:28941738-28941749 - ENSG00000232624.8 ENST00000646289.1 Transcript splice_acceptor_variant,non_coding_transcript_exon_variant,intron_variant ?-1175 - - - - - IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34

Found a funny one that seems to have deleted the entire intron? chr10_2501338_ATCCTTCCTCATGGTCGCGCACCCGGTTCACCTCACCAACAAAGCGCGGTTTCCTCCCTCGCCCTTCTTCCTCGGGATCGTGCACGCGGTTCACGTCACCAACAAAGCGCGGTTTCCTCCCTCGCCAGACTTCCTCATGGTCGCGCACCCGTTTCACCTCACCAACAAAGCGCGGTTTCCTCCCTCGCG/- chr10:2501338-2501526 - ENSG00000234170.6 ENST00000665692.1 Transcript splice_donor_variant,splice_acceptor_variant,non_coding_transcript_exon_variant,intron_variant 104-166 - - - - - IMPACT=HIGH;STRAND=-1;SOURCE=GENCODEv34

chr10   HAVANA  exon    2501495 2501557 .       -       .       gene_id "ENSG00000234170.6"; transcript_id "ENST00000665692.1"; gene_type "lncRNA"; gene_name "LINC02645"; transcript_type "lncRNA"; transcript_name "LINC02645-211"; exon_number 2; exon_id "ENSE00003862263.1"; level 2; hgnc_id "HGNC:54129"; tag "basic"; havana_gene "OTTHUMG00000017549.1"; havana_transcript "OTTHUMT00000505270.1";
chr10   HAVANA  exon    2501306 2501368 .       -       .       gene_id "ENSG00000234170.6"; transcript_id "ENST00000665692.1"; gene_type "lncRNA"; gene_name "LINC02645"; transcript_type "lncRNA"; transcript_name "LINC02645-211"; exon_number 3; exon_id "ENSE00003864120.1"; level 2; hgnc_id "HGNC:54129"; tag "basic"; havana_gene "OTTHUMG00000017549.1"; havana_transcript "OTTHUMT00000505270.1";
lydiayliu commented 2 years ago

I also found examples like these rs1799759 chr12:9093581-9093585 - ENSG00000175899.15 ENST00000546069.1 Transcript splice_acceptor_variant,intron_variant,NMD_transcript_variant - - - - - - IMPACT=HIGH;STRAND=-1;SOURCE=GENCODEv34 This particular example seems to be completely within the intron

chr12   HAVANA  UTR     9094973 9095084 .       -       .       gene_id "ENSG00000175899.15"; transcript_id "ENST00000546069.1"; gene_type "protein_coding"; gene_name "A2M"; transcript_type "nonsense_mediated_decay"; transcript_name "A2M-213"; exon_number 5; exon_id "ENSE00003641762.1"; level 2; protein_id "ENSP00000438599.1"; transcript_support_level "5"; hgnc_id "HGNC:7"; tag "mRNA_start_NF"; tag "cds_start_NF"; havana_gene "OTTHUMG00000150267.1"; havana_transcript "OTTHUMT00000400124.1";
chr12   HAVANA  UTR     9093465 9093579 .       -       .       gene_id "ENSG00000175899.15"; transcript_id "ENST00000546069.1"; gene_type "protein_coding"; gene_name "A2M"; transcript_type "nonsense_mediated_decay"; transcript_name "A2M-213"; exon_number 6; exon_id "ENSE00003679705.1"; level 2; protein_id "ENSP00000438599.1"; transcript_support_level "5"; hgnc_id "HGNC:7"; tag "mRNA_start_NF"; tag "cds_start_NF"; havana_gene "OTTHUMG00000150267.1"; havana_transcript "OTTHUMT00000400124.1";

I don't think for us the distinction between NMD_transcript and just plain non_coding_transcript makes a difference... Though there are no special tag for NMD_exon_variant unfortunately...

There are also variants spanning intron and UTR junctions: chr17_7571888_TATATATATATATATATATATAAAAATATAT/- chr17:7571888-7571918 - ENSG00000161956.13 ENST00000429205.6 Transcript splice_acceptor_variant,3_prime_UTR_variant,intron_variant ?-2171 - - - - - IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34

I guess we'll just have to play by ear a little and not overly complicate things? We can always resolve more cases when they come up, but for now, when in doubt just ignore the variant?

zhuchcn commented 2 years ago

Thanks Lydia for the looking into it closely! Seems like there are several different cases, I guess the best way to identify variants that spans over a splicing site would be that it alters at least one splicing site (or intronic to be easier) nucleotide and at least one exonic nucleotide. If a transcript has this intron retained as what is annotated in the GTF file, this variant should be retained. So basically instead of replying on VEP's annotation, we annotate it on our own.

Paul made a good point. I think we should not complete ignore those variants. The VaraintRecordPool class has three attributes, transcriptional, intronic, and genetic.

https://github.com/uclahs-cds/private-moPepGen/blob/ee77c1321bea041e56066fb0075001c5428beada/moPepGen/seqvar/VariantRecordPool.py#L16-L29

For splicing variants that are completely intronic, we can place them under the intronic attribute. For splicing variants that are also exonic, we can't "translate" the genetic coordinate to transcriptional coordinate, we can keep them in the genetic attribute. If a parser is able to predict the effect of splicing altering variants, we will develop it's corresponding parsers, and the effect should be represented as either insertion, deletion or substitution.