Closed zhuchcn closed 3 years ago
@lydiayliu I looked at the GENCODE v34. A gene can have multiple transcripts. And for each gene, it has at least one transcript, that the transcription end is the same as the gene end. So the gene end (annotated by gencode) must be the end of at least one transcript.
I think there are then 2 situations, i.e. transcripts with or without mRNA_end_NF
. I think we have a conclusion for the former (with mRNA_end_NF
). And for the later one, that we now where the transcript end really is. I think the main question is, how does such a mutation affect the transcription termination. With a scamming on the literature, it seems like the transcription terminates when RNA Pol sees a terminator or transcription stop site sequence. Transcription termination is a pretty complex process, and the mechanism isn't fully clear. I would imagine that the terminator/transcription stop site sequence must have some kind of tolerance to mutations. So my feeling is, we are probably not sure what kind of mutations will cause the terminator to become unfunctional.
So the gene end (annotated by gencode) must be the end of at least one transcript.
Makes sense!
The example will result a mutated sequence of [100,200), [300,398)
Is the example a case where the variant will be ignored because of tag mRNA_end_NF?
So my feeling is, we are probably not sure what kind of mutations will cause the terminator to become unfunctional.
I agree, transcription termination sounds like a very hairy and complex process. There is no real way for us to distinguish cases where the transcription termination will stop at the known end or somewhere upstream / downstream.
Come to think about it though, for transcript frameshifts, I ignored this end problem and essentially continued extending the transcript (beyond the last exon) until I find a stop codon in the genomic sequence. I think whatever we do about the transcript end mutations should be consistent with what is currently being done with frameshifts?
Is the example a case where the variant will be ignored because of tag
mRNA_end_NF
?
Actually the variant is not ignored, but is just treated as a truncation of the transcript.
I ignored this end problem and essentially continued extending the transcript (beyond the last exon) until I find a stop codon in the genomic sequence. I think whatever we do about the transcript end mutations should be consistent with what is currently being done with frameshifts?
Good point. In moPepGen, if a mutation happens at the stop codon (which means the stop codon is known) and results a stop-lost mutation, it keeps looking for new stop codon, until the end of the transcript.
Do you think it's ok to treat all cases of stop lost / end of transcript mutations until the known end of the transcript
after which the protein is just truncated?
Any reason you think this is not appropriate?
Not at all! I think consistency is key here, we can explain away any design choices anyways XD
I do concern about the very last peptide (after cleavage). This is for any protein/transcript that the stop codon is not known. If we want, we could choose not to report the very last one. But all the other cleaved peptides should be OK?
I was thinking about the same thing. The artifically truncated peptide should not really exist in real life. I'm ok with not reporting anything that involves the very last peptide (unless there is a stop codon, for example in the 3' UTR?)
I think we are clear with gene and transcripts. I updated the top comment, as the conclusion of our discussion. Now what about exons? How should we treat the deletion that overlaps with the end (both ends) of a exon? I guess this will affect the splicing.
I want to avoid getting into the problem of splice variant consequence prediction. If the variant really altered splicing, we should be able to see it in one of the novel junction prediction tools.
The case of insertion is more striaghtforward (still use the known exon-exon junction), for deletion technically the canonical splice site could be gone altogether. I think these cases should be rare, what does VEP currently annotate them as?
That totally makes sense. I think VEP also ignores those, because in your filtered VEP results (*.gencode.aa.txt), I don't see any variants like this, but not sure if VEP has any special annotation (no access to the cluster for now)
Cluster cleared! Time for investigation? XD
Found some examples!
VEP entry, note no amino acid changes noted and ?
in coordinates:
chr1_939399_CCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA/- chr1:939399-939446 - ENSG00000187634.12 ENST00000341065.8 Transcript splice_donor_variant,coding_sequence_variant,intron_variant 416-? 417-? 139-? - - - IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34
GTF for exon:
chr1 HAVANA exon 939275 939412 . + . gene_id "ENSG00000187634.12"; transcript_id "ENST00000341065.8"; gene_type "protein_coding"; gene_name "SAMD11"; transcript_type "protein_coding"; transcript_name "SAMD11-201"; exon_number 5; exon_id "ENSE00001708361.1"; level 2; protein_id "ENSP00000349216.4"; transcript_support_level "5"; hgnc_id "HGNC:28706"; tag "mRNA_start_NF"; tag "cds_start_NF"; havana_gene "OTTHUMG00000040719.1"; havana_transcript "OTTHUMT00000097860.1";
On the acceptor end
rs139855605 chr1:9244920-9244927 - ENSG00000049239.13 ENST00000377403.7 Transcript splice_acceptor_variant,5_prime_UTR_variant,intron_variant ?-296 - - - - - IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34
rs139855605 chr1:9244920-9244927 - ENSG00000049239.13 ENST00000602477.1 Transcript splice_acceptor_variant,coding_sequence_variant,intron_variant ?-178 ?-26 ?-9 - - - IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34
I guess the commonality is that these are annotated as both coding_sequence_variant
and intron_variant
, HIGH
impact and with ?
in coordinates
Cool! And this is from the unfiltered vep result? Is it also in the filtered result (gencode.aa.txt)?
I think we want to keep intron variant, so maybe we can for now ignore any splice_acceptor_variant
and splice_donor_variant
?
Cool! And this is from the unfiltered vep result? Is it also in the filtered result (gencode.aa.txt)?
Not in gencode.aa.txt
since I filter using the Amino acids column which is -
in these entries.
I think we want to keep intron variant, so maybe we can for now ignore any splice_acceptor_variant and splice_donor_variant?
How about using the ?
in positions as the filtering criteria? I feel like splice_acceptor_variant
can be fully intron_variant
as well, this should be a special case where the mutation spans both the coding sequence and the intron?
Example: `chr1153886523-/T chr1:153886522-153886523 T ENSG00000143614.10 ENST00000634791.1 Transcript splice_acceptor_variant - - -
chr1 HAVANA exon 153886424 153886521 . - . gene_id "ENSG00000143614.10"; transcript_id "ENST00000634791.1"; gene_type "protein_coding"; gene_name "GATAD2B"; transcript_type
Seems like this variant chr1_153886523_-/T
actually alters the second nucleotide of the splicing site dinucleotides? Splicing site should be 153886522-153886523
Lol I know nothing about splicing XD It should be no different than an intron_variant
though when parsed?
Here is a crush curse on splicing RNA Splicing: Introns, Exons and Spliceosome (see section How Splicing Occurs). Should we also avoid those kind of variants? This variant, altering the consensus dinucleotide splicing site, will inhibit the splicing event here. So if we want to consider this, we should then retain the intron?
Thanks!
I think the job of predicting / detecting retained introns should be left to rMATS XD I'm leaning towards just treating this as any other intron_variant
. I don't think there will be many opportunities for it to become coding
though, unless it's in like a retained intron
or ciRNA
?
I'm not sure whether rMATS is able to call for retained introns caused by splicing site altering mutations. My concern is this type of variants might be different from other intron_variant
, it is indeed a "splicing site altering variant". But it doesn't seem to affect rMATS or circRNA prediction, so I'm OK keeping it as intron_variant.
Back to the question, I think we are agreed on ignoring the variants that spans over the splicing site. I'm a little concerning about using the question mark. I'm thinking using the annotations might be more explicit. How about we filter out all variants that has (splice_donor_variant
or splice_donor_variant
) and coding_sequence_variant
? Explicitly saying that we don't want the variant if it alters the splicing site, and it also spans to at least one exonic nucleotide.
I'm not sure whether rMATS is able to call for retained introns caused by splicing site altering mutations. My concern is this type of variants might be different from other
intron_variant
, it is indeed a "splicing site altering variant". But it doesn't seem to affect rMATS or circRNA prediction, so I'm OK keeping it as intron_variant.
Hey guys, I want to remind you of a general philosophy for this tool (which you both know already!). You will very much want it to be as tool-agnostic as humanly possible. If somebody swaps out rMATS for DEXSeq, or any other change, things should mostly go okay.
So that means the question above ("whether rMATS is able to call for retained introns caused by splice site altering mutations" is in some way irrelevant. As long as any existing or potential tool would do it, you want to code around it (within reason).
I know you know that, but a reminder since specific tools got brought up!
rMATS is faster to type than alternative splicing calling tool
XD Though I can see the confusion over lingo for readers of this thread, sorry!
OK to summarize, seems like splice_acceptor_variant
and I suppose splice_donor_variant
can exist on their own completely within the intron
. We will treat these as intron_variants
.
I like the idea of using both coding_sequence_variant
and splice_XXX
as a filter. There are also the equivalent non_coding_transcript_exon_variant
in the noncoding transcripts, for example:
chr1_48055191_AGCAGGGTGAGTA/- chr1:48055191-48055203 - ENSG00000226133.6 ENST00000439795.1 Transcript splice_donor_variant,non_coding_transcript_exon_variant,intron_variant 180-? - - - - - IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34
The counter example for the noncoding case is below, where it should be treated as an exon_variant
:
rs113565775 chr1:3062191-3062192 G ENSG00000177133.11 ENST00000420957.1 Transcript splice_region_variant,non_coding_transcript_exon_variant 202-203 - - - - - IMPACT=LOW;STRAND=-1;SOURCE=GENCODEv34
I think I finally understand the differences between these:
splice_acceptor_variant A splice variant that changes the 2 base region at the 3' end of an intron
splice_donor_variant A splice variant that changes the 2 base region at the 5' end of an intron
splice_region_variant A sequence variant in which a change has occurred within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron
Essentially the clearest cases we are ignoring are like the examples below:
splice_donor_variant,coding_sequence_variant,intron_variant
splice_acceptor_variant,coding_sequence_variant,intron_variant
splice_donor_variant,non_coding_transcript_exon_variant,intron_variant
splice_acceptor_variant,non_coding_transcript_exon_variant,intron_variant
rs143953991 chr11:77184726-77184747 - ENSG00000137474.22 ENST00000620575.4 Transcript splice_acceptor_variant,coding_sequence_variant,intron_variant ?-3779 ?-3513 ?-1171 - - - IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34
rs371278550 chr10:28941738-28941749 - ENSG00000232624.8 ENST00000646289.1 Transcript splice_acceptor_variant,non_coding_transcript_exon_variant,intron_variant ?-1175 - - - - - IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34
Found a funny one that seems to have deleted the entire intron?
chr10_2501338_ATCCTTCCTCATGGTCGCGCACCCGGTTCACCTCACCAACAAAGCGCGGTTTCCTCCCTCGCCCTTCTTCCTCGGGATCGTGCACGCGGTTCACGTCACCAACAAAGCGCGGTTTCCTCCCTCGCCAGACTTCCTCATGGTCGCGCACCCGTTTCACCTCACCAACAAAGCGCGGTTTCCTCCCTCGCG/- chr10:2501338-2501526 - ENSG00000234170.6 ENST00000665692.1 Transcript splice_donor_variant,splice_acceptor_variant,non_coding_transcript_exon_variant,intron_variant 104-166 - - - - - IMPACT=HIGH;STRAND=-1;SOURCE=GENCODEv34
chr10 HAVANA exon 2501495 2501557 . - . gene_id "ENSG00000234170.6"; transcript_id "ENST00000665692.1"; gene_type "lncRNA"; gene_name "LINC02645"; transcript_type "lncRNA"; transcript_name "LINC02645-211"; exon_number 2; exon_id "ENSE00003862263.1"; level 2; hgnc_id "HGNC:54129"; tag "basic"; havana_gene "OTTHUMG00000017549.1"; havana_transcript "OTTHUMT00000505270.1";
chr10 HAVANA exon 2501306 2501368 . - . gene_id "ENSG00000234170.6"; transcript_id "ENST00000665692.1"; gene_type "lncRNA"; gene_name "LINC02645"; transcript_type "lncRNA"; transcript_name "LINC02645-211"; exon_number 3; exon_id "ENSE00003864120.1"; level 2; hgnc_id "HGNC:54129"; tag "basic"; havana_gene "OTTHUMG00000017549.1"; havana_transcript "OTTHUMT00000505270.1";
I also found examples like these
rs1799759 chr12:9093581-9093585 - ENSG00000175899.15 ENST00000546069.1 Transcript splice_acceptor_variant,intron_variant,NMD_transcript_variant - - - - - - IMPACT=HIGH;STRAND=-1;SOURCE=GENCODEv34
This particular example seems to be completely within the intron
chr12 HAVANA UTR 9094973 9095084 . - . gene_id "ENSG00000175899.15"; transcript_id "ENST00000546069.1"; gene_type "protein_coding"; gene_name "A2M"; transcript_type "nonsense_mediated_decay"; transcript_name "A2M-213"; exon_number 5; exon_id "ENSE00003641762.1"; level 2; protein_id "ENSP00000438599.1"; transcript_support_level "5"; hgnc_id "HGNC:7"; tag "mRNA_start_NF"; tag "cds_start_NF"; havana_gene "OTTHUMG00000150267.1"; havana_transcript "OTTHUMT00000400124.1";
chr12 HAVANA UTR 9093465 9093579 . - . gene_id "ENSG00000175899.15"; transcript_id "ENST00000546069.1"; gene_type "protein_coding"; gene_name "A2M"; transcript_type "nonsense_mediated_decay"; transcript_name "A2M-213"; exon_number 6; exon_id "ENSE00003679705.1"; level 2; protein_id "ENSP00000438599.1"; transcript_support_level "5"; hgnc_id "HGNC:7"; tag "mRNA_start_NF"; tag "cds_start_NF"; havana_gene "OTTHUMG00000150267.1"; havana_transcript "OTTHUMT00000400124.1";
I don't think for us the distinction between NMD_transcript
and just plain non_coding_transcript
makes a difference... Though there are no special tag for NMD_exon_variant
unfortunately...
There are also variants spanning intron
and UTR
junctions:
chr17_7571888_TATATATATATATATATATATAAAAATATAT/- chr17:7571888-7571918 - ENSG00000161956.13 ENST00000429205.6 Transcript splice_acceptor_variant,3_prime_UTR_variant,intron_variant ?-2171 - - - - - IMPACT=HIGH;STRAND=1;SOURCE=GENCODEv34
I guess we'll just have to play by ear a little and not overly complicate things? We can always resolve more cases when they come up, but for now, when in doubt just ignore the variant?
Thanks Lydia for the looking into it closely! Seems like there are several different cases, I guess the best way to identify variants that spans over a splicing site would be that it alters at least one splicing site (or intronic to be easier) nucleotide and at least one exonic nucleotide. If a transcript has this intron retained as what is annotated in the GTF file, this variant should be retained. So basically instead of replying on VEP's annotation, we annotate it on our own.
Paul made a good point. I think we should not complete ignore those variants. The VaraintRecordPool
class has three attributes, transcriptional
, intronic
, and genetic
.
For splicing variants that are completely intronic, we can place them under the intronic
attribute. For splicing variants that are also exonic, we can't "translate" the genetic coordinate to transcriptional coordinate, we can keep them in the genetic
attribute. If a parser is able to predict the effect of splicing altering variants, we will develop it's corresponding parsers, and the effect should be represented as either insertion, deletion or substitution.
74 and #81 are similar cases, that the genomics variant (usually deletion) includes the end of a gene or transcript. Although we haven't seen it yet, but I think it'll also be an issue when the exon end is included.
1. Transcript
This is the case when the deletion overlaps with the transcript end position. A transcript annotated by GENCODE/ENSEMBL may or may not have the
mRNA_end_NF
tag, which indicates that the end of the transcription is not found or not clear. A transcript also may or may not have acds_end_NF
tag, which indicates that the end of the protein coding region (i.e., the stop codon) is not confirmed.Here is a contingency table of
mRNA_end_NF
andcds_end_NF
from GENCODE v38. Seems like if a transcript is notmRNA_end_NF
, it's most likely notcds_end_NF
, with small number of exceptions. And if a transcript ismRNA_end_NF
, it may or may not becds_end_NF
.mRNA_end_NF
mRNA_end_NF
-This is the case the transcription end is known. Because the transcription termination process is complex and still largely unclear, so it is difficult at this point to determine the effect of a deletion that overlaps with the end of the transcript. It may or may not alter the transcript sequence, it may cause the transcription termination to be latter or even earlier. So unless we are able to fully understand the transcription termination process, we shall ignore those deletions of transcripts that the transcript end is known.
Although most transcripts without the
mRNA_end_NF
tag should also not becds_end_NF
, there still can be exceptions. And in those cases, the deletion will be treated as truncation, and the last artificially truncated peptide will not be reported.mRNA_end_NF
+If the transcript has the
mRNA_end_NF
tag, the real transcription end is not confirmed. But because there is evidence that the sequence before the annotated transcription stop site is indeed transcribed, we then consider the deletion as a truncation of the transcript. This applies to bothcds_end_NF
+ and -.2. Gene
A the end of a gene must be the end of at least one transcript (a gene can have multiple transcripts). If a deletion overlaps with the gene end, we just treat it as how it overlaps the transcript end. So same as the section above.