Closed zhuchcn closed 2 years ago
Sorry, I'm confused XD. What I'm understanding is
FUSION-<donor_tx_id>:<donor_breakpoint>-<accepter_tx_id>:<donor_breakpoint>? (I'm assuming it's supposed to be <accepter_tx_id>:<accepter_breakpoint>??)
To take a step back, do we need to run filterFasta for fusions? Fusions have their own quantification scheme based on junction reads, and that should be enough for filtering "expressed" and "high confidence" fusions, right? I guess the same goes for alternative splicing and circRNA. For circRNA especially, do we want the expression of the linear transcript to be used to filter the circRNA?
Just thought of something: if people input an RSEM.gene.expression table with gene IDs instead of the RSEM.transcript.expression table that has transcript IDs, do we currently support that?
Your three points are all correct (sorry for the typo on point 3). That's a good point. I'm fine with not filtering fusion, alternative splicing, and circRNA peptides by gene expression. But do you think it's still a good idea to use gene ID in fusion peptide ID?
Just thought of something: if people input an RSEM.gene.expression table with gene IDs instead of the RSEM.transcript.expression table that has transcript IDs, do we currently support that?
No it won't work. Do you think it's necessary to support this?
But do you think it's still a good idea to use gene ID in fusion peptide ID?
I want to keep the notation of non-canonical peptides as consistent as possible XD, so we can easily tally the "number of noncanonical peptides" per transcript. But it seems like in this case it is not possible to represent fusion peptides with transcript IDs? we must use gene IDs?
No it won't work. Do you think it's necessary to support this?
I dont know actually. Will have to look at how many non-canonical peptides are shared between transcripts of the same gene. We can put it on the back burner for now.
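A rough sketch of how that check could be done, assuming we already have a mapping from each peptide sequence to the transcripts it was called from and a transcript-to-gene lookup. Both inputs and the function name `count_shared_within_gene` are hypothetical, for illustration only:

```python
from collections import defaultdict


def count_shared_within_gene(peptide_to_tx, tx_to_gene):
    """Count peptides that appear in two or more transcripts of the same gene.

    peptide_to_tx: dict of peptide sequence -> iterable of transcript IDs (assumed input)
    tx_to_gene:    dict of transcript ID -> gene ID (assumed input)
    """
    shared = 0
    for transcripts in peptide_to_tx.values():
        # Group this peptide's transcripts by their parent gene.
        per_gene = defaultdict(set)
        for tx in transcripts:
            per_gene[tx_to_gene[tx]].add(tx)
        # The peptide is "shared" if any single gene contributes 2+ transcripts.
        if any(len(txs) > 1 for txs in per_gene.values()):
            shared += 1
    return shared
```

If that count turns out to be a large fraction of all peptides, gene-level expression tables would lose less information than expected.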
But it seems like in this case it is not possible to represent fusion peptides with transcript IDs? we must use gene IDs?
The breakpoint is the tricky part. How about transcript ID + breakpoint in gene coordinates? One transcript has only one gene ID associated with it. We can maybe add a suffix of 'g' to indicate a gene coordinate, like 'FUSION-ENST0001:100g-ENST0002:200g'? Ugly?
Will have to look at how many non-canonical peptides are shared between transcripts of the same gene. We can put it on the back burner for now.
Sounds good!
We can maybe add a suffix of 'g' to indicate a gene coordinate, like 'FUSION-ENST0001:100g-ENST0002:200g'? Ugly?
Lol kind of ugly? Do we use the fusion IDs for anything? What about cases where there is a fusion with variants on it? are the variants presented in transcript coordinates or gene coordinates? different for exon vs intron variants?
What about cases where there is a fusion with variants on it?
Like this: 'FUSION-ENST0001:100-ENST0002:200|1-SNV-98-A-T|2-SNV-202-G-A'. Variants are actually all represented in gene coordinates. I think the good thing about this is that you can tell when two peptides from different transcripts of the same gene carry the exact same mutation. And intronic and exonic variants all use the same coordinate system.
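As a minimal sketch of how a label in this layout could be split into its parts: the `parse_fusion_label` helper and its field names are hypothetical, not the tool's actual API. It only handles the SNV-style variant fields shown above, and the leading number in each variant field is kept as an opaque value because its meaning isn't spelled out here.

```python
import re
from typing import List, NamedTuple


class Variant(NamedTuple):
    number: int      # leading number in the variant field (meaning assumed opaque)
    var_type: str    # e.g. "SNV"
    position: int    # gene coordinate
    ref: str
    alt: str


class FusionLabel(NamedTuple):
    donor_tx: str
    donor_breakpoint: int      # gene coordinate
    accepter_tx: str
    accepter_breakpoint: int   # gene coordinate
    variants: List[Variant]


# Assumes IDs contain no ':', '|' or '-' characters.
FUSION_RE = re.compile(r"^FUSION-([^:|-]+):(\d+)-([^:|-]+):(\d+)$")


def parse_fusion_label(label: str) -> FusionLabel:
    # The fusion part comes first; variants follow, separated by '|'.
    head, *variant_fields = label.split("|")
    match = FUSION_RE.match(head)
    if match is None:
        raise ValueError(f"not a fusion label: {label!r}")
    donor_tx, donor_bp, accepter_tx, accepter_bp = match.groups()
    variants = []
    for field in variant_fields:
        number, var_type, position, ref, alt = field.split("-")
        variants.append(Variant(int(number), var_type, int(position), ref, alt))
    return FusionLabel(donor_tx, int(donor_bp), accepter_tx, int(accepter_bp), variants)
```

Because the variant positions are gene coordinates, two labels from different transcripts of the same gene can be compared on their `variants` lists directly.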
ok gene coordinate it is! anything stopping us from using gene coordinate in everything? SNV, INDEL, rna-editing, alternative splicing, circRNA, etc?
No, everything is in gene coordinate! Just to confirm. This is what we are using?
FUSION-ENST0001:100-ENST0002:200
FUSION-ENST0001:100-ENST0002:200
I think this looks good! Were you thinking of any other variations?
Currently the variant peptides with fusion are labeled as FUSION-<donor_gene_id>:<donor_breakpoint>-<accepter_gene_id>:<accepter_breakpoint>, which causes a problem for filterFasta. In filterFasta we take a gene expression table, which has the abundance of each transcript. But the transcript ID isn't present in the fusion variant peptide label, so it can't be matched to any transcript abundance. Previously the ID was in this style: FUSION-<donor_tx_id>:<donor_breakpoint>-<accepter_tx_id>:<donor_breakpoint>. This was changed because we now consider the combination of donor and accepter transcripts when the breakpoints are intronic (fusion callers only call fusions between genes). And the breakpoints are now in gene coordinates. So maybe we can just change it back to FUSION-<donor_tx_id>:<donor_breakpoint>-<accepter_tx_id>:<donor_breakpoint>, but still use gene coordinates for the breakpoints. Let me know what you think @lydiayliu
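To illustrate the matching problem, here is a rough sketch of looking up a fusion label in a transcript-level abundance table. The function names, the dict-based table, and the rule that both transcripts must meet the cutoff are assumptions for illustration, not how filterFasta actually behaves.

```python
import re

# Captures the two IDs in front of the breakpoints; any '|variant' suffix is ignored.
TX_IN_FUSION = re.compile(r"^FUSION-([^:|-]+):\d+-([^:|-]+):\d+")


def fusion_ids(label):
    """Return the (donor, accepter) IDs embedded in a fusion label."""
    match = TX_IN_FUSION.match(label)
    if match is None:
        raise ValueError(f"not a fusion label: {label!r}")
    return match.group(1), match.group(2)


def passes_expression(label, abundance, cutoff):
    """True if both IDs in the label reach the cutoff (assumed rule).

    abundance: dict of transcript ID -> abundance; IDs missing from the
    table are treated as 0, which is what happens to gene IDs here.
    """
    return all(abundance.get(tx, 0.0) >= cutoff for tx in fusion_ids(label))
```

With a table keyed by transcript ID, a label such as 'FUSION-ENST0001:100-ENST0002:200' can be looked up, whereas a gene-ID label finds no entry and is dropped at any positive cutoff.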