Fusion variable peptide IDs make `filterFasta` fail

zhuchcn commented 2 years ago

Currently the variant peptides with fusion are labeled as FUSION-<donor_gene_id>:<donor_breakpoint>-<accepter_gene_id>:<accepter_breakpoint>, which causes a problem to filterFasta. In filterFasta we take a gene expression table, which has the abundance of each transcript. But the transcript ID isn't present in the fusion variant peptide label, so it won't be able to match to any transcript abundance. Previously the ID is in this style: FUSION-<donor_tx_id>:<donor_breakpoint>-<accepter_tx_id>:<donor_breakpoint>. This was changed because we now consider the combination of donor and accepter transcripts when the breakpoints are intronic (fusion callers only call fusion between genes). And the breakpoints are now in gene coordinate. So maybe we can just change it to this way FUSION-<donor_tx_id>:<donor_breakpoint>-<accepter_tx_id>:<donor_breakpoint> again but for the breakpoint, we still use gene coordinate. Let me know what you think @lydiayliu

lydiayliu commented 2 years ago

sorry im confused XD what I'm understanding is

need to use transcript ID for proper filter FASTA cuz we use a transcript expression table in filterFASTA
but can't use transcript ID + transcript coordinates cuz breakpoints can be intronic and thus impossible
so how would one distinguish if we are using gene coordinate or transcript coordinate if the fusion is represented as FUSION-<donor_tx_id>:<donor_breakpoint>-<accepter_tx_id>:<donor_breakpoint>? (I'm assuming it's supposed to be <accepter_tx_id>:<accepter_breakpoint>??)

to take a step back, do we need to do filterFASTA for fusion? fusion has their own quantification scheme based on junction reads, and that should be enough for filtering "expressed" and "high confidence" fusions right? I guess same for alternative splicing and circRNA. For circRNA especially, do we want the expression of the linear transcript to be used to filter for the circRNA?

just though of something, if people input a RSEM.gene.expression table with gene IDs instead of the RSEM.transcript.expression table that has transcript IDs, do we currently support that?

zhuchcn commented 2 years ago

Your three points are all correct. (sorry for the typo on point 3). That's a good point, I'm fine with not to filter fusion, alternative splicing and circRNA giving the gene expression. But do you think it's still a good idea to use gene ID in fusion peptide ID?

just though of something, if people input a RSEM.gene.expression table with gene IDs instead of the RSEM.transcript.expression table that has transcript IDs, do we currently support that?

No it won't work. Do you think it's necessary to support this?

lydiayliu commented 2 years ago

But do you think it's still a good idea to use gene ID in fusion peptide ID?

I want to keep things as consistent in the notation of non-canonical peptides as psossible XD So we can easily tally the "number of noncanonical peptides" per transcript. But it seems like in this case it is not possible to represent fusion peptides with transcript IDs? we must use gene IDs?

No it won't work. Do you think it's necessary to support this?

I dont know actually. Will have to look at how many non-canonical peptides are shared between transcripts of the same gene. We can put it on the back burner for now.

zhuchcn commented 2 years ago

But it seems like in this case it is not possible to represent fusion peptides with transcript IDs? we must use gene IDs?

The breakpoint is the tricky part. How about transcript ID + breakpoint in gene coordinate? One transcript has only one gene ID associated with it. We can maybe add a suffix of 'g' to represent it's gene coordiante, like 'FUSION-ENST0001:100g-ENST0002:200g'? Ugly?

Will have to look at how many non-canonical peptides are shared between transcripts of the same gene. We can put it on the back burner for now.

Sounds good!

lydiayliu commented 2 years ago

We can maybe add a suffix of 'g' to represent it's gene coordiante, like 'FUSION-ENST0001:100g-ENST0002:200g'? Ugly?

Lol kind of ugly? Do we use the fusion IDs for anything? What about cases where there is a fusion with variants on it? are the variants presented in transcript coordinates or gene coordinates? different for exon vs intron variants?

zhuchcn commented 2 years ago

What about cases where there is a fusion with variants on it?

Like this: 'FUSION-ENST0001:100-ENST0002:200|1-SNV-98-A-T|2-SNV-202-G-A'. Variants are actually all represented in gene coordinate. I think the good think of this is you can tell two peptides for different transcripts of the same gene that carries the exact same mutation. And then intronic and exonic variants all use the same coordinate.

lydiayliu commented 2 years ago

ok gene coordinate it is! anything stopping us from using gene coordinate in everything? SNV, INDEL, rna-editing, alternative splicing, circRNA, etc?

zhuchcn commented 2 years ago

No, everything is in gene coordinate! Just to confirm. This is what we are using?

FUSION-ENST0001:100-ENST0002:200

lydiayliu commented 2 years ago

FUSION-ENST0001:100-ENST0002:200

I think this looks good! where you thinking of any other variations?

uclahs-cds / package-moPepGen

Fusion variable peptide IDs make `filterFasta` fail #378