uclahs-cds / package-moPepGen

Multi-Omics Peptide Generator
https://uclahs-cds.github.io/package-moPepGen/
GNU General Public License v2.0

fusion + variants Dies at transcript #299

Closed lydiayliu closed 2 years ago

lydiayliu commented 2 years ago
[ 2021-12-16 17:41:12 ] moPepGen callVariant started
[ 2021-12-16 17:42:37 ] Variant file /hot/users/yiyangliu/MoPepGen/Parser/Fusion/arriba-2.1.0/CPCG0100.winu.gvf loaded.
[ 2021-12-16 17:48:44 ] Variant file /hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode/gsnp/CPCG0100.gencode.tsv.gvf loaded.
[ 2021-12-16 17:49:40 ] Variant file /hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode/gindel/CPCG0100.gencode.tsv.gvf loaded.
[ 2021-12-16 17:49:40 ] Variant file /hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode/somaticsniper/CPCG0100.gencode.tsv.gvf loaded.
[ 2021-12-16 17:49:40 ] Variant file /hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode/pindel/CPCG0100.gencode.tsv.gvf loaded.
[ 2021-12-16 17:50:04 ] Variant records sorted.
[ 2021-12-16 17:50:04 ] ENST00000454678.7
[ 2021-12-16 17:50:04 ] ENST00000453195.5
[ 2021-12-16 17:50:04 ] ENST00000416596.5
[ 2021-12-16 17:50:04 ] ENST00000418026.1
[ 2021-12-16 17:50:04 ] ENST00000434785.5
[ 2021-12-16 17:50:04 ] ENST00000487829.1
[ 2021-12-16 17:54:18 ] ENST00000635293.1
[ 2021-12-16 17:54:18 ] ENST00000610292.4
[ 2021-12-16 17:54:18 ] ENST00000269305.8
[ 2021-12-16 17:54:18 ] ENST00000620739.4
[ 2021-12-16 17:54:18 ] ENST00000617185.4
[ 2021-12-16 17:54:18 ] ENST00000455263.6
[ 2021-12-16 17:54:18 ] ENST00000420246.6
[ 2021-12-16 17:54:18 ] ENST00000622645.4
[ 2021-12-16 17:54:18 ] ENST00000610538.4
[ 2021-12-16 17:54:18 ] ENST00000445888.6
[ 2021-12-16 17:54:18 ] ENST00000619485.4
[ 2021-12-16 17:54:18 ] ENST00000509690.5
[ 2021-12-16 17:54:18 ] ENST00000514944.5
[ 2021-12-16 17:54:18 ] ENST00000505014.5
[ 2021-12-16 17:54:18 ] ENST00000604348.5
[ 2021-12-16 17:54:18 ] ENST00000503591.1
[ 2021-12-16 17:54:18 ] ENST00000448463.2
[ 2021-12-16 17:54:23 ] ENST00000428809.5
[ 2021-12-16 17:54:30 ] ENST00000432621.5
[ 2021-12-16 17:54:35 ] ENST00000653756.1
[ 2021-12-16 17:55:24 ] ENST00000265138.4

I've tried it twice, and both times the run gets Killed at this transcript on F32 after using all 62.76 GiB of memory. I could try on F72, but there's likely a problem?

zhuchcn commented 2 years ago

I'll look into this transcript

zhuchcn commented 2 years ago

is ENST00000265138.4 the last transcript ID being printed out?

lydiayliu commented 2 years ago

yes in both cases. the log is here: /hot/users/yiyangliu/MoPepGen/Variant/Fusion/arriba-2.1.0/ssm/CPCG0100.winu.3f.log

lydiayliu commented 2 years ago

oh that's the same transcript as #297

zhuchcn commented 2 years ago

The gene of this transcript has three fusions located in its region, two of which have only one transcript each, while the other has 21 transcripts, so in the gvf parsed from the Arriba output there are 23 entries associated with this transcript. The acceptor gene is pretty large and carries a lot of variants; some of its transcripts have more than 100 variants (snv/indel). All of this makes the graph really big: after applying all variants to the ThreeFrameTVG, there are 15600 nodes in total. This is just the unaligned variant graph, before fitting into codons. I haven't successfully translated it and created the cleavage graph yet, and I believe that one is going to be even larger.

To fully resolve this issue, I can't think of a way other than creating a 'splice graph' where each node is the sequence between any two splice sites, which is going to be a big project. Otherwise, I think we have to use some rules to limit the size of the graph, for example limiting the number of nucleotides of the acceptor transcript, or maybe limiting the number of transcripts (for example, only considering breakpoints in exons). Any thoughts?
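A rough sketch of what such rules could look like, just to make the idea concrete. Everything here is hypothetical: `FusionRecord`, `filter_fusions`, the `max_acceptor_nt` threshold, and the half-open exon intervals are assumptions for illustration, not moPepGen's actual classes or API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FusionRecord:
    """Hypothetical stand-in for one fusion entry parsed from the gvf."""
    donor_tx: str
    acceptor_tx: str
    acceptor_breakpoint: int  # breakpoint position on the acceptor side


def breakpoint_in_exon(pos: int, exons: List[Tuple[int, int]]) -> bool:
    """True if pos falls inside any half-open [start, end) exon interval."""
    return any(start <= pos < end for start, end in exons)


def filter_fusions(
    records: List[FusionRecord],
    exons_by_tx: Dict[str, List[Tuple[int, int]]],
    acceptor_lengths: Dict[str, int],
    max_acceptor_nt: int,
) -> List[FusionRecord]:
    """Keep fusions whose breakpoint is exonic and whose acceptor is short enough."""
    kept = []
    for rec in records:
        exons = exons_by_tx.get(rec.acceptor_tx, [])
        # Rule 1 (assumed): only consider breakpoints that land in an exon.
        if not breakpoint_in_exon(rec.acceptor_breakpoint, exons):
            continue
        # Rule 2 (assumed): cap the acceptor transcript length in nucleotides.
        # An acceptor with unknown length is treated as length 0 and kept.
        if acceptor_lengths.get(rec.acceptor_tx, 0) > max_acceptor_nt:
            continue
        kept.append(rec)
    return kept
```

Either rule trades completeness for graph size, so the thresholds would need to be chosen with that in mind.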

lydiayliu commented 2 years ago

To fully resolve this issue, I can't think of a way other than creating a 'splice graph' where each node is the sequence between any two splice sites, which is going to be a big project.

yeah let's put that on hold XD

I have a question actually. Sooo with the current setup, if we have a fusion event that involves 2 donor transcripts and 21 acceptor transcripts, are we doing the following:

is that correct? if so, why not pair one donor transcript with one acceptor transcript at a time?
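Roughly what I mean by pairing, as a sketch with made-up names (`iter_fusion_pairs`, `call_pairwise`, and the `build_and_call` callback are hypothetical, not the real moPepGen functions). The assumption is that each donor/acceptor pair gets its own small graph, which is built, called, and thrown away before moving on to the next pair:

```python
from itertools import product
from typing import Callable, Iterable, Iterator, Set, Tuple


def iter_fusion_pairs(
    donors: Iterable[str], acceptors: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield one (donor, acceptor) transcript pair at a time."""
    yield from product(donors, acceptors)


def call_pairwise(
    donors: Iterable[str],
    acceptors: Iterable[str],
    build_and_call: Callable[[str, str], Set[str]],
) -> Set[str]:
    """Run a per-pair caller and pool the resulting peptides.

    build_and_call is a placeholder for whatever builds the variant graph
    for a single donor/acceptor pair and returns its peptide sequences.
    """
    peptides: Set[str] = set()
    for donor, acceptor in iter_fusion_pairs(donors, acceptors):
        # Only one pair's graph needs to exist at this point.
        peptides |= build_and_call(donor, acceptor)
    return peptides
```

With 2 donors and 21 acceptors that is 42 small graphs instead of one huge one, so peak memory should track the largest single pair rather than everything combined.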

zhuchcn commented 2 years ago

That's actually not a bad idea at all!

lydiayliu commented 2 years ago

didn't die here as per f7d21e9