uclahs-cds / package-moPepGen

Multi-Omics Peptide Generator
https://uclahs-cds.github.io/package-moPepGen/
GNU General Public License v2.0
6 stars 1 forks source link

fusion + variants invalid amino acid #302

Closed lydiayliu closed 2 years ago

lydiayliu commented 2 years ago

this is the furthest we've gotten!!!

[ 2021-12-18 01:22:05 ] moPepGen callVariant started
[ 2021-12-18 01:24:55 ] Variant file /hot/users/yiyangliu/MoPepGen/Parser/Fusion/arriba-2.1.0/CPCG0100.winu.gvf loaded.
[ 2021-12-18 01:31:40 ] Variant file /hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode/gsnp/CPCG0100.gencode.tsv.gvf loaded.
[ 2021-12-18 01:32:38 ] Variant file /hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode/gindel/CPCG0100.gencode.tsv.gvf loaded.
[ 2021-12-18 01:32:38 ] Variant file /hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode/somaticsniper/CPCG0100.gencode.tsv.gvf loaded.
[ 2021-12-18 01:32:39 ] Variant file /hot/users/yiyangliu/MoPepGen/Parser/VEP/gencode/pindel/CPCG0100.gencode.tsv.gvf loaded.
[ 2021-12-18 01:34:56 ] 1000 transcripts processed.
[ 2021-12-18 01:36:01 ] 2000 transcripts processed.
[ 2021-12-18 01:37:12 ] 3000 transcripts processed.
[ 2021-12-18 01:39:20 ] 4000 transcripts processed.
[ 2021-12-18 01:40:31 ] 5000 transcripts processed.
[ 2021-12-18 01:42:32 ] 6000 transcripts processed.
[ 2021-12-18 01:43:39 ] 7000 transcripts processed.
[ 2021-12-18 01:45:07 ] 8000 transcripts processed.
[ 2021-12-18 01:47:25 ] 9000 transcripts processed.
[ 2021-12-18 01:48:33 ] 10000 transcripts processed.
[ 2021-12-18 01:50:39 ] 11000 transcripts processed.
[ 2021-12-18 01:51:59 ] 12000 transcripts processed.
[ 2021-12-18 01:54:09 ] 13000 transcripts processed.
[ 2021-12-18 01:55:22 ] 14000 transcripts processed.
[ 2021-12-18 01:56:23 ] 15000 transcripts processed.
[ 2021-12-18 01:58:17 ] 16000 transcripts processed.
[ 2021-12-18 01:59:11 ] 17000 transcripts processed.
[ 2021-12-18 02:00:08 ] 18000 transcripts processed.
[ 2021-12-18 02:00:33 ] Exception raised from ENST00000623909.1
Traceback (most recent call last):
  File "/usr/local/bin/moPepGen", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/__main__.py", line 77, in main
    args.func(args)
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/call_variant_peptide.py", line 127, in call_variant_peptide
    peptides = call_peptide_main(
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/call_variant_peptide.py", line 215, in call_peptide_main
    return pgraph.call_variant_peptides(miscleavage=miscleavage)
  File "/usr/local/lib/python3.8/site-packages/moPepGen/svgraph/PeptideVariantGraph.py", line 616, in call_variant_peptides
    self.call_and_stage_unknown_orf(
  File "/usr/local/lib/python3.8/site-packages/moPepGen/svgraph/PeptideVariantGraph.py", line 836, in call_and_stage_unknown_orf
    traversal.pool.add_miscleaved_sequences(
  File "/usr/local/lib/python3.8/site-packages/moPepGen/svgraph/VariantPeptideDict.py", line 181, in add_miscleaved_sequences
    raise ValueError('Invalid amino acid symbol found in the sequence.')
ValueError: Invalid amino acid symbol found in the sequence.
lydiayliu commented 2 years ago

0244525 I'm trying with this commit

it's strange though cuz no message since loading all the files and it's been a few hours

I have no name!@ed2e81cb1a85:/$ moPepGen callVariant \
>     --input-variant /data/Parser/Fusion/arriba-2.1.0/${c}.winu.gvf \
>         /data/Parser/VEP/gencode/gsnp/${b} \
>         /data/Parser/VEP/gencode/gindel/${b} \
>         /data/Parser/VEP/gencode/somaticsniper/${b} \
>         /data/Parser/VEP/gencode/pindel/${b} \
>     --index-dir /data/Index/GRCh38-EBI-GENCODE34/ \
>     --output-fasta /data/Variant/Fusion/arriba-2.1.0/ssm/${c}.3f.fasta
[ 2021-12-18 23:10:13 ] moPepGen callVariant started
[ 2021-12-18 23:11:43 ] Variant file /data/Parser/Fusion/arriba-2.1.0/CPCG0100.winu.gvf loaded.
[ 2021-12-18 23:18:22 ] Variant file /data/Parser/VEP/gencode/gsnp/CPCG0100.gencode.tsv.gvf loaded.
[ 2021-12-18 23:19:20 ] Variant file /data/Parser/VEP/gencode/gindel/CPCG0100.gencode.tsv.gvf loaded.
[ 2021-12-18 23:19:20 ] Variant file /data/Parser/VEP/gencode/somaticsniper/CPCG0100.gencode.tsv.gvf loaded.
[ 2021-12-18 23:19:20 ] Variant file /data/Parser/VEP/gencode/pindel/CPCG0100.gencode.tsv.gvf loaded.

had to kill it

^C[ 2021-12-19 01:01:45 ] Exception raised from ENST00000577672.1
Traceback (most recent call last):
  File "/usr/local/bin/moPepGen", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/__main__.py", line 77, in main
    args.func(args)
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/call_variant_peptide.py", line 127, in call_variant_peptide
    peptides = call_peptide_main(
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/call_variant_peptide.py", line 214, in call_peptide_main
    pgraph.create_cleavage_graph(rule=rule, exception=exception)
  File "/usr/local/lib/python3.8/site-packages/moPepGen/svgraph/PeptideVariantGraph.py", line 485, in create_cleavage_graph
    self.fit_into_cleavage_multiple_upstream(cur)
  File "/usr/local/lib/python3.8/site-packages/moPepGen/svgraph/PeptideVariantGraph.py", line 529, in fit_into_cleavage_multiple_upstream
    sites = cur.seq.find_all_cleave_and_stop_sites(rule=self.rule,
  File "/usr/local/lib/python3.8/site-packages/moPepGen/aa/AminoAcidSeqRecord.py", line 213, in find_all_cleave_and_stop_sites
    cleavage_sites = list(self.iter_enzymatic_cleave_sites(rule=rule,
KeyboardInterrupt
zhuchcn commented 2 years ago

Can you try with --verbose-level 2? By default it's 1 which doesn't print out tx IDs. It's probably running

zhuchcn commented 2 years ago

I pushed a new commit to #289 last night. There seemed to be an issue. Could you try with the latest commit?

lydiayliu commented 2 years ago

I am still seeing this, with a different transcript this time, commit ddaa904

[ 2021-12-20 03:38:06 ] 88000 transcripts processed
...
[ 2021-12-20 03:39:44 ] ENST00000449085.3
[ 2021-12-20 03:39:45 ] ENST00000607833.5
[ 2021-12-20 03:39:45 ] ENST00000464592.5
[ 2021-12-20 03:39:45 ] ENST00000463639.1
[ 2021-12-20 03:39:45 ] ENST00000469132.1
[ 2021-12-20 03:39:45 ] ENST00000229829.7
[ 2021-12-20 03:39:45 ] ENST00000490305.5
[ 2021-12-20 03:39:45 ] ENST00000419277.5
[ 2021-12-20 03:39:45 ] ENST00000479107.1
[ 2021-12-20 03:39:46 ] Exception raised from ENST00000479107.1
Traceback (most recent call last):
  File "/usr/local/bin/moPepGen", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/__main__.py", line 77, in main
    args.func(args)
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/call_variant_peptide.py", line 127, in call_variant_peptide
    peptides = call_peptide_main(
  File "/usr/local/lib/python3.8/site-packages/moPepGen/cli/call_variant_peptide.py", line 215, in call_peptide_main
    return pgraph.call_variant_peptides(miscleavage=miscleavage)
  File "/usr/local/lib/python3.8/site-packages/moPepGen/svgraph/PeptideVariantGraph.py", line 630, in call_variant_peptides
    self.call_and_stage_unknown_orf(
  File "/usr/local/lib/python3.8/site-packages/moPepGen/svgraph/PeptideVariantGraph.py", line 850, in call_and_stage_unknown_orf
    traversal.pool.add_miscleaved_sequences(
  File "/usr/local/lib/python3.8/site-packages/moPepGen/svgraph/VariantPeptideDict.py", line 181, in add_miscleaved_sequences
    raise ValueError('Invalid amino acid symbol found in the sequence.')
ValueError: Invalid amino acid symbol found in the sequence.
zhuchcn commented 2 years ago

Hmm, I extracted the variants of this transcript from the GVF files and it seems to run successfully. 🤔

[ 2021-12-20 09:54:31 ] moPepGen callPeptides started
[ 2021-12-20 09:54:31 ] Annotation GTF loaded.
[ 2021-12-20 09:54:31 ] Proteome FASTA loaded.
[ 2021-12-20 09:54:31 ] Genome assembly FASTA loaded.
[ 2021-12-20 09:54:31 ] canonical peptide pool generated.
[ 2021-12-20 09:54:31 ] Variant file test/files/comb/CPCG0100_ENST00000623909.1/gsnp.gvf loaded.
[ 2021-12-20 09:54:31 ] Variant file test/files/comb/CPCG0100_ENST00000623909.1/gindel.gvf loaded.
[ 2021-12-20 09:54:31 ] Variant file test/files/comb/CPCG0100_ENST00000623909.1/arriba.gvf loaded.
[ 2021-12-20 09:54:31 ] Variant file test/files/comb/CPCG0100_ENST00000623909.1/pindel.gvf loaded.
[ 2021-12-20 09:54:31 ] Variant file test/files/comb/CPCG0100_ENST00000623909.1/somaticsniper.gvf loaded.
[ 2021-12-20 09:54:32 ] Variant peptide FASTA file written to disk.
lydiayliu commented 2 years ago

ENST00000479107.1 ?

zhuchcn commented 2 years ago

Oh wait, what was I doing..

zhuchcn commented 2 years ago

I'm able to reproduce it. I think it's again a node being missed when creating the cleavage graph 😵

zhuchcn commented 2 years ago

I think I fixed it in ddaa904 . Seems like under certain circumstances some nodes are visited multiple times. Potentially because of too many inframe mutations in a local region.

lydiayliu commented 2 years ago

T_T got another transcript, past 88000 though!!!!

ENST00000460633.1

zhuchcn commented 2 years ago

Yeah, 33819ca messed it up. It was fine on ddaa904

zhuchcn commented 2 years ago

Think I fixed it. I'll add some test cases. Are you have to run the whole thing again? 😢

zhuchcn commented 2 years ago

Sorry it's not completely fixed yet.

zhuchcn commented 2 years ago

OK finally fixed in 06a8020

lydiayliu commented 2 years ago

ok gonna run it overnight! fingers crossed! fe2f16e

lydiayliu commented 2 years ago

no longer see this