griffithlab / pVACtools

http://www.pvactools.org
BSD 3-Clause Clear License

ValueError: invalid literal for int() with base 10: '' when running pvactools #692

Closed iichelhadi closed 3 years ago

iichelhadi commented 3 years ago

Hi Susanna, thank you for the reply. I repeated the whole process with a new reference from Ensembl, Homo_sapiens.GRCh38.dna.primary_assembly.fa, hoping it might solve the issue. For VEP I am using version 103, and for the fasta argument I use the Ensembl reference Homo_sapiens.GRCh38.dna.toplevel.fa, which in terms of DNA sequence should be identical to Homo_sapiens.GRCh38.dna.primary_assembly.fa. When I tried running pVACtools I got a different error.

Generating Variant Peptide FASTA and Key File
Traceback (most recent call last):
  File "/home/svu/phaei/.local/bin/pvacseq", line 8, in <module>
    sys.exit(main())
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/tools/pvacseq/main.py", line 95, in main
    args[0].func.main(args[1])
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/tools/pvacseq/run.py", line 122, in main
    pipeline.execute()
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/lib/pipeline.py", line 458, in execute
    self.generate_combined_fasta()
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/lib/pipeline.py", line 173, in generate_combined_fasta
    generate_combined_fasta.main(params)
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/tools/pvacseq/generate_protein_fasta.py", line 155, in main
    generate_fasta(args.flanking_sequence_length, downstream_sequence_length, temp_dir, proximal_variants_tsv)
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/tools/pvacseq/generate_protein_fasta.py", line 102, in generate_fasta
    fasta_generator.execute()
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/lib/fasta_generator.py", line 138, in execute
    position = int(line['protein_position'].split('-', 1)[0]) - 1
ValueError: invalid literal for int() with base 10: ''

The pvacseq command I ran:

    pvacseq run --iedb-install-directory /home/svu/phaei/IEDB --binding-threshold 500 --n-threads 24 --net-chop-method "cterm" --netmhc-stab --exclude-NAs \
      --normal-sample-name 2440_N vcf_files2/2440_somatic.filtered.VEP.vcf 2440_T HLA-A*11:01 MHCflurry MHCnuggetsI NetMHC NetMHCpan pvacseq/2440/

The pVACtools version I have installed is 2.0.2.

Would you still like me to open a new ticket for this issue?

regards

EL

susannasiebert commented 3 years ago

It looks like the VCF you provided is the unannotated one. Can you please also post your VEP-annotated 2440_somatic.filtered.VEP.vcf that you used as input to your run?

iichelhadi commented 3 years ago

Apologies. The VEP-annotated VCF is quite large (270 MB). Here is a OneDrive link to download it from: https://nusu-my.sharepoint.com/:u:/r/personal/e0669599_u_nus_edu/Documents/New%20Lab%20Record/Lab%20Record/El/VEP_VCF/2440_somatic.filtered.VEP.vcf?csf=1&web=1

susannasiebert commented 3 years ago

Unfortunately, I'm getting the following error when trying to access this file: We're sorry, but ssiebert@email.wustl.edu can't be found in the nusu-my.sharepoint.com directory. Please try again later, while we try to automatically fix this for you. Do you need to give me permission to view this file?

iichelhadi commented 3 years ago

Could you try this one? https://www.dropbox.com/s/s001q01y5exlnn5/2440_somatic.filtered.VEP.vcf?dl=0

susannasiebert commented 3 years ago

Yes, that worked. Thank you.

iichelhadi commented 3 years ago

Great. I hope this issue can be solved; I really need this. Good luck.

susannasiebert commented 3 years ago

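It looks like there are a few things going on with your VCF. The first is the error you posted and the other two are issues that arise after fixing the first issue.

* The variant at `chr4 112657326` is a frameshift that is also a stop_retained_variant. This variant should be skipped but isn't being caught, which causes this error. I will add handling of this variant to pVACseq, but in the meantime you should be able to just delete this variant from your VCF.

* The variant at `chr5 132595759` is a frameshift mutation but the protein change position is encoded in a way that isn't currently supported. I will add support for it and pVACseq should be able to handle this variant afterwards. If you want to be able to run your VCF, you can remove it for now if you're ok with not getting predictions for it. Otherwise you'd need to wait for the next hotfix release.

* The variant at `chr7 20785319` has a discrepancy between the genome reference and the transcript reference that cannot be resolved by pVACseq. This variant will need to be deleted.

I hope to get a new hotfix release out, probably early next week, that should address the first two issues. You might also consider running pVACseq with the --pass-only flag enabled to skip all of the variants that have filter tags for various quality issues.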
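
For anyone who needs to pre-screen several VEP-annotated VCFs for entries like the first two above, here is a rough Python sketch. It is not what pVACseq does internally; it is only a heuristic, and it assumes the VCF carries a `CSQ` INFO field whose header `Format:` string lists `Consequence`, `Feature`, and `Protein_position`. The function names `csq_fields` and `find_suspect_records` are made up for this illustration. It does not detect the reference/transcript mismatch from the third point.

```python
import re


def csq_fields(header_lines):
    """Return the CSQ subfield names declared in the VCF header, or []."""
    for line in header_lines:
        if line.startswith("##INFO=<ID=CSQ"):
            match = re.search(r'Format: ([^">]+)', line)
            if match:
                return match.group(1).strip().split("|")
    return []


def find_suspect_records(vcf_lines):
    """Yield (chrom, pos, transcript, reason) for frameshift annotations
    that look like the problem cases described above (heuristic only)."""
    fields = []
    for raw in vcf_lines:
        line = raw.rstrip("\n")
        if line.startswith("##"):
            if not fields:
                fields = csq_fields([line])
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            continue
        # INFO is the 8th column: semicolon-separated key=value pairs.
        info = dict(kv.split("=", 1) for kv in cols[7].split(";") if "=" in kv)
        for entry in info.get("CSQ", "").split(","):
            if not entry:
                continue
            ann = dict(zip(fields, entry.split("|")))
            consequences = set(ann.get("Consequence", "").split("&"))
            if "frameshift_variant" not in consequences:
                continue
            # Drop an optional "/total_length" suffix, then take the range start.
            start = ann.get("Protein_position", "").split("/", 1)[0].split("-", 1)[0]
            if "stop_retained_variant" in consequences:
                reason = "frameshift with stop_retained_variant"
            elif not start.isdecimal():
                reason = "protein position does not start with a number"
            else:
                continue
            yield cols[0], int(cols[1]), ann.get("Feature", ""), reason
```

Usage would be along the lines of `with open(path) as fh: hits = list(find_suspect_records(fh))`, then reviewing the reported positions by hand before deleting anything from the VCF.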

iichelhadi commented 3 years ago

> It looks like there are a few things going on with your VCF. The first is the error you posted and the other two are issues that arise after fixing the first issue.
>
> * The variant at `chr4 112657326` is a frameshift that is also a stop_retained_variant. This variant should be skipped but isn't being caught and causing this error. I will add handling of this variant to pVACseq but in the meantime you should be able to just delete this variant from your VCF.
>
> * The variant at `chr5 132595759` is a frameshift mutation but the protein change position is encoded in way that isn't currently supported. I will add support for it and pVACseq should be able to handle this variant afterwards. If you want to be able to run your VCF, you can remove it for now if you're ok with not getting predictions for it. Otherwise you'd need to wait for the next hotfix release.
>
> * The variant at `chr7 20785319` has a discrepancy between the genome reference and the transcript reference that cannot be resolved by pVACseq. This variant will need to be deleted.
>
> I hope to get a new hotfix release out probably early next week that should address the first two issues. You might also consider running pVACseq with the --pass-only flag enabled to skip all of the variants that have filter tags for various quality issues.

Thank you for the quick reply. I tried running with --pass-only but it doesn't seem to help. Removing the three entries does solve the issue. I have multiple VEP-annotated VCFs with the same issues. Is there a way to scan a VEP-annotated VCF and remove these entries? Is that something that VAtools can do?

iichelhadi commented 3 years ago

I get the following error when running VAtools:

Traceback (most recent call last):
  File "/home/el/miniconda3/envs/vatools/bin/ref-transcript-mismatch-reporter", line 8, in <module>
    sys.exit(main())
  File "/home/el/miniconda3/envs/vatools/lib/python3.8/site-packages/vatools/ref_transcript_mismatch_reporter.py", line 148, in main
    position = int(protein_position) - 1
ValueError: invalid literal for int() with base 10: '71/98'

I ran `pip show vatools` to check the installation and got the following output:

Name: vatools
Version: 5.0.0
Summary: A tool for annotating VCF files with expression and readcount data
Home-page: https://github.com/griffithlab/vatools
Author: Susanna Kiwala, Chris Miller
Author-email: help@vatools.org
License: MIT License
Location: /home/el/miniconda3/envs/vatools/lib/python3.8/site-packages
Requires: vcfpy, pysam, gtfparse, testfixtures, pandas
Required-by: 

Could these issues be related? Regards, El

susannasiebert commented 3 years ago

El, thank you for this issue report. It is definitely related and will require a bugfix in VAtools. I made issue https://github.com/griffithlab/VAtools/issues/49 to track this fix. I apologize for all the errors you are encountering.
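
To illustrate why both tracebacks end in the same `ValueError`: `int()` is handed a protein position string that is not a plain number, either `''` or `'71/98'`. The sketch below shows one tolerant way to read such values. It is not the actual VAtools or pVACtools fix; the helper name `parse_protein_position` is hypothetical, and it assumes the value may carry a `/total_length` suffix or be a `start-end` range.

```python
def parse_protein_position(raw):
    """Return the 0-based start index encoded in a protein position
    string, or None when no numeric start can be read from it."""
    if raw is None:
        return None
    # "71/98" -> "71"; "71-72/98" -> "71-72" -> "71"; "-/98" -> "-" -> ""
    start = raw.split("/", 1)[0].split("-", 1)[0].strip()
    if not start.isdecimal():
        return None
    return int(start) - 1


if __name__ == "__main__":
    for value in ["71/98", "71-72", "", "-/98", "12"]:
        print(repr(value), "->", parse_protein_position(value))
```

A caller can then skip, or log, any annotation for which the helper returns `None` instead of crashing partway through the file.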

iichelhadi commented 3 years ago

Thank you for your effort.

susannasiebert commented 3 years ago

I just made a new release (2.0.4) that should fix the first two problems in the above list. VAtools has also been updated to fix the error you were seeing. After running the ref-transcript-mismatch-reporter on your VCF and removing the entry from the third point in the list above you should now be able to get a successful run. I will be closing this issue but please do reopen it if you're still encountering issues.

iichelhadi commented 3 years ago

I am a bit confused. I thought the current version of VAtools is 5.0. Am I using the correct tool? https://pypi.org/project/vatools/

susannasiebert commented 3 years ago

There have been two updates:

  1. VAtools has been updated to version 5.0.1. This should resolve the `ValueError: invalid literal for int() with base 10: '71/98'` error you were seeing when running the `ref-transcript-mismatch-reporter`.
  2. pVACtools has been updated to version 2.0.4. This should resolve the `ValueError: invalid literal for int() with base 10: ''` error when running `pvacseq run` by excluding the stop_retained_variant (`chr4 112657326`) that was causing this issue. It will also solve the issue I discovered for `chr5 132595759` having an unsupported format of the protein position field.

iichelhadi commented 3 years ago

Thank you. It seems to be working!

iichelhadi commented 3 years ago

Hi Susanna, thank you for the help last time. Unfortunately, I am having the same issue with another VEP-annotated VCF file when running pVACtools.

frameshift_variant transcript does not contain a FrameshiftSequence. Skipping.
chr8 89955283 CT C ENST00000409330
frameshift_variant transcript does not contain a FrameshiftSequence. Skipping.
chr9 121192171 A AT ENST00000373840
frameshift_variant transcript does not contain a FrameshiftSequence. Skipping.
chr9 121192171 A AT ENST00000451303
Completed
Generating Variant Peptide FASTA and Key File
Traceback (most recent call last):
  File "/home/svu/phaei/.local/bin/pvacseq", line 8, in <module>
    sys.exit(main())
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/tools/pvacseq/main.py", line 95, in main
    args[0].func.main(args[1])
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/tools/pvacseq/run.py", line 122, in main
    pipeline.execute()
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/lib/pipeline.py", line 458, in execute
    self.generate_combined_fasta()
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/lib/pipeline.py", line 173, in generate_combined_fasta
    generate_combined_fasta.main(params)
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/tools/pvacseq/generate_protein_fasta.py", line 155, in main
    generate_fasta(args.flanking_sequence_length, downstream_sequence_length, temp_dir, proximal_variants_tsv)
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/tools/pvacseq/generate_protein_fasta.py", line 102, in generate_fasta
    fasta_generator.execute()
  File "/home/svu/phaei/.local/lib/python3.8/site-packages/lib/fasta_generator.py", line 138, in execute
    position = int(line['protein_position'].split('-', 1)[0]) - 1
ValueError: invalid literal for int() with base 10: ''

I have run VAtools' ref-transcript-mismatch-reporter in hard mode to generate a new curated VCF file, but the issue persists when running pVACtools. It seems to happen with only one of my current VEP-annotated VCF files. It looks like the previous issue. I checked the version of VAtools I have installed and it is 5.0.1. Any idea what the issue could be? Regards,

Elhadi

susannasiebert commented 3 years ago

Did you use pVACtools version 2.0.4 for your latest run?

iichelhadi commented 3 years ago

No, I was using version 2.0.2.

susannasiebert commented 3 years ago

You will need to use pVACtools version 2.0.4 to resolve this error.
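
A quick way to confirm which versions are active in the environment that actually runs `pvacseq` is sketched below. This is an illustration only: the helper names are hypothetical, it assumes the packages are installed under the distribution names `pvactools` and `vatools`, and it compares only the leading numeric parts of the version string.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn '2.0.4' into (2, 0, 4); stop at the first non-numeric piece."""
    numbers = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(installed, minimum):
    """True when the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_installed(package, minimum):
    """Describe whether the installed package meets the minimum version."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return f"{package} is not installed in this environment"
    status = "OK" if meets_minimum(installed, minimum) else f"older than {minimum}"
    return f"{package} {installed}: {status}"


if __name__ == "__main__":
    # Minimum versions taken from the fixes mentioned in this thread.
    print(check_installed("pvactools", "2.0.4"))
    print(check_installed("vatools", "5.0.1"))
```

Running it with the same Python interpreter that `pvacseq` uses matters, since the two tracebacks in this thread come from different environments (`~/.local` and a conda environment).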

iichelhadi commented 3 years ago

Currently running with v2.0.4. It seems to be working. Thank you.