griffithlab / pVACtools

http://www.pvactools.org
BSD 3-Clause Clear License

CannotConvertValueError for vcf file #1127

Closed lukaas33 closed 1 month ago

lukaas33 commented 1 month ago

Installation Type

Docker

pVACtools Version / Docker Image

griffithlab/pvactools:latest

Python Version

No response

Operating System

No response

Describe the bug

I have verified that pvactools is set up correctly by running the example command as specified here https://pvactools.readthedocs.io/en/latest/pvacseq/getting_started.html and seeing that predictions are created.

When running the tool on a real input file (preprocessed with VEP) taken from here https://pdmdb.cancer.gov/pdm/145666~245-R~AJA~v2.0.2.51.0~WES.vcf, the following errors are observed: CannotConvertValue: 1.00 cannot be converted to Integer, keeping as string. and then: TypeError: '>' not supported between instances of 'str' and 'int'

It seems that the sample column of this file contains frequencies between 0 and 1, and this causes an error in the parser. I have looked at the example VCF data and this does not occur there.

This issue can be avoided by adding mock sample data using vcf-genotype-annotator inputpath samplename 0/1 -o outputpath. But with this workaround not all data is used.

Is this a limitation of pvactools, a case of wrongly formatted data, or am I using the wrong command?

How to reproduce this bug

pvacseq run inputpath samplename HLA-A*02:01,HLA-B*07:02,HLA-C*07:02 NetMHC outputpath -e1 10 -e2 15 --iedb-install-directory /opt/iedb

Input files

Raw input data from https://pdmdb.cancer.gov/pdm/145666~245-R~AJA~v2.0.2.51.0~WES.vcf: 145666-245-R-AJA-v2.0.2.51.0-WES.zip

Data preprocessed with VEP: https://tuenl-my.sharepoint.com/:u:/g/personal/l_c_a_w_v_osenbruggen_student_tue_nl/EedDI_dhjThBnwuvH2PLp3ABY6lTKh9-JfuuXrVdKSBy4A?e=6AMaUy

Log output

CannotConvertValue: 1.00 cannot be converted to Integer, keeping as string. and then: TypeError: '>' not supported between instances of 'str' and 'int'

Output files

No response

susannasiebert commented 1 month ago

Thank you for your interest in pVACtools and for reaching out to us with this error. This is a small issue with the input VCF, where the meta-information about a field is defined incorrectly. In this particular case the AF FORMAT field is defined as an Integer field in its header. This is incorrect, because the field contains floating point (decimal) values. The VCF parser we use is pretty strict about casting field values to the types defined in their respective headers, so in this case it tries to convert the number in this field (a decimal) to an integer, which fails. This issue can be fixed by editing the AF header line and changing the Type to Float instead of Integer.
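The failure described above can be reproduced in plain Python. This is only a minimal sketch and not the code of the VCF parser that pVACtools uses: `cast_value` is a hypothetical helper that mimics a parser keeping a value as a string when it cannot be converted to the Type declared in the header.

```python
def cast_value(raw, declared_type):
    """Hypothetical helper: cast a raw VCF field value using its declared header Type."""
    casters = {"Integer": int, "Float": float}
    try:
        return casters[declared_type](raw)
    except ValueError:
        # Mimics the reported warning: the value cannot be converted and stays a string.
        return raw


bad = cast_value("1.00", "Integer")   # int("1.00") fails, so the string is kept
good = cast_value("1.00", "Float")    # float("1.00") succeeds

try:
    bad > 0                           # a later numeric comparison on the string
except TypeError as err:
    message = str(err)
```

With the header declared as Float, the value is numeric and a comparison such as `good > 0` works as expected.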

In addition, this field also has its Number defined incorrectly. This error pops up after fixing the Type of the AF FORMAT field. The Number for this field is set to 1, which is supposed to mean that the field will only ever contain a single number. However, the field really contains one number per alt allele. To correct this, the field header should be changed to Number=A, as per the VCF spec.
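To illustrate what Number=A means in practice, here is a small sketch that pairs each alt allele with its AF value. The function name `af_per_alt` and the example values are hypothetical and are not taken from the VCF in this issue.

```python
def af_per_alt(alt_field, af_field):
    """Pair each ALT allele with its AF value (Number=A: one value per alt allele)."""
    alts = alt_field.split(",")
    afs = [float(value) for value in af_field.split(",")]
    if len(alts) != len(afs):
        # With Number=A the counts must match; a mismatch signals a malformed record.
        raise ValueError(f"expected {len(alts)} AF values, got {len(afs)}")
    return dict(zip(alts, afs))


# Illustrative multi-allelic site with two alt alleles and two AF values.
frequencies = af_per_alt("T,G", "0.25,0.75")
```

A header with Number=1 would promise a single value here, which does not match a multi-allelic record carrying two.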

In summary, please replace the following line

##FORMAT=<ID=AF,Number=1,Type=Integer,Description="Allelic frequency for the alt alleles in the order listed">

with this line

##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allelic frequency for the alt alleles in the order listed">
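For larger files, the header replacement can be scripted instead of done by hand. The following is a minimal sketch using only the Python standard library. It assumes an uncompressed, plain-text VCF; the function names and any file paths passed to them are hypothetical.

```python
def fix_af_header(lines):
    """Yield VCF lines with the AF FORMAT header changed to Number=A and Type=Float."""
    for line in lines:
        # Only the AF FORMAT meta-information line is touched; all other lines pass through.
        if line.startswith("##FORMAT=<ID=AF,"):
            line = line.replace("Number=1", "Number=A", 1)
            line = line.replace("Type=Integer", "Type=Float", 1)
        yield line


def fix_vcf(in_path, out_path):
    """Write a copy of the VCF at in_path with the corrected AF header to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(fix_af_header(src))
```

Calling `fix_vcf("input.vcf", "fixed.vcf")` with your own paths writes a corrected copy and leaves the original file untouched, so the result can be compared before it is passed to pvacseq.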

lukaas33 commented 1 month ago

Hi,

Thank you so much! Since this was a wrongly formatted VCF file, you can close this issue, unless you want to catch this specific case.