griffithlab / VAtools

A set of tools to annotate VCF files with expression and readcount data
http://www.vatools.org
MIT License

ValueError: invalid literal for int() with base 10: '"0' when using vcf-expression-annotator #73

Closed bobojin46 closed 9 months ago

bobojin46 commented 1 year ago

I got the following error when running vcf-expression-annotator

/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/header.py:413: FieldInfoNotFound: INFO "CONTQ not found using String/'.' instead
  warnings.warn(
/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/parser.py:251: CannotConvertValue: 19.4" cannot be converted to Float, keeping as string.
  warnings.warn(
/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/parser.py:251: CannotConvertValue: 0" cannot be converted to Integer, keeping as string.
  warnings.warn(
Traceback (most recent call last):
  File "/home/bobojin/software/miniconda3/envs/vatools/bin/vcf-expression-annotator", line 8, in <module>
    sys.exit(main())
  File "/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vatools/vcf_expression_annotator.py", line 202, in main
    for entry in vcf_reader:
  File "/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/reader.py", line 175, in __next__
    result = self.parser.parse_next_record()
  File "/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/parser.py", line 802, in parse_next_record
    return self.parse_line(self._read_next_line())
  File "/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/parser.py", line 793, in parse_line
    return self._record_parser.parse_line(line)
  File "/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/parser.py", line 467, in parse_line
    calls = self._handle_calls(alts, format_, arr[8], arr)
  File "/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/parser.py", line 479, in _handle_calls
    call = record.Call(sample, data)
  File "/home/bobojin/software/miniconda3/envs/vatools/lib/python3.8/site-packages/vcfpy/record.py", line 238, in __init__
    self.gt_alleles.append(int(allele))
ValueError: invalid literal for int() with base 10: '"0'

I have upgraded vatools to version 5.1.0 with pip, as @susannasiebert suggested. I have also tried running ref-transcript-mismatch-reporter crc01_somatic_annotated.vcf -f hard but got almost the same error. So is the problem with my VEP-annotated VCF? Contrary to the warning message, the input VCF is VEP-annotated and does contain the CONTQ INFO field.

I would appreciate any suggestions for solving this issue. Thank you.

susannasiebert commented 1 year ago

From the stacktrace it looks like maybe one of the genotype fields is formatted badly. Would you be able to attach your VCF to this issue so I can have a closer look?
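To illustrate why a stray quotation mark in a genotype field would produce exactly this ValueError, here is a minimal Python sketch. The helper parse_gt_alleles is hypothetical: it only mimics the int(allele) call visible at the bottom of the traceback and is not vcfpy's actual implementation.

```python
import re


def parse_gt_alleles(gt):
    """Split a GT string on '/' or '|' and convert each allele to int.

    Hypothetical helper for illustration only; '.' (missing) becomes None.
    """
    alleles = []
    for allele in re.split(r"[/|]", gt):
        if allele == ".":
            alleles.append(None)
        else:
            # int() raises ValueError if the allele carries a stray character
            alleles.append(int(allele))
    return alleles


print(parse_gt_alleles("0/1"))  # a well-formed genotype parses cleanly

try:
    # A sample column that starts with a quotation mark, as in '"0/0:16,0:...'
    parse_gt_alleles('"0/0')
except ValueError as err:
    print(err)
```

The first allele of the quoted genotype is the two-character string "0 (quotation mark, then zero), which int() cannot convert.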

bobojin46 commented 1 year ago

crc01_somatic_annotated.zip this is my VCF file

susannasiebert commented 1 year ago

Looking at this VCF, it does indeed look like it is badly formatted. Take, for example, this entry:

chr1 146987573 . CTT C . PASS "CONTQ=93;DP=74;ECNT=1;GERMQ=64;MBQ=37,34;MFRL=344,363;MMQ=31,40;MPOS=15;NALOD=1.25;NLOD=4.5;POPAF=5.6;RPA=4,2;RU=T;SEQQ=93;STR;STRANDQ=53;STRQ=93;TLOD=19.4";CSQ=-|intron_variant|MODIFIER|NBPF12|ENSG00000268043|Transcript|ENST00000698835.1|protein_coding||28/36|ENST00000698835.1:c.3116+291_3116+292del|||||||||1||HGNC|HGNC:24297||2||MVVSAGPWSSEKAEMNILEINEKLRPQLAENKQQFRNLKERCFLTQLAGFLANRQKKYKYEECKDLIKFMLRNERQFKEEKLAEQLKQAEELRQYKVLVHSQERELTQLREKLREGRDASRSLNEHLQALLTPDEPDKSQGQDLQEQLAEGCRLAQQLVQKLSPENDEDEDEDVQVEEDEKVLESSAPREVQKAEESKVPEDSLEECAITCSNSHGPCDSIQPHKNIKITFEEDKVNSTVVVDRKSSHDECQDALNILPVPGPTSSATNVSMVVSAGPLSSEKAEMNILEINEKLRPQLAEKKQQFRSLKEKCFVTQLAGFLAKQQNKYKYEECKDLIKSMLRNELQFKEEKLAEQLKQAEELRQYKVLVHSQERELTQLREKLREGRDASRSLNEHLQALLTPDEPDKSQGQDLQEQLAEGCRLAQHLVQKLSPENDEDEDEDVQVEEDEKVLESSSPREMQKAEESKVPEDSLEECAITCSNSHGPCDSNQPHKNIKITFEEDKVNSSLVVDRESSHDECQDALNILPVPGPTSSATNVSMVVSAGPLSSEKAEMNILEINEKLRPQLAEKKQQFRSLKEKCFVTQVACFLAKQQNKYKYEECKDLLKSMLRNELQFKEEKLAEQLKQAEELRQYKVLVHSQERELTQLREKLREGRDASRSLNEHLQALLTPDEPDKSQGQDLQEQLAEGCRLAQHLVQKLSPENDNDDDEDVQVEVAEKVQKSSSPREMQKAEEKEVPEDSLEECAITCSNSHGPYDSNQPHRKTKITFEEDKVDSTLIGSSSHVEWEDAVHIIPENESDDEEEEEKGPVSPRNLQESEEEEVPQESWDEGYSTLSIPPERLASYQSYSSTFHSLEEQQVCMAVDIGRHRWDQVKKEDQEATGPRLSRELLAEKEPEVLQDSLDRCYSTPSVYLGLTDSCQPYRSAFYVLEQQRVGLAVDMDEIEKYQEVEEDQDPSCPRLSRELLAEKEPEVLQDSLDRCYSTPSGYLELPDLGQPYRSAVYSLEEQYLGLALDVDRIKKDQEEEEDQGPPCPRLSRELLEVVEPEVLQDSLDRCYSTPSSCLEQPDSCQPYRSSFYALEEKHVGFSLDVGEIEKKGKGKKRRGRRSKKKRRRGRKEGEEDQNPPCPRLSRELLAEKEPEVLQDSLDRWYSTPSVYLGLTDPCQPYRSAFYVLEQQRVGLAVDMDEIEKYQEVEEDQDPSCPRLSRELLAEKEPEVLQDSLDRCYSTPSGYLELPDLGQPYRSAVYSLEEQYLGLALDVDRIKKDQEEEEDQGPPCPRLSRELLEVVEPEVLQDSLDRCYSTPSSCLEQPDSCQPYRSSFYALEEKHVGFSLDVGEIEKKGKGKKRRGRRSKKKRRRGRKEGEEDQNPPCPRLNSVLMEVEEPEVLQDSLDRCYSTPSMYFELPDSFQHYRSVFYSFEEQHITFALDMDNSFFTLTVTSLHLVFQMGVIFPQ GT:AD:AF:DP:F1R2:F2R1:OBAM:OBAMRC:SB "0/0:16,0:0.053:16:6,0:10,0:false:false:6,10,0,0" "0/1:37,7:0.17:44:18,4:19,3:false:false:11,26,2,5"

Note the quotation marks around the sample information (e.g., "0/0:16,0:0.053:16:6,0:10,0:false:false:6,10,0,0"). There are also quotation marks around parts of the INFO field - the part that was there before VEP annotation: "CONTQ=93;DP=74;ECNT=1;GERMQ=64;MBQ=37,34;MFRL=344,363;MMQ=31,40;MPOS=15;NALOD=1.25;NLOD=4.5;POPAF=5.6;RPA=4,2;RU=T;SEQQ=93;STR;STRANDQ=53;STRQ=93;TLOD=19.4". Those are all invalid. I'm not sure how this VCF was created but all of these need to be removed. I would carefully evaluate the pipeline that created this VCF to determine what caused these quotation marks to be added in the first place.
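As a stopgap while the pipeline is being investigated, the stray quotation marks could be located and stripped with a short Python script along these lines. This is only a sketch under assumptions: the function names are made up for illustration, and it assumes that no data line is supposed to contain a double quote, which holds for the entry above but may not hold for every VCF. Header lines are left untouched because they legitimately use quotation marks (e.g., Description="...").

```python
def find_quoted_records(lines):
    """Return 1-based line numbers of data lines that contain a double quote."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if not line.startswith("#") and '"' in line
    ]


def strip_stray_quotes(line):
    """Remove double quotes from a data line; return header lines unchanged.

    Assumption: data lines never legitimately contain a double quote.
    """
    if line.startswith("#"):
        return line
    return line.replace('"', "")


# Shortened, hypothetical record modelled on the entry above
record = 'chr1\t146987573\t.\tCTT\tC\t.\tPASS\t"DP=74;TLOD=19.4";CSQ=x\tGT:AD\t"0/0:16,0"\t"0/1:37,7"'
header = '##INFO=<ID=DP,Number=1,Type=Integer,Description="Depth">'

print(find_quoted_records([header, record]))
print(strip_stray_quotes(record))
```

Writing the cleaned lines to a new file rather than overwriting the input keeps the original available for tracking down where the quotation marks came from.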

bobojin46 commented 1 year ago

I have run VEP, VAtools, and pVACseq successfully using a similar VCF file, and the INFO field in that VCF also has DP=419;ECNT=1;GERMQ=93;MBQ=20,34;MFRL=190,221;MMQ=60,60;MPOS=35;NALOD=1.73;NLOD=15.65;POPAF=6.0;TLOD, although I did get these VCF files from different sources. I don't know what the problem is. Thank you very much for helping me out.

bobojin46 commented 1 year ago

succsessfully_run.zip this is my successfully run VCF.

susannasiebert commented 1 year ago

I'm a bit confused. The successfully run VCF also fails for me when running the ref-transcript-mismatch-reporter with error: vcfpy.exceptions.IncorrectVCFFormat: Missing line starting with "#CHROM". I see extra empty lines in the header, quotation marks around headers, and extra tabs, all of which are causing problems. You say that you were able to run the ref-transcript-mismatch-reporter on this VCF?

My best guess is that these VCFs were opened and saved in Excel at some point, which added formatting that doesn't conform to the VCF specification. In general, VCF files should not be opened in Excel but only in a plain-text editor, to prevent these sorts of problems.
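A quick way to check a VCF for the header problems described above (empty lines, headers wrapped in quotation marks, extra tabs, no line starting with "#CHROM") is a small Python scan like the following. It is a rough sketch, not a VCF validator: check_vcf_header is a hypothetical name, and the checks cover only the symptoms mentioned in this thread.

```python
def check_vcf_header(lines):
    """Return (line_number, message) tuples for header problems.

    Scans from the top until the #CHROM line or the first data line.
    A line number of None means the problem is not tied to one line.
    """
    problems = []
    saw_chrom = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith("#CHROM"):
            saw_chrom = True
            if line.endswith("\t"):
                problems.append((number, "trailing tab"))
            break
        if not line.strip():
            problems.append((number, "empty line in header"))
        elif line.startswith('"'):
            problems.append((number, "line wrapped in quotation marks"))
        elif line.startswith("##"):
            if line.endswith("\t"):
                problems.append((number, "trailing tab"))
        else:
            break  # first data line reached; stop scanning
    if not saw_chrom:
        problems.append((None, 'no line starting with "#CHROM"'))
    return problems


# Hypothetical header showing the kind of damage a spreadsheet export can cause
damaged = [
    "##fileformat=VCFv4.2\n",
    "\n",
    '"##INFO=<ID=DP,Number=1,Type=Integer,Description=""Depth"">"\t\t\n',
    '"#CHROM\tPOS\tID"\n',
    "chr1\t100\t.\n",
]
for problem in check_vcf_header(damaged):
    print(problem)
```

In the damaged example the column header line begins with a quotation mark rather than with #CHROM, so the scan reports both the quoted line and the missing "#CHROM" line.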

susannasiebert commented 9 months ago

Closing this issue due to inactivity.