robertopreste / HmtNote

Human mitochondrial variants annotation using HmtVar.
https://hmtnote.readthedocs.io
MIT License

VCF file parsing #85

Closed smukh18 closed 4 years ago

smukh18 commented 4 years ago

Description

I am annotating a VCF file. However, when I look at the output CSV file, it is not parsed properly.

What I Did

hmtnote annotate R4006.vcf 1.vcf --csv


Annotating... [------------------------------------] 1% 00:14:37
/home/smukherjee/.local/lib/python3.6/site-packages/vcfpy2/parser.py:256: CannotConvertValue: 1,1,1 cannot be converted to Integer, keeping as string.
  CannotConvertValue,
Annotating... [------------------------------------] 1% 00:14:34
/home/smukherjee/.local/lib/python3.6/site-packages/vcfpy2/parser.py:256: CannotConvertValue: 1,1 cannot be converted to Integer, keeping as string.
  CannotConvertValue,
Annotating... [##----------------------------------] 8% 00:12:53
/home/smukherjee/.local/lib/python3.6/site-packages/vcfpy2/parser.py:256: CannotConvertValue: 1,1,1,1 cannot be converted to Integer, keeping as string.
  CannotConvertValue,
Annotating... [###---------------------------------] 8% 00:13:02
/home/smukherjee/.local/lib/python3.6/site-packages/vcfpy2/parser.py:256: CannotConvertValue: 1,1,1,1,1 cannot be converted to Integer, keeping as string.
  CannotConvertValue,
Annotating... [####################################] 100%
Converting annotated VCF file to CSV format...
/home/smukherjee/.local/lib/python3.6/site-packages/allel/io/vcf_read.py:1870: UserWarning: not all characters parsed for integer value; field: INFO; variant: 12 (chrMT:302)
  chunks = [d[0] for d in it]
/home/smukherjee/.local/lib/python3.6/site-packages/allel/io/vcf_read.py:1870: UserWarning: not all characters parsed for integer value; field: INFO; variant: 14 (chrMT:309)
  chunks = [d[0] for d in it]
/home/smukherjee/.local/lib/python3.6/site-packages/allel/io/vcf_read.py:1870: UserWarning: not all characters parsed for integer value; field: INFO; variant: 15 (chrMT:310)
  chunks = [d[0] for d in it]
/home/smukherjee/.local/lib/python3.6/site-packages/allel/io/vcf_read.py:1870: UserWarning: not all characters parsed for integer value; field: INFO; variant: 37 (chrMT:929)
  chunks = [d[0] for d in it]

And the CSV looks like this (only part of it is shown):

CHROM POS ID REF ALT QUAL AC AN NtVarH NtVarP
chrMT 73 . A G;.;.;.;. . 1;-1;-1;-1;-1 2 0.693878;.;.;.;. 0.48675;.;.;.;.
chrMT 152 . T C;.;.;.;. . 1;-1;-1;-1;-1 2 0.710753;.;.;.;. 0.392746;.;.;.;.
chrMT 182 . C T;.;.;.;. . 1;-1;-1;-1;-1 2 0.067017;.;.;.;. 0.017046;.;.;.;.
chrMT 185 . G T;.;.;.;. . 1;-1;-1;-1;-1 2 0.162199;.;.;.;. 0.098877;.;.;.;.
chrMT 195 . T C;.;.;.;. . 1;-1;-1;-1;-1 2 0.580496;.;.;.;. 0.300672;.;.;.;.

Any help?

robertopreste commented 4 years ago

@smukh18 thanks for the report.

Could you paste or upload a sample of the starting VCF file? It is difficult to understand what went wrong without the original file.

sreya12 commented 4 years ago

This is the start of the file:

fileformat=VCFv4.0

reference=chrRCRS

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

INFO=

INFO=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT R4016 R4016_D14V1ACXX_TTAGGC_L004_R2_001

chrMT 57 . T TC . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:1342:0.004:0.002:0.01:4;2 0
chrMT 64 . C T . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:1410:0.013:0.009:0.021:15;4 0
chrMT 79 . G T . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:1630:0.01:0.006:0.017:16;1 0
chrMT 84 . A T . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:1748:0.003:0.001:0.008:6;0 0
chrMT 90 . G T . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:1925:0.003:0.001:0.006:5;0 0
chrMT 108 . A G . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:2117:0.002:0.001:0.006:5;0 0
chrMT 146 . T C . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:2660:0.005:0.003:0.008:10;3 0
chrMT 160 . A T . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:2848:0.002:0.001:0.004:5;0 0
chrMT 231 . CATA C . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:3493:0.002:0.001:0.004:4;2 0
chrMT 234 . A C . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:3492:0.003:0.002:0.006:12;0 0
chrMT 239 . T TAAC,C . PASS AC=1,1;AN=4 GT:DP:HF:CILOW:CIUP:SDP 0/1/2:3460:0.01,0.972:0.007,0.966:0.014,0.977:13;22,1579;1784 0
chrMT 242 . C A . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:3656:0.002:0.001:0.005:9;0 0
chrMT 244 . A ACAA . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:3643:0.009:0.006:0.012:23;8 0
chrMT 263 . A G . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:3051:0.999:0.997:1.0:1505;1542 0
chrMT 285 . C CA,CAA . PASS AC=1,1;AN=4 GT:DP:HF:CILOW:CIUP:SDP 0/1/2:2879:0.026,0.003:0.021,0.001:0.033,0.006:31;45,3;5 0
chrMT 291 . A ATT,AT . PASS AC=1,1;AN=4 GT:DP:HF:CILOW:CIUP:SDP 0/1/2:2719:0.009,0.025:0.006,0.02:0.014,0.032:9;16,31;38 0
chrMT 295 . C T . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:2508:0.049:0.041:0.058:75;48 0
chrMT 296 . C T . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:2316:0.013:0.009:0.018:16;13 0
chrMT 298 . C A . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:2383:0.004:0.002:0.008:7;3 0
chrMT 299 . C CA . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:2335:0.014:0.01:0.019:11;21 0
chrMT 302 . A AC,C . PASS AC=1,1;AN=4 GT:DP:HF:CILOW:CIUP:SDP 0/1/2:1330:0.011,0.005:0.007,0.002:0.019,0.01:6;9,0;6 0
chrMT 303 . C A . PASS AC=1;AN=3 GT:DP:HF:CILOW:CIUP:SDP 0/1:2315:0.005:0.003:0.009:11;1 0
chrMT 309 . CTCCCCC CTCCCCCCT,C,TTCCCCC . PASS AC=1,1,1;AN=5 GT:DP:HF:CILOW:CIUP:SDP 0/1/2/3:1871:0.007,0.004,0.003:0.004,0.002,0.001:0.013,0.008,0.007:11;3,0;9,4;2 0

This is the output VCF from MToolBox. The parsing is incorrect, and I am also getting errors in the CSV output.

robertopreste commented 4 years ago

Hello @sreya12, when I try to annotate the VCF file you posted, I do not get the same error that @smukh18 reported. In this case, the issue is that spaces are used instead of tabs as the column delimiter. When I replace the spaces with tabs, the file is annotated correctly.
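
For reference, the space-to-tab replacement could be sketched roughly as follows. This is a minimal illustration, not part of HmtNote: the names `retab_vcf_line` and `retab_vcf` are hypothetical, and it assumes that no field in the data lines legitimately contains a space (true for the records posted above, but not guaranteed for every VCF file). Meta-information lines starting with `##` are left untouched.

```python
def retab_vcf_line(line: str) -> str:
    """Return one VCF line with runs of whitespace replaced by single tabs.

    Hypothetical helper: assumes no individual field contains a space.
    """
    text = line.rstrip("\n")
    # Meta-information lines (##...) and blank lines are kept as they are.
    if text.startswith("##") or not text.strip():
        return text
    # Header (#CHROM ...) and data lines: split on any whitespace, rejoin with tabs.
    return "\t".join(text.split())


def retab_vcf(src_path: str, dst_path: str) -> None:
    """Write a tab-delimited copy of the file at src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(retab_vcf_line(line) + "\n")
```

A call such as `retab_vcf("R4006.vcf", "R4006.tabs.vcf")` would write the converted copy next to the original, where the output name is only an example; the original file is not modified.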

sreya12 commented 4 years ago

This is VCF output taken directly from MToolBox. How should I change the delimiter for each file? Can you suggest anything?

robertopreste commented 4 years ago

I am not aware of any whitespace-related issues with MToolBox; it should output VCF files in the correct format. The issue could be due to copying and pasting the file here. In any case, I suggest you run a VCF validator (many are available, both online and as standalone tools) to make sure that your file is in the correct format.
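
Before reaching for a full validator, a quick check for the delimiter problem alone could look like the sketch below. It is not a substitute for a real VCF validator: the function name `check_tab_columns` is hypothetical, and the only rules it applies are that the `#CHROM` header and every data line have at least the 8 fixed VCF columns when split on tabs, and that data lines have the same number of columns as the header.

```python
def check_tab_columns(lines):
    """Return (line_number, n_fields) for lines that look wrongly delimited.

    Hypothetical helper: it only counts tab-separated fields and does not
    check field contents.
    """
    expected = None  # column count taken from the #CHROM header, once seen
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        # Skip blank lines and ## meta-information lines.
        if not text or text.startswith("##"):
            continue
        n_fields = len(text.split("\t"))
        if text.startswith("#"):
            # Column header line: remember its width for the data lines.
            expected = n_fields
            if n_fields < 8:
                problems.append((number, n_fields))
            continue
        if n_fields < 8 or (expected is not None and n_fields != expected):
            problems.append((number, n_fields))
    return problems
```

Passing an open file, for example `check_tab_columns(open("R4006.vcf"))`, returns an empty list when every line has the expected number of tab-separated columns. A space-delimited data line shows up with a field count of 1.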

robertopreste commented 4 years ago

Closing as not related to HmtNote.