ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License

VCF file is missing mandatory header line ("#CHROM...") error, inconsistent error timing #16

Closed: peterdfields closed this issue 3 years ago

peterdfields commented 3 years ago

Hi @ksamuk

I'm trying to use pixy to calculate pi on a relatively small dataset (8 diploid individuals). Making an initial test run I used the following command:

pixy --stats pi --vcf for_theta.vcf.gz --zarr_path /theta_for_pw/zarr/ --window_size 10000 --populations pop.txt --bypass_filtration yes --outfile_prefix output/pixy_out

The VCF was generated using the instructions provided in the online tutorial for bcftools. The error I see is the following:

Traceback (most recent call last):
  File "/home/peter/miniconda3/envs/vcf-manip/bin/pixy", line 11, in <module>
    sys.exit(main())
  File "/home/peter/miniconda3/envs/vcf-manip/lib/python3.8/site-packages/pixy/__main__.py", line 293, in main
    allel.vcf_to_zarr(args.vcf, zarr_path, region=targ_region, fields='*', overwrite=True)
  File "/home/peter/miniconda3/envs/vcf-manip/lib/python3.8/site-packages/allel/io/vcf_read.py", line 918, in vcf_to_zarr
    fields, samples, headers, it = iter_vcf_chunks(
  File "/home/peter/miniconda3/envs/vcf-manip/lib/python3.8/site-packages/allel/io/vcf_read.py", line 1138, in iter_vcf_chunks
    fields, samples, headers, it = _iter_vcf_stream(stream, **kwds)
  File "/home/peter/miniconda3/envs/vcf-manip/lib/python3.8/site-packages/allel/io/vcf_read.py", line 1636, in _iter_vcf_stream
    headers = _read_vcf_headers(stream)
  File "/home/peter/miniconda3/envs/vcf-manip/lib/python3.8/site-packages/allel/io/vcf_read.py", line 1763, in _read_vcf_headers
    raise RuntimeError('VCF file is missing mandatory header line ("#CHROM...")')
RuntimeError: VCF file is missing mandatory header line ("#CHROM...")

Curiously, each time I re-run this command, the error arises after a different number of contigs has been processed, so I'm not entirely sure what might be going wrong. Please let me know if any additional information would be helpful to troubleshoot this error.
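One way to rule out a VCF that really lacks its header is to read the file directly, independent of pixy and scikit-allel. The following is only a minimal sketch: the function name `find_chrom_header` and the `max_lines` cutoff are hypothetical and are not part of pixy or scikit-allel. It relies on Python's standard `gzip` module, which can also read bgzip-compressed files.

```python
import gzip


def find_chrom_header(vcf_path, max_lines=100_000):
    """Return the 1-based line number of the '#CHROM' line, or None if absent.

    Hypothetical helper for illustration; not part of pixy or scikit-allel.
    """
    # Pick a reader based on the file extension (.gz covers gzip and bgzip).
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line_number, line in enumerate(handle, start=1):
            if line.startswith("#CHROM"):
                return line_number
            if not line.startswith("#"):
                # Reached the first data record without seeing the header line.
                return None
            if line_number >= max_lines:
                # Give up after an arbitrary (assumed) number of meta lines.
                return None
    return None
```

If this returns a line number for the whole file, the header is present, and the error more likely comes from how individual regions are being read than from the file contents.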

ksamuk commented 3 years ago

Hi @peterdfields, that does seem odd! That looks like an error from scikit-allel (which pixy uses under the hood), which is discussed here: http://alimanfoo.github.io/2017/06/14/read-vcf.html.

This section seems to address that error specifically:

If you get an error message like “RuntimeError: VCF file is missing mandatory header line (“#CHROM…”)” then check your tabix version and upgrade if necessary. If you have conda installed, a recent version of tabix can be installed via the following command: conda install -c bioconda htslib.

Hope that helps, let me know if the error persists and we can try to troubleshoot it more.

peterdfields commented 3 years ago

Hi @ksamuk

Thank you for getting back to me. I updated htslib to 1.11 and restarted the pixy analysis. The analysis got further through the reference than before (though it has seemingly gone further each time I've run the command), but the error arose again. I hadn't indexed the vcf.gz file with tabix, so I did that and restarted pixy, though I guess that probably isn't the issue, as the analysis was proceeding before. Anyway, let me know if you have any other suggestions I should try or if additional info would be useful.

ksamuk commented 3 years ago

Hi @peterdfields, sorry to hear this still isn't working. Are you getting the same error message as before? If you'd be willing to send me your VCF (the whole thing, or in part), I can see if I can get it working on my end. You can post a link here, or reach me at ksamuk@gmail.com (sharing the VCF via dropbox or the like might be easiest).

peterdfields commented 3 years ago

Hi @ksamuk, after I indexed the .vcf.gz file with tabix, the command both completed and ran considerably faster. I may be missing instructions in the materials, and I realize it's probably obvious to a lot of users, but it might be worthwhile to state explicitly in the tutorial materials that the index is needed for working with the compressed VCF file. Thank you again for your help and for a great piece of software! I'll go ahead and close this issue now.
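For anyone hitting the same error, a pre-flight check along these lines can flag a missing index before starting a long run. This is a sketch under stated assumptions: the helper names (`looks_like_bgzf`, `find_index`, `preflight`) are hypothetical and not part of pixy, and the compression check is an extra step not discussed above, included because tabix can only index bgzip-compressed files. It looks for the standard BGZF signature (gzip magic with the extra-field flag set, followed by a `BC` subfield) and for a `.tbi` or `.csi` file next to the VCF.

```python
import os

# First four bytes of a BGZF block: gzip magic, deflate, FEXTRA flag set.
BGZF_MAGIC = b"\x1f\x8b\x08\x04"


def looks_like_bgzf(path):
    """Return True if the file starts with a BGZF (bgzip) block header."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    # Bytes 12-13 hold the 'BC' subfield identifier in files written by bgzip.
    return len(header) >= 14 and header[:4] == BGZF_MAGIC and header[12:14] == b"BC"


def find_index(vcf_path):
    """Return the path of a .tbi or .csi index beside the VCF, or None."""
    for suffix in (".tbi", ".csi"):
        candidate = str(vcf_path) + suffix
        if os.path.exists(candidate):
            return candidate
    return None


def preflight(vcf_path):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not looks_like_bgzf(vcf_path):
        problems.append("not bgzip-compressed (plain gzip or uncompressed)")
    if find_index(vcf_path) is None:
        problems.append("no .tbi or .csi index next to the file")
    return problems
```

If `preflight("for_theta.vcf.gz")` reports a missing index, running `tabix -p vcf for_theta.vcf.gz` creates one, which is the step that resolved the error here.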

ksamuk commented 3 years ago

Thanks for this response, @peterdfields! The next version of pixy (to be released very soon, a performance update) requires a compressed vcf and tabix, and we'll definitely make sure to emphasize that requirement in the docs.