ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
115 stars 14 forks source link

ValueError: invalid literal for int() with base 10: '' #87

Closed jbruxaux closed 9 months ago

jbruxaux commented 1 year ago

Describe the bug

I am currently running pixy on a very long genome (several Gb), with 200 individuals in one pop. Since running any analysis on this huge vcf file is usually not doable, I called SNPs on several parts of the genome independently (around 50 vcf files). For most vcf files, pixy runs without issue with the site file. But for some, when I run the command below, I get this error message: ValueError: invalid literal for int() with base 10: '' I tried with different windows values (100 000, 10 000, 100). It didn't change anything.

After re-running the command with the --debug flag, I get this:

[pixy] pixy 1.2.7.beta1
[pixy] See documentation at https://pixy.readthedocs.io/en/latest/
/home/j/jbruxaux/miniconda3/envs/pixy/lib/python3.8/site-packages/allel/io/vcf_read.py:1732: UserWarning: invalid INFO header: '##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">\n'
  warnings.warn('invalid INFO header: %r' % header)

[pixy] Validating VCF and input parameters...
[pixy] Checking write access...OK
[pixy] Checking CPU configuration...OK
[pixy] Checking for invariant sites...OK
[pixy] Checking chromosome data...OK
[pixy] Checking intervals/sites...WARNING
[pixy] WARNING: the following chromosomes in the sites file do not occur in the VCF and will be ignored: [ very long list of chromosomes ]
[pixy] Checking sample data...OK
[pixy] All initial checks past!

[pixy] Preparing for calculation of summary statistics: pi
[pixy] Data set contains 1 population(s), 1 chromosome(s), and 195 sample(s)
[pixy] Window size: 100000 bp
[pixy] Calculations restricted to sites in: fold4_split_500Mb

[pixy] Started calculations at 16:21:44 on 2023-08-29
[pixy] Using 1 out of 28 available CPU cores

[pixy] Processing chromosome/contig chr2:2000000001-...
[pixy] Calculating statistics for region chr2:2000000001-:37076-317391406...
/home/j/jbruxaux/miniconda3/envs/pixy/lib/python3.8/site-packages/allel/io/vcf_read.py:1732: UserWarning: invalid INFO header: '##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">\n'
  warnings.warn('invalid INFO header: %r' % header)
Traceback (most recent call last):
  File "/home/j/jbruxaux/miniconda3/envs/pixy/bin/pixy", line 11, in <module>
    sys.exit(main())
  File "/home/j/jbruxaux/miniconda3/envs/pixy/lib/python3.8/site-packages/pixy/__main__.py", line 312, in main
    pixy.core.compute_summary_stats(args, popnames, popindices, temp_file, chromosome, chunk_pos_1, chunk_pos_2, window_list_chunk, q, sites_list_chunk, aggregate, args.window_size)
  File "/home/j/jbruxaux/miniconda3/envs/pixy/lib/python3.8/site-packages/pixy/core.py", line 210, in compute_summary_stats
    callset_is_none, gt_array, pos_array = read_and_filter_genotypes(args, chromosome, chunk_pos_1, chunk_pos_2, sites_list_chunk)
  File "/home/j/jbruxaux/miniconda3/envs/pixy/lib/python3.8/site-packages/pixy/core.py", line 164, in read_and_filter_genotypes
    callset = allel.read_vcf(args.vcf, region = window_region, fields = ['CHROM', 'POS', 'calldata/GT', 'variants/is_snp', 'variants/numalt'])
  File "/home/j/jbruxaux/miniconda3/envs/pixy/lib/python3.8/site-packages/allel/io/vcf_read.py", line 304, in read_vcf
    fields, samples, headers, it = iter_vcf_chunks(
  File "/home/j/jbruxaux/miniconda3/envs/pixy/lib/python3.8/site-packages/allel/io/vcf_read.py", line 1138, in iter_vcf_chunks
    fields, samples, headers, it = _iter_vcf_stream(stream, **kwds)
  File "/home/j/jbruxaux/miniconda3/envs/pixy/lib/python3.8/site-packages/allel/io/vcf_read.py", line 1672, in _iter_vcf_stream
    chunks = VCFChunkIterator(
  File "allel/opt/io_vcf_read.pyx", line 460, in allel.opt.io_vcf_read.VCFChunkIterator.__init__
  File "allel/opt/io_vcf_read.pyx", line 516, in allel.opt.io_vcf_read.VCFParser.__init__
  File "allel/opt/io_vcf_read.pyx", line 546, in allel.opt.io_vcf_read.VCFParser._init_region
ValueError: invalid literal for int() with base 10: ''

The warning on the chromosomes absent in the vcf is expected, since with site file covers the entire genome, and the vcf only part of it.

A reproducible example of the bug

(1) The full command you used to run pixy, including all arguments pixy --stats pi --vcf $file --populations ind_pop.txt --window_size 100000 --n_cores 8 --output_folder pixy_blister_rust --output_prefix ${short}_noMAF_pi4 --sites_file fold4_split_500Mb

(2) A subset of your VCF (the smallest subset needed to demonstrate the problem) subset vcf file (first 10 000 SNPs): https://umeauniversity-my.sharepoint.com/:u:/g/personal/jabr0052_ad_umu_se/EVsIHi1rn1BFkSKEqSiI4_wBqTvGcBMp5qP4P2batOQRiQ?e=kTbwLC

(3) The full populations file pop file: https://umeauniversity-my.sharepoint.com/:t:/g/personal/jabr0052_ad_umu_se/EW89hveFWNRGnF7WGtrClwkBrKzsI8kC625zNUjsR7VXLA?e=44QNDa

(4) Any other files needed to reproduce the problem (sites, bed file, etc.) site file (355Mb): https://umeauniversity-my.sharepoint.com/:u:/g/personal/jabr0052_ad_umu_se/EVtjDWA1uZlOqClU6H2HphcBSrYjl3cnXsg9gmbPkTNtNA?e=Tax5FI

OS information Cluster with Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-159-generic x86_64)

jbruxaux commented 12 months ago

So, I also solve that issue. It all came from the chromosome names which had either : or - or both. Once I renamed all my chromosome names in all the vcf files and the site files, it worked without any issue.

I think it would be worth either allowing the use of any character in the chromosome name or at least warn the users of that issue.

ksamuk commented 12 months ago

Thanks for letting me know and we'll address this in the next update!