ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License
115 stars 14 forks source link

IndexError: list index out of range #69

Closed Pcariman closed 1 year ago

Pcariman commented 1 year ago

Hi ! I am working with whole genome data of 20 individuals from two populations, to avoid memory limitation issues I generated a VCF Allsites file for chromosome/contig 'Sc1nRTI_147;HRSCAF=971'. When running pixy I got the following error message:

pixy --stats pi fst dxy --vcf 2_concat_allsites.vcf.gz --populations popfile.txt --window_size 10000 --n_cores 4 --chromosomes 'Sc1nRTI_147;HRSCAF=971' [pixy] pixy 1.2.7.beta1 [pixy] See documentation at https://pixy.readthedocs.io/en/latest/

[pixy] Validating VCF and input parameters... [pixy] Checking write access...OK [pixy] Checking CPU configuration...OK [pixy] Checking for invariant sites...OK [pixy] Checking chromosome data...OK [pixy] Checking intervals/sites...OK [pixy] Checking sample data...OK [pixy] All initial checks past!

[pixy] Preparing for calculation of summary statistics: pi, fst, dxy [pixy] Using Weir and Cockerham (1984)'s estimator of FST. [pixy] Data set contains 2 population(s), 1 chromosome(s), and 20 sample(s) [pixy] Window size: 10000 bp

[pixy] Started calculations at 12:00:38 on 2023-01-23 [pixy] Using 10 out of 48 available CPU cores

[pixy] Processing chromosome/contig Sc1nRTI_147;HRSCAF=971... IndexError: list index out of range

A reproducible example of the problem If your question is about how your dataset is specifically being processed by pixy, please copy and paste the following: (1) The command I used to run pixy is: pixy --stats pi fst dxy --vcf 2_concat_allsites.vcf.gz --populations popfile.txt --window_size 10000 --n_cores 4 --chromosomes 'Sc1nRTI_147;HRSCAF=971' (2) A subset of my VCF is: The uncompressed VCF file weighs 6.2 G

fileformat=VCFv4.2

FILTER=

bcftoolsVersion=1.10.2+htslib-1.10.2-3ubuntu0.1

bcftoolsCommand=mpileup --threads 30 -Oz --output 1_pileup_sar_te.vcf.gz -f ../../../genoma_ref/jasmine-uni1728-mb-hirise-3bs35_08-29-2020__hic_output.fasta --bam-list SARTE.txt -r Sc1nRTI_147;HRSCAF=971

reference=file://../../../genoma_ref/jasmine-uni1728-mb-hirise-3bs35_08-29-2020__hic_output.fasta

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

. .

contig=

. .

contig=

ALT=

INFO=

INFO=

INFO=

INFO=

INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

FORMAT=

FORMAT=

FORMAT=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

bcftools_callVersion=1.10.2+htslib-1.10.2-3ubuntu0.1

bcftools_callCommand=call -m -f GQ -Oz -o 1_called_sar_te.vcf.gz 1_pileup_sar_te.vcf.gz; Date=Wed Jan 11 11:35:23 2023

bcftools_filterVersion=1.10.2+htslib-1.10.2-3ubuntu0.1

bcftools_filterCommand=filter -Oz -o 1.1_CALLED_DP.vcf.gz -i 'DP >= 4 && DP <= 500' ../1_called_sar_te_GL02.vcf.gz; Date=Mon Jan 16 10:48:12 2023

bcftools_concatVersion=1.10.2+htslib-1.10.2-3ubuntu0.1

bcftools_concatCommand=concat --allow-overlaps -O z -o 3_concat_allsites_sar_te.vcf.gz invariants.vcf.gz variants.vcf.gz; Date=Mon Jan 16 11:39:16 2023

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sar_TE02_aln_sorted.bam Sar_TE03_aln_sorted.bam Sar_TE04_aln_sorted.bam Sar_TE05_aln_sorted.bam Sar_TE06_aln_sorted.bam Sar_TE07_aln_sorted.bam Sar_TE11_aln_sorted.bam Sar_TE12_aln_sorted.bam Sar_TE14_aln_sorted.bam Sar_TE16_aln_sorted.bam

Sc1nRTI_147;HRSCAF=971 7 . A . 74 PASS . GT 0/0 0/0 0/0 0/0 ./. 0/0 0/0 0/0 0/0 ./. Sc1nRTI_147;HRSCAF=971 8 . A . 81 PASS . GT 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. Sc1nRTI_147;HRSCAF=971 9 . A . 81 PASS . GT 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. Sc1nRTI_147;HRSCAF=971 10 . A . 86 PASS . GT 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. Sc1nRTI_147;HRSCAF=971 11 . T . 88 PASS . GT 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. . . . Sc1nRTI_147;HRSCAF=971 598 . C T 999 PASS . GT:PL:GQ ./.:0,0,0:0 1/1:255,54,0:61 1/1:84,9,0:16 1/1:99,9,0:16 0/1:105,0,75:67 1/1:90,6,0:13 1/1:148,15,0:22 1/1:116,9,0:16 0/1:31,0,13:6 1/1:165,21,0:28 Sc1nRTI_147;HRSCAF=971 599 . C . 999 PASS . GT ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 Sc1nRTI_147;HRSCAF=971 600 . C T 129 PASS . GT:PL:GQ ./.:0,0,0:0 0/1:89,0,255:84 0/0:0,15,136:19 0/0:0,9,102:14 0/0:0,24,192:28 0/1:90,6,0:6 0/0:0,15,151:19 0/0:0,9,134:14 0/0:0,6,52:11 0/0:0,24,177:28 . . .

(2) and the full populations file, separated by tabs where the first column contains sample names (exactly as represented in the VCF) and the second column contains population names: $popfile.txt Ssc_TE01_aln_sorted.bam TE Ssc_TE02_aln_sorted.bam TE Ssc_TE03_aln_sorted.bam TE Ssc_TE04_aln_sorted.bam TE Ssc_TE05_aln_sorted.bam TE Ssc_TE06_aln_sorted.bam TE Ssc_TE07_aln_sorted.bam TE Ssc_TE08_aln_sorted.bam TE Ssc_TE11_aln_sorted.bam TE Ssc_TE15_aln_sorted.bam TE Ssc_CN01_aln_sorted.bam CN Ssc_CN02_aln_sorted.bam CN Ssc_CN04_aln_sorted.bam CN Ssc_CN06_aln_sorted.bam CN Ssc_CN08_aln_sorted.bam CN Ssc_CN09_aln_sorted.bam CN Ssc_CN10_aln_sorted.bam CN Ssc_CN11_aln_sorted.bam CN Ssc_CN12_aln_sorted.bam CN Ssc_CN13_aln_sorted.bam CN

(3) I generated the AllSites VCF file using BCFtools, as listed in the pixy guide (https://pixy.readthedocs.io/en/latest/generating_invar/generating_invar.html).

Best regards,

ksamuk commented 1 year ago

Hi there, this error seems to associate with complex contig names containing non-alphanumeric characters. We're working on a fix for this, but in the meantime, you could try removing the non-alphanumeric characters from your contig names.

Pcariman commented 1 year ago

Hi, thanks for your response. I solved my problem by changing the character (;) of the contig to (_). I used the following command lines:

$bcftools query -f '%CHROM\n' myvcf.vcf.gz > chrom.txt $cat chrome.txt | awk -F ";" '{print $1";"$2" "$1"_"$2}' >> chrom_mod.txt $bcftools annotate --rename-chrs chrom_mod.txt myvcf.vcf.gz | bgzip > my_vcf_ready_to_pixy.vcf.gz