ksamuk / pixy

Software for painlessly estimating average nucleotide diversity within and between populations
https://pixy.readthedocs.io/
MIT License

Runtimes #8

Closed: merondun closed this issue 3 years ago

merondun commented 4 years ago

Hey, appreciate the software. I've been hitting some snags with runtimes and I'm wondering if you have any parallelization or scatter/gather recommendations.

I have a GATK 4.x gVCF with 30 individuals against a 1.1 Gb chromosome-level genome (28 scaffolds); the gVCF is about 37 GB on disk.

I first tried pixy v0.95.0 with default variant filtration on a single chromosome (around 70 MB), and it ran for almost 48 hours on 3 CPUs before failing due to memory. I then tried it on 20 CPUs (120 GB RAM) and it failed due to memory again.

Then I tried filtering the gVCF beforehand and dividing the genome into 25 MB bed intervals. But the zarr database is created per chromosome, so if I run two jobs for the same chromosome I get errors about a pre-existing directory for that chromosome.

My workaround for now is to create the zarr database in the scratch directory so that jobs can be parallelized over these bed intervals, but obviously this is not very efficient, because the zarr database takes many hours to create each time...
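
Intervals like these can be generated from the genome's .fai index with something like the following (a sketch; the file names are placeholders, and it assumes samtools faidx has already been run and that bedtools and GNU split are available):

#sketch: build 25 MB windows from the genome index, then write one bed file per interval
cut -f1,2 genome.fasta.fai > genome.txt
bedtools makewindows -g genome.txt -w 25000000 > windows_25mb.bed
split -l 1 -d --additional-suffix=.bed windows_25mb.bed interval_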

For clarity, I provide a bed file like this:

chr1    50000000        75000000

And run pixy like this, specifying the prefix of the bed file:

lists=/crex/proj/uppstore2017162/b2016151_nobackup/justin/crow_reseq/interval_files

#bed file prefix passed as a positional argument
file=$1
chr=$(awk '{print $1}' ${lists}/${file}.bed)
start=$(awk '{print $2}' ${lists}/${file}.bed)
end=$(awk '{print $3}' ${lists}/${file}.bed)

#per-job zarr directory on local scratch, matching --zarr_path below
mkdir -p $SNIC_TMP/data/zarr/${file}

pixy --stats pi fst dxy \
--vcf crow.popseq.SNPs.filtered.g.vcf.gz \
--zarr_path $SNIC_TMP/data/zarr/${file} \
--chromosomes ${chr} \
--window_size 250 \
--bypass_filtration yes \
--populations SPECIES_CROW.txt \
--outfile_prefix output/${file}_out

In general this works okay, but it takes around 38 hours on 2 CPUs to finish a 25 MB region. With roughly 50 such intervals across the genome, that works out to several thousand CPU hours to calculate these metrics... which can be done with alternative software in a few dozen CPU hours.

Do you have any recommendations on how to create this zarr database efficiently in the first place? My data are not very large and the genome is comparatively small, so I'm surprised by these runtimes.

Thanks!

kkorunes commented 4 years ago

Hi @merondun, thanks for the interest! Those runtimes are definitely impractically long. Am I understanding correctly that you're using a gVCF as input? Pixy is set up to run on 'all sites' VCFs, i.e. VCFs generated by jointly genotyping over your samples with invariant sites included. Are you able to try using the “-all-sites” flag of GenotypeGVCFs to generate an ‘all sites’ VCF for input into pixy?

merondun commented 4 years ago

Hey @kkorunes, of course -- it includes all sites. Here is my workflow so far, which gives me a VCF with ~900 million sites (1.1 Gb genome). On the larger chromosomes (>100 MB), pixy appears to hang for days at a time -- presumably while creating the 'zarr' database, which I'm unfamiliar with. I still haven't been able to finish any job on a chromosome > 50 MB when I choose a small window size (we need ~250 bp windows for some fine-scale cis- work right now).

I've had this run to completion with 10 kb windows, and the results corroborate results from elsewhere, but it still took ~38 hours per job (52 jobs, one per 25 MB interval) with 3 CPUs each.

#Combine Raw gVCFs
wd=/crex/proj/uppstore2017162/b2016151_nobackup/justin/crow_reseq
genome=/crex/proj/uppstore2019047/Genomes/crow5.6/Corvus_cornix__S_Up_H32_v5.6_polished.fasta
lists=/crex/proj/uppstore2017162/b2016151_nobackup/justin/crow_reseq/interval_files

chrom=$1

#gather gatk raw gvcf
gatk GenomicsDBImport -V D_Ko_C02.g.vcf.gz -V D_Ko_C04.g.vcf.gz -V D_Ko_C05.g.vcf.gz -V D_Ko_C08.g.vcf.gz -V D_Ko_C11.g.vcf.gz -V D_Ko_C13.g.vcf.gz -V D_Ko_C15.g.vcf.gz -V D_Ko_C19.g.vcf.gz -V D_Ko_C20.g.vcf.gz -V D_Ra_C01.g.vcf.gz -V D_Ra_C05.g.vcf.gz -V D_Ra_C06.g.vcf.gz -V D_Ra_C11.g.vcf.gz -V D_Ra_C14.g.vcf.gz -V D_Ra_C16.g.vcf.gz -V S_Ri_H05.g.vcf.gz -V S_Ri_H07.g.vcf.gz -V S_Ri_H23.g.vcf.gz -V S_Ri_H29.g.vcf.gz -V S_Ri_H43.g.vcf.gz -V S_Up_H03.g.vcf.gz -V S_Up_H09.g.vcf.gz -V S_Up_H16.g.vcf.gz -V S_Up_H24.g.vcf.gz -V S_Up_H29.g.vcf.gz -V S_Up_H32.g.vcf.gz -V S_Up_H37.g.vcf.gz -V S_Up_H43.g.vcf.gz -V S_Up_H47.g.vcf.gz -V S_Up_H51.g.vcf.gz -V S_Up_H52.g.vcf.gz --genomicsdb-workspace-path crow_species/${chrom} \
        -L ${lists}/${chrom}.bed \
        --merge-input-intervals

#genotype gvcf
gatk GenotypeGVCFs -R ${genome} -L ${lists}/${chrom}.bed -V gendb://crow_species/${chrom} -O ../invariant_sites/species_${chrom}.vcf.gz -new-qual -all-sites -ploidy 2

#merge sites
bcftools concat --threads 10 -Oz -o crow.popseq.raw.vcf.gz species_*.vcf.gz

#retain only biallelic SNPs
bcftools view -v snps -M 2 crow.popseq.raw.vcf.gz -Oz -o crow.popseq.RAW.SNP.vcf.gz
tabix crow.popseq.RAW.SNP.vcf.gz

#subset invariant sites
bcftools view --threads 10 --max-alleles 1 -Oz -o crow.popseq.RAW.invariant.SNP.vcf.gz crow.popseq.RAW.SNP.vcf.gz
tabix crow.popseq.RAW.invariant.SNP.vcf.gz

#subset variant sites 
bcftools view --threads 10 --min-alleles 2 -Oz -o crow.popseq.RAW.variant.SNP.vcf.gz crow.popseq.RAW.SNP.vcf.gz
tabix crow.popseq.RAW.variant.SNP.vcf.gz

### { FILTERING SCRIPTS SEPARATE FOR VARIANT AND INVARIANT SITES, RE-MERGE INTO SINGLE GVCF } ### 
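
The re-merge step might look something like this (a sketch; the filtered input names are placeholders for the outputs of those filtering scripts, and the output name is the file passed to pixy below):

#re-merge the separately filtered variant and invariant sites into one all-sites VCF
tabix crow.popseq.variant.filtered.vcf.gz
tabix crow.popseq.invariant.filtered.vcf.gz
bcftools concat --allow-overlaps --threads 10 -Oz -o crow.popseq.SNPs.filtered.vcf.gz \
        crow.popseq.variant.filtered.vcf.gz crow.popseq.invariant.filtered.vcf.gz
tabix crow.popseq.SNPs.filtered.vcf.gz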

And then pixy:

lists=/crex/proj/uppstore2017162/b2016151_nobackup/justin/crow_reseq/interval_files

#bed file prefix passed as a positional argument
file=$1
chr=$(awk '{print $1}' ${lists}/${file}.bed)
start=$(awk '{print $2}' ${lists}/${file}.bed)
end=$(awk '{print $3}' ${lists}/${file}.bed)

#per-job zarr directory on local scratch, matching --zarr_path below
mkdir -p $SNIC_TMP/data/zarr/${file}

pixy --stats pi fst dxy \
--vcf crow.popseq.SNPs.filtered.vcf.gz \
--zarr_path $SNIC_TMP/data/zarr/${file} \
--chromosomes ${chr} \
--window_size 250 \
--bypass_filtration yes \
--populations SPECIES_CROW.txt \
--outfile_prefix output/${file}_out

kkorunes commented 4 years ago

@merondun, got it, thank you for the additional details. In your first post I noticed that your input file had a g.vcf extension, so I wanted to make sure you were using an all-sites VCF rather than a gVCF. Would you be able to attach a sample of your data? That way we could try to replicate your commands and pinpoint anything particularly inefficient pixy might be doing in your particular scenario.

Thanks!

merondun commented 3 years ago

Hey again. I just use g.vcf to differentiate my all-sites files from my variant-only files; it's a relic of some old GATK workflows and nomenclature.

I noticed in another issue that you're planning to rework the zarr database, so I'll consider this closed, since I think that will change this issue completely. It would be difficult to share the data for diagnostics, since even a subsetted chromosome 1 is 3.5 GB (30 individuals, 79.4 million sites passing filters). I subsetted the whole-genome VCF down to chr1 (since zarr seemed to really slow down on the large scaffolds) and then ran jobs on the first 25 MB only, on the interval between 75 and 100 MB, and on the entire chromosome.

Chromosome 1 in full took a little over 3 days to run and didn't seem to use more than 1.5 cores (n = 30, ~80 million sites). I've also run it with ~10-20 cores for shorter times, but it didn't use them. I ran the 75-100 MB interval to make sure pixy wasn't creating the zarr database for all sites leading up to the interval, which seems fine for now. The 25 MB region ran for a little under 3 hours (roughly a 120 hour whole-genome calculation for n = 30 and a 1 Gb genome).


For the interval jobs, I have the genome divided into individual 25 MB bed files, and I then submit batch jobs with the bed file prefix as a positional argument to the script, like this:

#!/bin/bash

#SBATCH -p core
#SBATCH -n 3
#SBATCH -t 100:00:00

#bed file prefix passed as a positional argument
file=$1
chr=$(awk '{print $1}' ${file}.bed)
starts=$(awk '{print $2}' ${file}.bed)
ends=$(awk '{print $3}' ${file}.bed)

#per-job zarr directory on local scratch, matching --zarr_path below
mkdir -p $SNIC_TMP/data/zarr/${file}

pixy --stats pi fst dxy \
--vcf crow.popseq.SNPs.filtered.chr1.vcf.gz \
--zarr_path $SNIC_TMP/data/zarr/${file} \
--chromosomes ${chr} \
--interval_start ${starts} \
--interval_end ${ends} \
--window_size 250 \
--bypass_filtration yes \
--populations SPECIES_CROW.txt \
--outfile_prefix output/${file}_out

With the bed file simply:

cat Chr1_25mb.bed
chr1    0       25000000
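
Jobs can then be submitted one per interval with something like the following (a sketch; the script name is just a placeholder, and ${lists} is the interval_files directory from above):

#submit one pixy job per 25 MB bed interval
cd ${lists}
for bed in *.bed; do
        sbatch run_pixy_interval.sh ${bed%.bed}
done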

So, your recommendation to divide it into regions works fine, although it takes a little extra legwork. Thanks!