odelaneau / shapeit5

Segmented HAPlotype Estimation and Imputation Tool
https://odelaneau.github.io/shapeit5/
MIT License
56 stars 9 forks source link

Reproducible results #58

Open zhangpicb opened 10 months ago

zhangpicb commented 10 months ago

Dear @jaredo @diogomribeiro @odelaneau @srubinacci @rwk-unil

Thanks for this great tools.

I find that the shapeit5 outputs are different when I use same input data with same default seed 15052011.

How to make the shapeit5 results reproducible?

working_dir=$1
CUT=0.001
THREADS=64

mkdir -p ${working_dir}/output

for CHR in {1..22}; do
        MAP=/tools/shapeit5/resources/maps/b38/chr${CHR}.b38.gmap.gz
        VCF=${working_dir}/chrsplit/chr${CHR}.vcf.gz
        OUT=${working_dir}/output/chr${CHR}.${CUT}.bcf
        LOG=${working_dir}/output/chr${CHR}.${CUT}.log
        TIM=${working_dir}/output/chr${CHR}.${CUT}.time
        phase_common_static --input $VCF --map $MAP --output $OUT --thread $THREADS --log $LOG --filter-maf $CUT --region chr${CHR}  --output-format bcf && bcftools index -f $OUT --threads $THREADS

        wait
        OUTVCF=${working_dir}/output/chr${CHR}.vcf.gz
        bcftools convert ${OUT} -Oz -o ${OUTVCF}
        tabix -p vcf ${OUTVCF}

done

By the way, Impute5 output are same when I use same input .

Thanks in advanced!

Best, zhang

srubinacci commented 10 months ago

Hi,

This is expected behaviour. Even though the RNG has been set with the right seed, you are using multi-threading causing the RNG to be accessed independently and unpredictably by multiple threads. If you analysis needs to be reproducible, you should run SHAPEIT5 with one thread.

The reason why IMPUTE5 has always the same output is because the SNP array imputation is actually deterministic and not stochastic (so no calls to the RNG).

Hope this helps, Simone

zhangpicb commented 10 months ago

Dear @srubinacci

Thanks for your reply!

I set THREADS=1 in my scripts,and still get 2 different results after imputation.

Thanks in advanced!

Best, zhang

srubinacci commented 10 months ago

Hi,

Happy to help. Do you mean after phasing, with threads=1? This should not be the case, if the same seed is set. We always use the same RNG, so the stochastic calls should be reproducible now.

Can you please check carefully and if that is the case in your tests, and if so, report a reproducible example with some public data if that is the case? In my experiments, I have never experienced this problem with single thread.

Thanks

Simone

zhangpicb commented 10 months ago

Dear @srubinacci @jaredo @diogomribeiro @odelaneau

This is reproducible problems with 1KGP data.

unphased chr22 data download from https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/cgi_variant_calls/

This is my code. And same Linux server get 2 different results.

nohup phase_common_static --input 1000genomes/20130502_supporting/cgi_variant_calls/filtered_calls/ALL.chr22.cgi_multi_sample.20130725.snps_indels.high_coverage_cgi.normalized.uniq.genotypes.gtonly.cr90.ic10.vcf.gz \
--map /tools/shapeit5/resources/maps/b37/chr22.b37.gmap.gz --output chr22.bcf --thread 1 --log chr22.log --filter-maf 0.001 \
--region 22  --output-format bcf && bcftools index -f chr22.bcf --threads 1 &

bcftools convert chr22.bcf -Oz -o chr22.vcf.gz
tabix -p vcf chr22.vcf.gz
zcat chr22.vcf.gz|grep -v ^# >  chr22.txt
md5sum chr22.txt

The results,the last person are different

22      16555011        .       C       T       .       .       AC=36;AN=866    GT      0|0     0|0     0|1     0|1     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|1     0|0     0|0     0|0     0|0     0|0     0|1     0|0     1|0                      
22      16555011        .       C       T       .       .       AC=36;AN=866    GT      0|0     0|0     1|0     1|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     1|0     0|0     0|0     0|0     0|0     0|0     1|0     0|0     1|0                         

Thanks in advanced!

zhangpicb commented 8 months ago

Dear @srubinacci @jaredo @diogomribeiro @odelaneau

Results are not reproducible. This is a bug or something else?

Thanks in advanced!