odelaneau / GLIMPSE

Low Coverage Calling of Genotypes
MIT License
129 stars 26 forks source link

GLIMPSE2_phase has an error: 0mStates for individual 0 are zero. Error during selection. #143

Closed klzhang2022 closed 1 year ago

klzhang2022 commented 1 year ago

When we execute GLIMPSE2_phase, each chunk reports the same error: 0mStates for individual 0 are zero. Error during selection. The three steps of step2_script_reference_panel.sh, step3_script_chunk.sh, and step4_script_split_reference.sh before step5_script_impute.sh are normal, and we don’t know where the problem is.

When performing imputation, the reference file is a 28-sample bcf file, and the file to be imputed is a 20-sample 1× bam file obtained by downsampling, and the above error is displayed. We also tried the reference file with 988 samples, the error is the same.

Because we are a group of pigs, 18 chromosomes run in a loop, and we set ne to 100. The specific execution code is as follows:

!/bin/bash

mkdir -p GLIMPSE_impute

REF=reference_panel/split/sorted_1_downsampled_DY-1_20 BAM=1_DY-1_20/sorted_1_downsampled_DY-1_20.bam

for i in {1..18} do while IFS="" read -r LINE || [ -n "$LINE" ]; do printf -v ID "%02d" $(echo $LINE | cut -d" " -f1) IRG=$(echo $LINE | cut -d" " -f3) ORG=$(echo $LINE | cut -d" " -f4) CHR=$(echo ${LINE} | cut -d" " -f2) REGS=$(echo ${IRG} | cut -d":" -f 2 | cut -d"-" -f1) REGE=$(echo ${IRG} | cut -d":" -f 2 | cut -d"-" -f2) OUT=GLIMPSE_impute/1_downsampled_DY-1_20_imputed /public/zhangkl/software/glimpse/phase/bin/GLIMPSE2phase --bam-file ${BAM} --reference ${REF}${CHR}${REGS}${REGE}.bin --ne 100 --output ${OUT}${CHR}${REGS}_${REGE}.bcf done < chunks.chr${i}.txt done

klzhang2022 commented 1 year ago

The running log is as follows:

[GLIMPSE2] Phase and impute low coverage sequencing data

Files:

GLIMPSE_phase parameters:

Model parameters:

Selection parameters:

Genotype calling:

BAM/CRAM filters and options:

Other parameters

Initialisation:

Initializing iteration

Burn-in iteration [1/5]

ERROR: States for individual 0 are zero. Error during selection.

klzhang2022 commented 1 year ago

At first, we thought that the information contained in the bam file was too little, but looking at the number of SNPs called out by the bam file, 18 chromosomes contained a total of 1,161,192 SNPs, and chromosome 1 contained 97,582 SNPs, which we thought reached the minimum for imputation need.

srubinacci commented 1 year ago

Hi, I think this might the same bug reported here (#121) that is now fixed in the latest commit version. Can you please compile the latest master version and see if that solves the problem? Otherwise let me know if you need new static binaries.

klzhang2022 commented 1 year ago

Thanks for your reply, after using the new version, the imputation works successfully.

We have another question, how to run multi-sample imputation (multiple bam files) ? Set up a multi-sample for loop in the shell, and finally merge the imputed bcf file of each chromosome of each sample through "bcftools merge sample1.bcf sample2.bcf ... sampleN.bcf -o merged.bcf" ?

srubinacci commented 1 year ago

You can either have a multi-sample genotype likelihood file in a single VCF/BCF file or use the --bam-list option, where you specify a text file with the bam file locations (and optionally sample names as a second column)

klzhang2022 commented 1 year ago

Thank you for your reply, according to your recommendation, we have solved the problem.

Here, we have two points to confirm with you:

  1. When using GLIMPSE2 for low-depth sequencing imputation, there are two options for reference files: VCF/BCF files with AC/AN fields, VCF/BCF needs to be indexed, and --input-region and --output-region must be specified when imputation; the bin (binary format) obtained by VCF/BCF through chunk and split_reference, and the imputation area cannot be specified at this time.

There are two options for imputation files: VCF/BCF file containing the genotype likelihoods; BAM/CRAM file; VCF/BCF and BAM/CRAM must have an index file. For BAM/CRAM with multiple samples, we can use --bam -list parameter.

Two kinds of reference files and two kinds of files to be imputed can be combined arbitrarily.

  1. We want to put GLIMPSE2 on the backend of the database website to provide users with imputation services. The imputation area is specified by the user. Suppose we receive the imputation area input by the user from the browser as chr1:1-100000. When imputation, can we set the same value as chr1:1-100000 for --input-region and --output-region, which means without buffers.
srubinacci commented 1 year ago

Hi, Sorry for coming back only now. Not sure I follow and I understood the problem, my apologies in case.

For 2. you will only need to generate the chunks once for a specific reference panel. Once that is done, these chunk can be used for every user that want to perform imputation on that panel. Except very specific cases you do not want to let the user to specify the imputation region, you can fully precompute that and have the reference panel stored as a set bin files.

If for some specific reason you need the user to specify the region (be careful, this does not seem a normal use-case), you can just use the vcf.gz/bcf file format as a reference panel. However, I feel like you might not need this latter case.

Best,

Simone