klzhang2022 commented 1 year ago

When we execute GLIMPSE2_phase, each chunk reports the same error: 0mStates for individual 0 are zero. Error during selection. The three steps of step2_script_reference_panel.sh, step3_script_chunk.sh, and step4_script_split_reference.sh before step5_script_impute.sh are normal, and we don’t know where the problem is.

When performing imputation, the reference file is a 28-sample bcf file, and the file to be imputed is a 20-sample 1× bam file obtained by downsampling, and the above error is displayed. We also tried the reference file with 988 samples, the error is the same.

Because we are a group of pigs, 18 chromosomes run in a loop, and we set ne to 100. The specific execution code is as follows:

!/bin/bash

mkdir -p GLIMPSE_impute

REF=reference_panel/split/sorted_1_downsampled_DY-1_20 BAM=1_DY-1_20/sorted_1_downsampled_DY-1_20.bam

for i in {1..18} do while IFS="" read -r LINE || [ -n "$LINE" ]; do printf -v ID "%02d" $(echo $LINE | cut -d" " -f1) IRG=$(echo $LINE | cut -d" " -f3) ORG=$(echo $LINE | cut -d" " -f4) CHR=$(echo ${LINE} | cut -d" " -f2) REGS=$(echo ${IRG} | cut -d":" -f 2 | cut -d"-" -f1) REGE=$(echo ${IRG} | cut -d":" -f 2 | cut -d"-" -f2) OUT=GLIMPSE_impute/1_downsampled_DY-1_20_imputed /public/zhangkl/software/glimpse/phase/bin/GLIMPSE2phase --bam-file ${BAM} --reference ${REF}${CHR}${REGS}${REGE}.bin --ne 100 --output ${OUT}${CHR}${REGS}_${REGE}.bcf done < chunks.chr${i}.txt done

klzhang2022 commented 1 year ago

The running log is as follows:

[GLIMPSE2] Phase and impute low coverage sequencing data

Authors : Simone RUBINACCI & Olivier DELANEAU, University of Lausanne
Contact : simone.rubinacci@unil.ch & olivier.delaneau@unil.ch
Version : GLIMPSE2_phase v2.0.0 / commit = 2555fb0 / release = 2023-02-07
Citation : BiorXiv, (2022). DOI: https://doi.org/10.1101/2022.11.28.518213
: Nature Genetics 53, 120–126 (2021). DOI: https://doi.org/10.1038/s41588-020-00756-0
Run date : 30/03/2023 - 11:22:31

Files:

Input BAM/CRAM : [1_DY-1_20/sorted_1_downsampled_DY-1_20.bam]
Reference binary : [reference_panel/split/sorted_1_downsampled_DY-1_20_1_1_11658833.bin]
Output file : [GLIMPSE_impute/1_downsampled_DY-1_20_imputed_1_1_11658833.bcf]
Output format : [BCF format | ZLIB compression]

GLIMPSE_phase parameters:

Imputation model : [Reference panel imputation]
Input region : [Given by binary reference panel]
Output region : [Given by binary reference panel]
Sparse MAF : [Given by binary reference panel]
Recombination rates : [Given by binary reference panel]
Ploidy : [Only diploid samples in region]
Keep monom. ref sites: [NO]

Model parameters:

Burnin iterations : [5]
Main iterations : [15]
Ne [eff. pop. size] : [100000]
Phase error rate : [0.0001]
Imputation error rate: [1e-12]
Min value for hap GLs: [1e-10]

Selection parameters:

K init : [1000]
K pbwt : [2000]
PBWT depth : [12]
PBWT modulo (cM) : [0.1]
State list : [No list provided]

Genotype calling:

Calling model : [standard]
Perform calling]

BAM/CRAM filters and options:

Min mapping quality : [10]
Min base quality : [10]
Max depth : [40]
Keep failed QC : [NO]
Keep orphan reads : [NO]
Keep duplicate reads : [NO]
Keep supp. alignment : [NO]
Check pairing : [NO]
Ignore orientation : [NO]
Illumina-1.3+ : [NO]

Other parameters

Seed : [15052011]
Threads : [1]

Initialisation:

Binary reference panel parsing ...
Binary reference panel parsing [done] (0.07s)
Binary reference panel [Nrh=56] [L=33001] [Lrare= 0 (0.0%) - Lcommon= 33001 (100.0%)]
Input region : [1:1-11658833]
Output region : [1:1-10552050]
Sparse MAF : [0.001]
Reading BAM files
Reading BAM files [100%]
Reading BAM files done (3.17s)
Coverage statistics reported in: [GLIMPSE_impute/1_downsampled_DY-1_20_imputed_1_1_11658833_stats_coverage.txt.gz]
Avg. sequencing coverage: 13.840005
RARE transpose Ref (0.00s)
Region spans 11657752 bp and 11.66 cM
No PBWT allocated (Kpbwt=0 or Kpbwt >= n_ref_haps)

Initializing iteration

Init rare Tar [#non-ref gls: 0] (0.00s)
HMM imputation [#states=56.0 / %poly=100.0%] (0.04s)

Burn-in iteration [1/5]

HAP update Tar (0.00s)
RARE transpose Tar (0.00s)
No PBWT selection (Kpbwt=0 or Kpbwt >= n_ref_haps)

[31mERROR: [0mStates for individual 0 are zero. Error during selection.

klzhang2022 commented 1 year ago

At first, we thought that the information contained in the bam file was too little, but looking at the number of SNPs called out by the bam file, 18 chromosomes contained a total of 1,161,192 SNPs, and chromosome 1 contained 97,582 SNPs, which we thought reached the minimum for imputation need.

srubinacci commented 1 year ago

Hi, I think this might the same bug reported here (#121) that is now fixed in the latest commit version. Can you please compile the latest master version and see if that solves the problem? Otherwise let me know if you need new static binaries.

klzhang2022 commented 1 year ago

Thanks for your reply, after using the new version, the imputation works successfully.

We have another question, how to run multi-sample imputation (multiple bam files) ? Set up a multi-sample for loop in the shell, and finally merge the imputed bcf file of each chromosome of each sample through "bcftools merge sample1.bcf sample2.bcf ... sampleN.bcf -o merged.bcf" ?

srubinacci commented 1 year ago

You can either have a multi-sample genotype likelihood file in a single VCF/BCF file or use the --bam-list option, where you specify a text file with the bam file locations (and optionally sample names as a second column)

klzhang2022 commented 1 year ago

Thank you for your reply, according to your recommendation, we have solved the problem.

Here, we have two points to confirm with you:

When using GLIMPSE2 for low-depth sequencing imputation, there are two options for reference files: VCF/BCF files with AC/AN fields, VCF/BCF needs to be indexed, and --input-region and --output-region must be specified when imputation; the bin (binary format) obtained by VCF/BCF through chunk and split_reference, and the imputation area cannot be specified at this time.

There are two options for imputation files: VCF/BCF file containing the genotype likelihoods; BAM/CRAM file; VCF/BCF and BAM/CRAM must have an index file. For BAM/CRAM with multiple samples, we can use --bam -list parameter.

Two kinds of reference files and two kinds of files to be imputed can be combined arbitrarily.

We want to put GLIMPSE2 on the backend of the database website to provide users with imputation services. The imputation area is specified by the user. Suppose we receive the imputation area input by the user from the browser as chr1:1-100000. When imputation, can we set the same value as chr1:1-100000 for --input-region and --output-region, which means without buffers.

srubinacci commented 1 year ago

Hi, Sorry for coming back only now. Not sure I follow and I understood the problem, my apologies in case.

For 2. you will only need to generate the chunks once for a specific reference panel. Once that is done, these chunk can be used for every user that want to perform imputation on that panel. Except very specific cases you do not want to let the user to specify the imputation region, you can fully precompute that and have the reference panel stored as a set bin files.

If for some specific reason you need the user to specify the region (be careful, this does not seem a normal use-case), you can just use the vcf.gz/bcf file format as a reference panel. However, I feel like you might not need this latter case.

Best,

Simone

odelaneau / GLIMPSE

GLIMPSE2_phase has an error: 0mStates for individual 0 are zero. Error during selection. #143

!/bin/bash

Burnin iterations : [5]

Main iterations : [15]

Threads : [1]