odelaneau / GLIMPSE

Low Coverage Calling of Genotypes
MIT License
140 stars 26 forks source link

Imputation rate #193

Closed acty2323 closed 8 months ago

acty2323 commented 11 months ago

Hi,

I would like to ask, in the imputed data, less than 10% of the SNPs are in both the imputed vcf and the reference vcf. Is this imputation rate reasonable?

Thanks Penny

srubinacci commented 11 months ago

Hi Penny,

No, it sounds pretty suspicious. This should not happen if genotype likelihoods have been computed at all reference panel positions. I recommend you to recompute genotype likelihoods at each variant in the reference panel, as specified in the tutorial. Every variant in the reference panel should be present in your genotype likelihoods (potentially with missing GT and PLs) file. This will result in an imputed file with all the original variants.

Can you do this?

Simone

srubinacci commented 11 months ago

I have just noticed from a previous answer that you actually don't use genotype likelihoods, but allow GLIMPSE to call them from CRAM files.

Then there's definitively something not working properly. My guess is that you need to double check your "split_reference" jobs. I think that somehow that step did not work well.

Best,

Simone

acty2323 commented 11 months ago

Hi,

Sorry, I am following the tutorial, but I can't find which step is wrong. Below is my code, could you tell me how to correct it or should I check the files I am using?

## Remove NA12878 (and family) from the reference panel and perform basic QC
bcftools norm -m -any CCDG_14151_B01_GRM_WGS_2020-08-05_chr16.filtered.shapeit2-duohmm-phased.vcf.gz -Ou --threads 4 |
bcftools view -m 2 -M 2 -v snps -s ^NA12878,NA12891,NA12892 --threads 4 -Ob -o chr16/1000GP.chr16.noNA12878.vcf.gz; done
bcftools index -f chr16/1000GP.chr16.noNA12878.vcf.gz --threads 4

## Extract sites from the reference panel
bcftools view -G -Oz -o chr16/1000GP.chr16.noNA12878.sites.vcf.gz chr16/1000GP.chr16.noNA12878.vcf.gz
bcftools index -f chr16/1000GP.chr16.noNA12878.sites.vcf.gz

## Split the genome into chunks
GLIMPSE2_chunk --input chr16/1000GP.chr16.noNA12878.sites.vcf.gz --map chr16.b38.gmap.gz --region chr16 --window-cm 0.05 --sequential --output chunks_chr${i}.txt

## Split_reference
MAP=chr16.b38.gmap.gz
VCF=chr16.filtered.chgchr.split.vqsr.AB.vcf.gz
REF=chr16/1000GP.chr16.noNA12878.vcf.gz
while IFS="" read -r LINE || [ -n "$LINE" ];
do
  printf -v ID "%02d" $(echo $LINE | cut -d" " -f1)
  IRG=$(echo $LINE | cut -d" " -f3)
  ORG=$(echo $LINE | cut -d" " -f4)

  GLIMPSE2_split_reference --reference ${REF} --map ${MAP} --input-region ${IRG} --output-region ${ORG} --output chr16/1000GP.chr16.noNA12878
done < chunks_chr16.txt

Thanks Penny

srubinacci commented 10 months ago

Sorry Penny, cannot tell you where the error is here.

You might need to look at the log of your split_reference jobs (re-running it if necessary).