odelaneau / GLIMPSE

Low Coverage Calling of Genotypes
MIT License

error in glimpse phase step #185

Closed felassadi closed 11 months ago

felassadi commented 1 year ago

Hello,

I am running GLIMPSE on low-coverage (1x) whole-genome samples. After QC and trimming, I aligned the reads to the reference genome GCF_000001405.40_GRCh38.p14_genomic.fna using bwa mem. I then prepared the reference haplotypes from gnomad.genomes.v3.1.2.sites for each chromosome. I followed the documentation as written, except that I skipped the step that removes the target sample from the reference panel, because my samples come from local patients who are not included in the gnomAD haplotypes.

After generating the VCF containing GLs, I ran glimpse-phase and got this error: "Number of samples in the reference panel: 0, exclusive samples (not in the target panel): 0"

I also tried the same files with the 1000G haplotype reference panel (aligned to reference genome build 38) and got this error: "No variants to be imputed in files".

My BAM files look fine, and the MD5 checksums of the gnomAD files match those on the website. I repeated the alignment and prepared the haplotypes again, but got the same error. Can you please advise what the problem could be? Thank you.

srubinacci commented 12 months ago

Hi, you will need to post a log file so that I can try to understand where the problem is. A very common user error is the region encoding, e.g. chr22 instead of 22 for GRCh37. But this is pretty much a blind guess, so please post a log file.

Best,

Simone
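The region-encoding mismatch mentioned above (e.g. chr22 versus 22) can be checked before running any imputation step. What follows is only a minimal sketch in plain Python, not part of GLIMPSE: the function names `contig_names` and `naming_mismatch` and the two tiny in-memory VCFs are hypothetical, and it assumes uncompressed, tab-separated VCF text.

```python
def contig_names(vcf_lines):
    """Return the set of chromosome names found in the data lines of a VCF."""
    names = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        names.add(line.split("\t", 1)[0])  # first column is CHROM
    return names


def naming_mismatch(target_lines, reference_lines):
    """Return (names only in target, names only in reference)."""
    target = contig_names(target_lines)
    reference = contig_names(reference_lines)
    return target - reference, reference - target


# Hypothetical one-record files: the target uses "chr22", the panel uses "22".
HEADER = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
target = HEADER + ["chr22\t100\t.\tA\tG\t.\t.\t."]
reference = HEADER + ["22\t100\t.\tA\tG\t.\t.\t."]

only_target, only_reference = naming_mismatch(target, reference)
print("only in target:", only_target)
print("only in reference:", only_reference)
```

If both returned sets are empty, the two files use the same chromosome names; any name that appears in only one set points to a naming difference worth fixing before imputation.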

felassadi commented 12 months ago

Thanks, Dr. Simone, for your reply. Yes, I made sure the chromosome names are the same in the input and reference files, and I checked the files again after renaming, so the issue is not there. I looked through previous issues on GitHub and found that someone had the same problem, and you asked them to use bcftools query -l to check the number of samples in the reference panel. When I tried it, I found no samples at all, although I am using the gnomAD haplotype panels. Do you think it would work if I added random sample IDs to the reference panel files?
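For readers who want to see what the `bcftools query -l` check is looking at: in the VCF format, the `#CHROM` header line has eight fixed columns, followed by `FORMAT` and one column per sample only when genotype data is present. The sketch below reproduces that check in plain Python. It is an illustration under assumptions, not the bcftools implementation; the function name `sample_ids` and the sample names `sampleA` and `sampleB` are hypothetical.

```python
FIXED_COLUMNS = 8  # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO


def sample_ids(vcf_lines):
    """Return the sample IDs declared on the #CHROM header line of a VCF."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # Column 9 is FORMAT; sample IDs start at column 10.
            return columns[FIXED_COLUMNS + 1:]
    raise ValueError("no #CHROM header line found")


# Hypothetical headers: a sites-only file and a file with two samples.
sites_only = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
with_genotypes = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB",
]

print(len(sample_ids(sites_only)))      # number of samples in the sites-only file
print(sample_ids(with_genotypes))       # sample IDs in the genotype-bearing file
```

A file whose header declares no samples carries no per-sample genotype columns at all, which is consistent with a reported sample count of 0.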

srubinacci commented 12 months ago

Hi,

Thanks for the explanation. I believe you are trying to use a file that contains only allele frequencies as a reference panel. Unless you have specific access to gnomAD, the publicly available version contains only allele frequencies and no genotype data. GLIMPSE therefore correctly detects that there are no samples in the reference panel.

If you want to use publicly available data, you might want to use the 1000 Genomes Project or the recent 1000GP+HGDP release as a reference panel for imputation instead (the latter is available here: gs://gcp-public-data--gnomad/resources/hgdp_1kg/phased_haplotypes_v2/).

The company Gencove has published an evaluation of imputation accuracy using this dataset here: https://gencove.com/blog/an-updated-evaluation-of-reference-panel

Hope this solves the issue. Next time you encounter a problem, please include log files in the message. I would have spotted this immediately, just by reading the filename of your reference panel files and the error message.

Best,

Simone

felassadi commented 12 months ago

Thank you, Dr. Simone, for your advice. Sorry that I didn't attach a log file, but I didn't have one. I was trying to use the gnomAD v3.1.2 dataset (the variant dataset files contain all subsets: non-cancer, non-neuro, non-v2, non-TOPMed, controls/biobanks, 1KG, and HGDP). I wanted to use it because it is more comprehensive. However, it didn't work for me, as GLIMPSE needs to recognize sample IDs in the panel. I will follow your advice and try the HGDP + 1KG callset only. I will contact you again if it doesn't work for me.

srubinacci commented 11 months ago

Hi, just to clarify, the publicly available version of gnomAD does not contain genotype data. The issue is not that GLIMPSE fails to recognise the sample IDs; the problem is that the file contains only summary statistics. HGDP + 1KG is the best option to date among publicly available data.

Best,

Simone