odelaneau / shapeit4

Segmented HAPlotype Estimation and Imputation Tool
MIT License

Unable to phase chromosome X of a cohort VCF with ~2,000 samples (updated) #52

Open tedyun opened 3 years ago

tedyun commented 3 years ago

Hello,

I am trying to use SHAPEIT4 to phase a large cohort (~2,000 samples) of whole genome sequencing data, and I'm running into issues when phasing the chromosome X VCFs. As far as I know, the input VCFs don't use any special representation for chromosome X, but SHAPEIT4 fails only on chrX, without any error message, while it works well on the other chromosomes. It always seems to fail in the "HMM computations" step of the first burn-in iteration, "Burn-in iteration [1/5]".

This issue is reproducible with the public 30x WGS release of the 1000 Genomes Project phase 3 samples by the New York Genome Center. You can download either the variant calls by DeepVariant+GLnexus at this link in Google Cloud (+ .tbi), or the calls by GATK at this link in Google Cloud (+ .tbi) or the official FTP, and run the following command (with the genetic map file included in SHAPEIT4):

$ shapeit4 \
  --input CCDG_13607_B01_GRM_WGS_2019-02-19_chrX.recalibrated_variants.vcf.gz \
  --map chrX.b38.gmap.gz \
  --region chrX \
  --output phased_1kgp_gatk_chrX.vcf.gz \
  --thread $(nproc) \
  --log gatk_shapeit413_chrX.txt \
  --sequencing

The full output log, produced with SHAPEIT v4.1.3, can be found here. I didn't see any error message other than "Killed". I tried this on multiple Debian/Ubuntu machines with Intel Xeon CPUs. The SHAPEIT4 binary was compiled with htslib v1.9.

I have tried the following to fix this issue, but none of it worked:

  1. Changing --pbwt-mdr value to a higher value.
  2. Converting all missing genotype calls (./.) to 0/0.
  3. Skipping the --map flag to use the default flat linkage structure.

I'm using SHAPEIT v4.1.3 for this run, but I also tried v4.2.1 and it failed with the same error message. Just as a reference, I also tried phasing the same VCF with Eagle v2.4.1 and it did work without an error.

It'd be great if I can get some advice on how to fix this issue.

Thank you very much for your work in developing this awesome software.

Best, Ted

odelaneau commented 3 years ago

Hi Ted,

Thanks for your detailed message.

Just a question: did you monitor RAM usage? Phasing >4M variants in a single shot probably requires a huge amount of RAM. One way to see if this is causing the problem is to run a smaller chunk of the data using --region chrX:start-end.
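As a rough sketch of that test (not something from the SHAPEIT4 documentation), the snippet below prints one command per fixed-size chrX chunk, each wrapped in GNU `/usr/bin/time -v` so that the peak memory ("Maximum resident set size") is reported. `chunk_regions` is a hypothetical helper, the 20 Mb chunk size is an arbitrary assumption, and 156,040,895 bp is the GRCh38 length of chrX. The commands are only echoed here; remove the `echo` to actually run them.

```shell
# chunk_regions CHROM LENGTH SIZE
# Hypothetical helper (not part of SHAPEIT4): prints CHROM:start-end strings
# that cover 1..LENGTH in consecutive windows of SIZE bases.
chunk_regions() {
  chrom=$1; length=$2; size=$3
  start=1
  while [ "$start" -le "$length" ]; do
    end=$((start + size - 1))
    # Clip the last window to the chromosome length.
    [ "$end" -gt "$length" ] && end=$length
    echo "${chrom}:${start}-${end}"
    start=$((end + 1))
  done
}

# Dry run: print one shapeit4 command per chunk instead of executing it.
# 156040895 = chrX length in GRCh38; 20000000 is an arbitrary chunk size.
for region in $(chunk_regions chrX 156040895 20000000); do
  out="phased_$(echo "$region" | tr ':' '_').vcf.gz"
  echo /usr/bin/time -v shapeit4 \
    --input CCDG_13607_B01_GRM_WGS_2019-02-19_chrX.recalibrated_variants.vcf.gz \
    --map chrX.b38.gmap.gz \
    --region "$region" \
    --output "$out" \
    --sequencing
done
```

If a single chunk completes while the whole-chromosome run is "Killed", memory is the likely culprit, since a bare "Killed" is what the Linux out-of-memory killer typically leaves behind. Chunks phased independently like this are only useful for the memory check: their haplotypes are not consistent across chunk boundaries, so the outputs cannot simply be concatenated.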

Please let me know if it works. If not, I'll have a look at the data.

Best,

sandra-selfdecode commented 3 years ago

Should chrX be forced diploid, or is it ok to have haploid males and diploid females?

odelaneau commented 3 years ago

Should be fine if you enforce diploid for males.
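One way to enforce diploid calls, sketched here under stated assumptions rather than as an official recipe, is to rewrite every haploid genotype as the matching homozygous diploid genotype before phasing. `force_diploid` is a hypothetical helper name. It reads an uncompressed VCF on stdin, assumes GT is the first FORMAT subfield (as the VCF specification requires when GT is present), and leaves calls that are already diploid untouched.

```shell
# force_diploid: hypothetical helper, not a SHAPEIT4 or bcftools command.
# Rewrites haploid GT values as homozygous diploid ones:
#   "0" -> "0/0", "1" -> "1/1", "." -> "./."
force_diploid() {
  awk 'BEGIN { FS = OFS = "\t" }
    /^#/ { print; next }                      # pass header lines through
    {
      split($9, fmt, ":")
      if (fmt[1] != "GT") { print; next }     # no GT in this record
      for (i = 10; i <= NF; i++) {            # sample columns start at 10
        n = split($i, f, ":")
        # A genotype with no "/" or "|" separator is haploid: duplicate it.
        if (index(f[1], "/") == 0 && index(f[1], "|") == 0)
          f[1] = f[1] "/" f[1]
        s = f[1]
        for (j = 2; j <= n; j++) s = s ":" f[j]
        $i = s
      }
      print
    }'
}

# Example usage (file names are placeholders):
#   bcftools view in.chrX.vcf.gz | force_diploid | bgzip -c > in.chrX.diploid.vcf.gz
```

After phasing, the two haplotypes of each male outside the pseudoautosomal regions should be identical, so either one can be kept for downstream analysis.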

Proper chrX support has been on my to-do list for too long... Sorry.

Youpu-Chen commented 9 months ago

Hi there. I was wondering whether forcing male chrX genotypes to diploid is an issue. Would this cause any problems? I am doing some popgen analysis. If you could give me a hint, I'd really appreciate it.

Also, did the old version allow a ploidy-sensitive mode? See: https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#gettingstarted