odelaneau / shapeit4

Segmented HAPlotype Estimation and Imputation Tool
MIT License

Unable to phase chromosome X of a cohort VCF with ~2,000 samples #51

Closed: tedyun closed 3 years ago

tedyun commented 3 years ago

THIS ISSUE IS NOW OBSOLETE. PLEASE SEE ISSUE #52 INSTEAD.

=====

Hello,

I am trying to use SHAPEIT4 to phase whole genome sequencing data from a large cohort (~2,000 samples), and I'm running into issues when phasing the chromosome X VCF. As far as I know, the input VCF doesn't use any special representation for chromosome X, but SHAPEIT4 fails only on chrX, with this message: ERROR: No variants to be phased in [...vcf.gz]. It works well for the other chromosomes.

This issue is reproducible with the public 30x WGS release of 1000 Genomes Project phase 3 by the New York Genome Center. You can download either the variant calls by DeepVariant+GLnexus at this link in Google Cloud (+ .tbi), or the calls by GATK at this link in Google Cloud (+ .tbi) or the official FTP, and run the following command (with the genetic map file included in SHAPEIT4):

$ shapeit4 \
  --input cohort-chrX.release.vcf.gz \
  --map chrX.b38.gmap.gz \
  --region chrX \
  --output cohort-chrX.release.phased.vcf.gz \
  --thread $(nproc) \
  --log shapeit4_output_dvglx_chrX.txt \
  --sequencing

The full output log can be found here. I'm running this on a Debian/Ubuntu machine with an Intel Xeon CPU. The SHAPEIT4 binary was compiled with htslib v1.9.

I have tried the following to fix this issue, but every attempt failed with the same error:

  1. Raising the --pbwt-mdr value.
  2. Changing the genetic map file to use chrX instead of X to match the chromosome name in VCF.
  3. Manually changing the chromosome name from chrX to chr1 in both VCF and the genetic map file.
  4. Converting all missing genotype calls (./.) to 0/0.
  5. Skipping the --map flag to use the default flat linkage structure.
  6. Restricting the number of samples to 1000.
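
Since the error says there are no variants to phase, one check that does not depend on SHAPEIT4 is to count the data records per contig name in the input VCF and confirm that the name passed to --region (chrX here) actually appears with a non-zero count. The following is a minimal sketch, not something from the SHAPEIT4 documentation: count_records_per_contig is a hypothetical helper name, and it assumes the VCF is gzip- or bgzip-compressed and that gzip, awk, and sort are available.

```shell
# Hypothetical helper (not part of SHAPEIT4): print "<contig><TAB><record count>"
# for every contig that has at least one data record in a (b)gzipped VCF.
count_records_per_contig() {
  # gzip -dc can read bgzip output, since bgzip files are valid gzip streams.
  gzip -dc "$1" |
    awk -F '\t' '
      !/^#/ { n[$1]++ }                          # skip header lines, count by CHROM column
      END   { for (c in n) print c "\t" n[c] }   # one line per contig seen
    ' |
    sort
}

# Usage (file name taken from the command above):
#   count_records_per_contig cohort-chrX.release.vcf.gz
```

If the contig passed to --region is missing from the output, or is spelled differently (for example X rather than chrX), the input file rather than the phasing step would be the first thing to look at.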

I'm using SHAPEIT v4.1.2 for this run, but I also tried v4.2.0 and it failed with the same error message. For reference, I also tried phasing the same VCF with Eagle v2.4.1, and it completed without an error.

I'm not sure if I missed anything obvious, but I have run out of ideas to try. It'd be great if I could get some advice on where else to look to investigate this issue.

Thank you very much for your work in developing this awesome software.

Best, Ted

tedyun commented 3 years ago

It turns out my binary must have had some issues: once I recompiled SHAPEIT4 and htslib from clean source, I could get past the initialization step. Phasing still fails in the HMM computation step, but since that is a different issue, I'll open a new bug after more investigation. Thank you for taking a look.