odelaneau / shapeit5

Segmented HAPlotype Estimation and Imputation Tool
https://odelaneau.github.io/shapeit5/
MIT License
56 stars 9 forks source link

Can only run a few base positions on chromosome X #102

Open milena-andreuzo opened 3 weeks ago

milena-andreuzo commented 3 weeks ago

Hello! I'm new to SHAPEIT and am trying to use it to phase my array data.

Initially I installed SHAPEIT5 from conda on a docker image and ran phase_common on a few chromosomes already (such as chr1, chr19, chr22) using AWS instances. Documentation says that for running on chromosome X the only difference from the other chromosomes is the presence of a text file containing the haploid samples with --haploids argument. Despite that, I haven't been able to run phasing on chromosome X. Describing my first test:

Resulting log:

[SHAPEIT5] phase_common (jointly phase multiple common markers)
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : olivier.delaneau@gmail.com
  * Version       : 5.1.1 / commit = 5.1.1 / release = 2023-05-16
  * Run date      : 06/06/2024 - 15:12:30

Files:
  * Input         : [sample.vcf.gz]
  * Reference     : [reference_panels/GRCh37/chrX.vcf.gz]
  * Haploids      : [haploids.txt]
  * Genetic Map   : [gmaps/GRCh37/chrX.gmap.gz]
  * Output        : [sample.X.phased.bcf]
  * Output format : [bcf]

Parameters:
  * Seed    : 15052011
  * Threads : 4 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : [window = 4cM / depth = auto / modulo = auto / mac = 5 / missing = 0.1]
  * HMM     : [window = 4cM / Ne = 15000 / Recombination rates given by genetic map]

Reading genotype data:
305 Segmentation fault      (core dumped) 

I don't believe it is a memory issue, since chromosome 1 and the others were processed on the same conditions without any problems.

I tried many solutions already... some of them:

None of the above worked. I could however get a result when running on one or two positions... This worked:

SHAPEIT5_phase_common --input sample.vcf.gz --reference reference_panels/GRCh37/chrX.vcf.gz --region X:38009121,X:106244767 --map gmaps/GRCh37/chrX.gmap.gz --haploids haploids.txt --output sample.X.phased.bcf

[SHAPEIT5] phase_common (jointly phase multiple common markers)
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : olivier.delaneau@gmail.com
  * Version       : 5.1.1 / commit = 5.1.1 / release = 2023-05-16
  * Run date      : 10/06/2024 - 19:40:39

Files:
  * Input         : [sample.vcf.gz]
  * Reference     : [reference_panels/GRCh37/chrX.vcf.gz]
  * Haploids      : [haploids.txt]
  * Genetic Map   : [gmaps/GRCh37/chrX.gmap.gz]
  * Output        : [sample.X.phased.bcf]
  * Output format : [bcf]

Parameters:
  * Seed    : 15052011
  * Threads : 1 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : [window = 4cM / depth = auto / modulo = auto / mac = 5 / missing = 0.1]
  * HMM     : [window = 4cM / Ne = 15000 / Recombination rates given by genetic map]

Reading genotype data:
  * VCF/BCF scanning done (1.83s)
      + Variants [#sites=1 / region=X:38009121,X:106244767]
         - 1 sites removed in main panel [not in reference panel]
      + Samples [#target=24 / #reference=2504]
  * VCF/BCF parsing done (1.40s)
      + Genotypes [0/0=91.667%, 0/1=8.333%, 1/1=0.000%, ./.=0.000%, 0|1=0.000%]
      + Reference haplotypes [0=98.383%, 1=1.617%]
  * Haploid parsing [n=1747] (0.00s)
  * Haploid sample mapping (0.00s)
      + #haploids = 7 / #diploids = 17
      + Hets set as missing: n=0 (0.000%)

Setting up genetic map:
  * GMAP parsing [n=89589] (0.57s)
  * cM interpolation [s=0 / i=1] (0.00s)
  * Region length [1 bp / 0.0 cM]
  * HMM parameters [Ne=15000 / Error=0.0001 / #rare=0]

Initializing data structures:
  * Impute monomorphic [n=0] (0.00s)
  * HAP update (0.00s)
  * H2V transpose (0.00s)
  * PBWT parameters auto setting : [modulo = 0.049 / depth = 6]
  * PBWT initialization [#eval=1 / #select=1 / #chunk=1] (0.00s)
  * PBWT phasing sweep (0.00s)
  * Build genotype graphs [seg=24] (0.00s)

Burn-in iteration [1/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.4+/-1.5 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Burn-in iteration [2/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.2+/-1.8 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Burn-in iteration [3/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.3+/-1.6 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Burn-in iteration [4/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.5+/-1.5 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Burn-in iteration [5/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.5+/-1.5 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Pruning iteration [1/1]
  * PBWT selection (0.00s)
  * HMM computations [K=11.2+/-1.8 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)
  * Trimming [pc=0.00%]

Burn-in iteration [1/1]
  * PBWT selection (0.00s)
  * HMM computations [K=11.5+/-1.5 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Pruning iteration [1/1]
  * PBWT selection (0.00s)
  * HMM computations [K=11.5+/-1.5 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)
  * Trimming [pc=0.00%]

Burn-in iteration [1/1]
  * PBWT selection (0.00s)
  * HMM computations [K=11.2+/-1.8 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Pruning iteration [1/1]
  * PBWT selection (0.00s)
  * HMM computations [K=11.5+/-1.5 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)
  * Trimming [pc=0.00%]

Main iteration [1/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.2+/-1.8 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Main iteration [2/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.2+/-1.8 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Main iteration [3/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.5+/-1.5 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Main iteration [4/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.2+/-1.8 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Main iteration [5/5]
  * PBWT selection (0.00s)
  * HMM computations [K=11.3+/-1.6 / W=0.00Mb / US=0 / UP=0] (0.00s)
  * IBD2 tracks [#inds=1 / #tracks=1 / #merged = 0]
  * HAP update (0.00s)
  * H2V transpose (0.00s)

Finalization:
  * HAP solving (0.00s)
  * HAP update (0.00s)
  * H2V transpose (0.00s)
  * VCF/BCF writing [N=24 / L=1] (0.01s)
  * Indexing files
  * Total running time = 3 seconds

I think I may be missing some silly detail... Ideally I would get the result for the full chromosome to get through with imputation... So I ran out of ideas. Can you help me? Thank you!

dtaliun commented 3 weeks ago

Hi @milena-andreuzo, When you run chrX non-PAR in chunks, did all the chunks fail or only certain ones?

milena-andreuzo commented 3 weeks ago

Hi @dtaliun! Tested the chunks here on chrX non-PAR (main and reference).

Regions X:1-27440787, X:24026048-50656185 and X:47743520-88764754 went straight to that same segmentation fault error. The others, X:78359646-115258184, X:112836207-139441710 and X:137290831-1000000000 returned sites removed from the main panel because they are not on reference panel and so had no variants to be phased:

Reading genotype data:
  * VCF/BCF scanning done (0.06s)
      + Variants [#sites=0 / region=X:137290831-1000000000]
         - 4762 sites removed in main panel [not in reference panel]

ERROR: No variants to be **phased!**

All those chunks do have variants in the input file.

milena-andreuzo commented 3 weeks ago

If I were to run chrX on shapeit5 v1.0.0/v5.0.0, what pre-processing steps would you suggest?