odelaneau / shapeit5

Segmented HAPlotype Estimation and Imputation Tool
https://odelaneau.github.io/shapeit5/
MIT License
61 stars 9 forks source link

`[W::bcf_record_check] Bad BCF record` error when `phase_rare` indexes malformed output #34

Open BEFH opened 1 year ago

BEFH commented 1 year ago

phase_rare is failing while attempting to index the output, with errors like the following printed to stderr across all chunks:

[W::bcf_record_check] Bad BCF record at 28101822: Invalid CONTIG id -1

Upon trying to index the output bcf myself, it's clear the file is malformed, with the same error happening. Looking in the output, I have isolate the place in the file where it first errors (truncated from ~ 36k samples):

chr22   28101479        .       C       A       .       .       AC=1;AN=72722   GT:PP   0|0:.   0|0:.
[W::bcf_record_check] Bad BCF record at 28101822: Invalid CONTIG id -1
chr22   28101482        .       C       T       .       .       AC=4;AN=72722   GT:PP   0|0:.   0|0:.

This was obtained by running bcftools view [output_file].bcf | less -S, searching for 28101 and scrolling down until I saw the error.

Looking at the input file, there are no variants in between:

chr22   28101479    .   C   A   614.31  PASS    AC=1;AF=1.3751e-05;AN=72722;BaseQRankSum=-0.788;DP=1232241;ExcessHet=3.0103;FS=1.137;InbreedingCoeff=0.013;MLEAC=1;MLEAF=1.348e-05;MQ=60;MQRankSum=0;QD=13.65;ReadPosRankSum=0.206;SOR=0.904;VQSLOD=13.52;culprit=MQRankSum GT  0/0 0/0
chr22   28101482    .   C   T   1831.32 PASS    AC=4;AF=5.5004e-05;AN=72722;BaseQRankSum=1.65;DP=1231891;ExcessHet=3.0115;FS=2.381;InbreedingCoeff=0.0061;MLEAC=4;MLEAF=5.392e-05;MQ=60;MQRankSum=0;QD=15.52;ReadPosRankSum=0.228;SOR=0.482;VQSLOD=13.29;culprit=MQRankSum  GT  0/0 0/0

This was obtained by running bcftools view -r chr22:28101479-28101482 [input_file].bcf > dagnostic.vcf

The full command for phase_rare was as follows:

SHAPEIT5_phase_rare   --input ADSP.WGS.SNPs.compact_filter.combined.chr22.filtered.bcf   --map resources/shapeit/maps/chr22.b38.gmap.gz   --output ADSP.WGS.SNPs.compact_filter.combined.chr22.chunk__chr22_42325256-46325257.scaffchunk__chr22_39664334-46825259.phased_rare.bcf   --thread 42   --log ADSP.WGS.SNPs.compact_filter.combined.chr22.chunk__chr22_42325256-46325257.scaffchunk__chr22_39664334-46825259.phased_rare.log   --scaffold ADSP.WGS.SNPs.compact_filter.combined.chr22.phased_common.bcf   --scaffold-region chr22:39664334-46825259   --input-region chr22:42325256-46325257

I got the following log (the HTSlib error was printed to stderr):


[SHAPEIT5] Phase_rare (phase rare variants onto a haplotype scaffold)
  * Authors       : Simone RUBINACCI & Olivier DELANEAU, University of Lausanne
  * Contact       : simone.rubinacci@unil.ch & olivier.delaneau@gmail.com
  * Version       : 5.1.1 / commit = 5.1.1 / release = 2023-05-16
  * Run date      : 31/05/2023 - 16:24:16

Files:
  * Unphased data : [ADSP.WGS.SNPs.compact_filter.combined.chr22.filtered.bcf]
  * Scaffold data : [ADSP.WGS.SNPs.compact_filter.combined.chr22.phased_common.bcf]
  * Genetic Map   : [resources/shapeit/maps/chr22.b38.gmap.gz]
  * Output data   : [ADSP.WGS.SNPs.compact_filter.combined.chr22.chunk__chr22_42325256-46325257.scaffchunk__chr22_39664334-46825259.phased_rare.bcf]
  * Output LOG    : [ADSP.WGS.SNPs.compact_filter.combined.chr22.chunk__chr22_42325256-46325257.scaffchunk__chr22_39664334-46825259.phased_rare.log]

Parameters:
  * Seed    : 15052011
  * Threads : 42 threads
  * PBWT    : [depth = 2,2 / modulo = 0.1 / mac = 2 / mdr = 0.1]
  * HMM     : [Ne = 15000 / Recombination rates given by genetic map]

Parsing specified genomic regions
  * Scaffold region  [chr22:39664334-46825259]
  * Input region  [chr22:42325256-46325257]

Reading genotype data
  * BCF scanning (10.49s)
      + Variant sites: [#scaffold=61709 | #rare=472804 | #all=534513]
      + 364951 rare variants in buffer regions excluded
      + 58173 variants found in both files [priority to scaffold]
  * HAP allocation [#scaffold=61709 / #samples=36361] (0.00s)
  * Genotype set allocation [#rare=472804 / #samples=36361] (0.04s)
  * Plain VCF/BCF parsing (111.70s)
      + Scaffold : [0/0=1953131181 0/1=194380741 1/1=96289027]
      + Rare     : [0/0=17157202585 0/1=1964700 1/1=947608 ./.=31511351]

Initializing sparse genotypes
  * 235020 missing genotypes imputed at monomorphic sites
  * Genotype set transpose V2I (0.50s)

Setting up genetic map
  * GMAP parsing [n=45141] (0.02s)
  * cM interpolation [s=7236 / i=527277] (0.04s)
  * Region length [7160576 bp / 11.9 cM]
  * HMM parameters [Ne=15000 / Error=0.0001]
  * V2H transpose (2.30s)

PBWT pass
  * PBWT initialization [#eval=61690 / #select=120] (0.00s)
  * IBD2 constraints [#inds=95 / #pairs=99] (27.08s)
  * PBWT forward selection (12.85s)
  * PBWT backward selection (14.08s)
      + #states=333.86+/-117.28
      + #collisions = 19466214 / #pushes = 15440439 / rate = 44.23%

HMM computations
  * Processing (3414.52s)
  * Genotype set transpose merge I2V [n=372364] (15.51s)
  * Solving summary:
      + #Hets=1964700 / Phased by HMM=1592336, PED=0, SING=372364
      + #Miss=31511351 / Imputed by HMM=31276331, PED=0, MONO=235020

Finalization:
  * BCF writing [Compressed / N=36361 / L=534513] (162.53s)
  * Indexing [ADSP.WGS.SNPs.compact_filter.combined.chr22.chunk__chr22_42325256-46325257.scaffchunk__chr22_39664334-46825259.phased_rare.bcf]

ERROR: Fail to index file
odelaneau commented 1 year ago

Can you check your input for position 28101822, please? It's 400 bp to the variants you reported. Just to see if there's a variant with this position.

BEFH commented 1 year ago

There's no variant there. There are variants at 28101821 and 28101823, but not 28101822. Also, the bad record is actually between 28101479 and 28101482 in the output file. There's nothing between those in the input file either.

giorkala commented 1 year ago

Hi, is there any solution to this? I'm also getting the same error,

[W::bcf_record_check] Bad BCF record at 10542449: Invalid CONTIG id -1

at the Finalization step, while phasing rare variants in our cohort (phasing the common ones was fine). I haven't tested much yet, but any help would be appreciated (happy to provide more info if needed).

Thanks, Yiorgos

srubinacci commented 1 year ago

Hi all,

I pushed a new commit on a new branch "contig". It should solve this issue: https://github.com/odelaneau/shapeit5/tree/contig

I produced two new static binaries you can find at this folder: https://www.dropbox.com/scl/fo/0gpiermhrtf7nz20pgao0/h?rlkey=ddpz9f4nhox1j77y923z4rhi3&dl=0

Can you please try to re-run phase_rare?. If that does not work, please rerun phase_common as well for the scaffolded region, and then phase_rare again?

Let me know if it solves the issue.

Simone

giorkala commented 1 year ago

Dear Simone,

I got the new binary and phasing was then complete without errors! Thanks a lot for providing the update. FYI, I didn't re-phase the common ones as I managed without (but will likely repeat all soon).

Kind regards, Yiorgos

BEFH commented 1 year ago

This is great news, and I will retest shortly.

On Wed, Jul 26, 2023 at 12:37 PM Yiorgos Kalantzis @.***> wrote:

Dear Simone,

I got the new binary and phasing was then complete without errors! Thanks a lot for providing the update. FYI, I didn't re-phase the common ones as I managed without (but will likely repeat all soon).

Kind regards, Yiorgos

— Reply to this email directly, view it on GitHub https://github.com/odelaneau/shapeit5/issues/34#issuecomment-1652155644, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZ2Z2ELAO63B6PJMRQDBA3XSFBVDANCNFSM6AAAAAAYWCE54M . You are receiving this because you authored the thread.Message ID: @.***>

BEFH commented 11 months ago

Great news! I've been testing the changes and have not had any issues. I did not have to use the new phase_common; the new phase_rare was good enough.

I used the static binaries rather than recompiling.