odelaneau / shapeit5

Segmented HAPlotype Estimation and Imputation Tool
https://odelaneau.github.io/shapeit5/
MIT License
65 stars 9 forks source link

phase_rare_static: Assertion `!isnan(GRvar_genotypes[vr][tidx].prob)' failed. #33

Open schnabelr opened 1 year ago

schnabelr commented 1 year ago

I ran into this when doing some testing and I haven't been able to figure out what the issue is so I figured I'd post. These are cows. Below is the beginning of the log for phase_common which runs fine.

[SHAPEIT5] phase_common (jointly phase multiple common markers)
  * Author        : Olivier DELANEAU, University of Lausanne
  * Contact       : olivier.delaneau@gmail.com
  * Version       : 5.1.1 / commit = 990ed0d / release = 2023-05-08
  * Run date      : 18/05/2023 - 22:48:22

Files:
  * Input         : [/cluster/VAST/schnabeltest/230101_allbos_OLD/PHASED_OUTPUT_INDIV/25/WH.PHASED.9913.UMAG2.ENSEMBL108.25.bcf]
  * Genetic Map   : [/cluster/VAST/schnabeltest/GENETIC_MAP/9913_ars1.2/recombination_map.25.gmap]
  * Output        : [/cluster/VAST/schnabeltest/230101_allbos_OLD/PHASED_OUTPUT/25/4.Common.WH.PHASED.9913.UMAG2.ENSEMBL108.25.bcf]
  * Output format : [bcf]
  * Output LOG    : [TEST007.25.4.32.Common.log]

Parameters:
  * Seed    : 15052011
  * Threads : 32 threads
  * MCMC    : 15 iterations [5b + 1p + 1b + 1p + 1b + 1p + 5m]
  * PBWT    : [window = 4cM / depth = auto / modulo = auto / mac = 5 / missing = 0.1]
  * HMM     : [window = 4cM / Ne = 1500 / Recombination rates given by genetic map]
  * FILTERS : [snp only = 0 / MAF = 0.002]

Reading genotype data:
  * VCF/BCF scanning done (39.92s)
      + Variants [#sites=198884 / region=25:19885090-25996793]
         - 62304 sites removed in main panel [multi-allelic]
         - 222099 sites removed in main panel [below MAF threshold]
      + Samples [#target=6320 / #reference=0]
  * VCF/BCF parsing done (31.16s)
      + Genotypes [0/0=88.448%, 0/1=5.162%, 1/1=5.277%, ./.=1.112%, 0|1=0.000%]

Setting up genetic map:
  * GMAP parsing [n=1165] (0.02s)
  * cM interpolation [s=138 / i=198746] (0.02s)
  * Region length [6111615 bp / 11.1 cM]
  * HMM parameters [Ne=1500 / Error=0.0001 / #rare=0]

Initializing data structures:
  * Impute monomorphic [n=0] (0.00s)
  * HAP update (5.75s)
  * H2V transpose (1.81s)
  * PBWT parameters auto setting : [modulo = 0.058 / depth = 5]
  * PBWT initialization [#eval=197995 / #select=192 / #chunk=1] (0.03s)
  * PBWT phasing sweep (37.69s)
  * Build genotype graphs [seg=21630267] (0.49s)

Below is the log from phase_rare

[SHAPEIT5] Phase_rare (phase rare variants onto a haplotype scaffold)
  * Authors       : Simone RUBINACCI & Olivier DELANEAU, University of Lausanne
  * Contact       : simone.rubinacci@unil.ch & olivier.delaneau@gmail.com
  * Version       : 5.1.1 / commit = 990ed0d / release = 2023-05-08
  * Run date      : 18/05/2023 - 23:32:17

Files:
  * Unphased data : [/cluster/VAST/schnabeltest/230101_allbos_OLD/PHASED_OUTPUT_INDIV/25/WH.PHASED.9913.UMAG2.ENSEMBL108.25.bcf]
  * Scaffold data : [/cluster/VAST/schnabeltest/230101_allbos_OLD/PHASED_OUTPUT/25/4.Common.WH.PHASED.9913.UMAG2.ENSEMBL108.25.bcf]
  * Genetic Map   : [/cluster/VAST/schnabeltest/GENETIC_MAP/9913_ars1.2/recombination_map.25.gmap]
  * Output data   : [/cluster/VAST/schnabeltest/230101_allbos_OLD/PHASED_OUTPUT/25/4.Common.WH.PHASED.9913.UMAG2.ENSEMBL108.25.bcf]
  * Output LOG    : [TEST007.25.4.32.Rare.log]

Parameters:
  * Seed    : 15052011
  * Threads : 32 threads
  * PBWT    : [depth = 2,2 / modulo = 0.1 / mac = 2 / mdr = 0.1]
  * HMM     : [Ne = 1500 / Recombination rates given by genetic map]

Parsing specified genomic regions
  * Scaffold region  [25:19885090-25996793]
  * Input region  [25:20385092-25385099]

Reading genotype data
  * BCF scanning (17.22s)
      + Variant sites: [#scaffold=198884 | #rare=182603 | #all=381487]
      + 39496 rare variants in buffer regions excluded
      + 198884 variants found in both files [priority to scaffold]
      + 62304 multi-allelic variants excluded
  * HAP allocation [#scaffold=198884 / #samples=6320] (0.00s)
  * Genotype set allocation [#rare=182603 / #samples=6320] (0.02s)
  * Plain VCF/BCF parsing (35.08s)
      + Scaffold : [0/0=1122420439 0/1=65933761 1/1=68592680]
      + Rare     : [0/0=1143404645 0/1=793882 1/1=178000 ./.=9674433]

Initializing sparse genotypes
  * 174774 missing genotypes imputed at monomorphic sites
  * Genotype set transpose V2I (0.09s)

Setting up genetic map
  * GMAP parsing [n=1165] (0.22s)
  * cM interpolation [s=138 / i=381349] (0.02s)
  * Region length [6111615 bp / 11.1 cM]
  * HMM parameters [Ne=1500 / Error=0.0001]
  * V2H transpose (1.62s)

PBWT pass
  * PBWT initialization [#eval=198884 / #select=112] (0.01s)
  * IBD2 constraints [#inds=22 / #pairs=22] (15.05s)
  * PBWT forward selection (5.87s)
  * PBWT backward selection (6.06s)
      + #states=388.05+/-203.65
      + #collisions = 1604575 / #pushes = 4058727 / rate = 71.67%

HMM computations

The full error printed to screen but not the log is:

phase_rare_static: src/containers/genotype_set/genotype_set_phasing.cpp:66: void genotype_set::phaseLiAndStephens(unsigned int, unsigned int, aligned_vector32<float>&, aligned_vector32<float>&, std::vector<unsigned int>&, float): Assertion `!isnan(GRvar_genotypes[vr][tidx].prob)' failed.
Aborted (core dumped)

I'm in the process of normalizing these files to remove the multiallelic issue but any other suggestions for things to look into would be appreciated. Bob

odelaneau commented 1 year ago

Hi Bob,

I'd be very interested in digging this further. Any chance you can share with me some dataset I can use to debug? You can email me directly (check verbose header).

Best

schnabelr commented 1 year ago

Updating this for the record and anyone that might stumble across this. Initially I thought the issue might be that there were multiallelic sites in the file. This turns out to not be the issue. I removed (properly this time) all multiallelic sites and sites with a missing ALT allele ALT='.' due to removing samples upstream and was able to reproduce the issue. It also does not appear to be due to threading as I can reproduce it with 32, 16, 8, 4 and 1 thread. Direct email sent regarding test data for debugging.

LindoNkambule commented 1 year ago

I have also encountered the same error with phase_rare_static using the chunk files for chr22 provided under the resources/chunks/b38/4cM/ directory. However, in my case, I only got the error when the # of variants found in both files [priority to scaffold] was < 100,000. When the number was > 100,000, everything ran without any issues. I will try increasing the chunk size to 6/7cM and see how it goes.

UPDATE Increasing chunk size to 6cM fixed the issue for me.

UPDATE 2 Increasing the chunk size only worked on the smaller chromosomes. I'm still getting the same error for the larger chromosomes like chr1 even when using 6cM chunk size.

BEFH commented 1 year ago

I am getting this issue with human data as well. @LindoNkambule, if you figure this out, could you please let me know and also share the code for generating the chunk ranges?

Thanks.

edit: This doesn't seem to be the case for me about the 100k on chr22. 113825 variants found in both files [priority to scaffold]

odelaneau commented 1 year ago

Thanks for all the comments and testing. I'll look at the example dataset of schnabelr and let you know guys as soon as I understand the problem.

odelaneau commented 1 year ago

One possible explanation for this problem is the lack of QC on the sequencing data prior to phasing.

Before running phase_common and phase_rare, I strongly recommend:

You can also use any other standard QC procedures that are usually performed on genotype data.

Of note, you should also use bcftools norm to transform multi-allelic variants into bi-allelic variants.

BEFH commented 1 year ago

testing for Hardy Weinberg Equilibrium and excess of heterozygosity

This is extremely multi-ancestry data with a fair bit of admixture, so looking at heterozygosity through other measures than HWE is problematic. HWE does test for excess heterozygosity, and RUTH HWE does distinguish between depleted and excess heterozygosity. However, RUTH takes a long time to run on large datasets due to the PCA and the multiple filtering steps.

excluding variants in regions with low mappability

The data are already bi-allelic, and many problematic variants are already removed based on VQSR, read depth, and other quality metrics.

edit: The data weren't as filtered as I had thought. I did filtering based on callrate and VQSR, but not HWE/het or mappability. It's now getting through, but erroring at the end with e.g. [W::bcf_record_check] Bad BCF record at 37198707: Invalid CONTIG id -1. I am currently troubleshooting that.

BEFH commented 1 year ago

I have opened a new issue (#34) with my new, possibly unrelated problem.

schnabelr commented 1 year ago

UPDATE TL;DR Doing proper QC upfront fixed my issue, at least for the first two chunks I've been testing this on. As someone who peaches about QC, I'm embarrassed I missed this.

@odelaneau Perhaps you can do some checking while loading data and if the input data shows some of the issues that trigger this error you can print a big warning message to the screen/log. A couple of obvious ones would be multiallelic data with a suggestion to run bcftools norm. If you encounter samples with high missing data rate then a suggestion to remove low call rate samples and optimally print the sample IDs. If you encounter a site(s) with low call rate then suggest filtering sites on call rate, etc. Then if possible, add a more informative error message when the error in question happens.

For those that find this bug report, below is how I fixed my issue, your mileage may vary.

bcftools view --threads $NormCPU -f PASS -e 'F_MISSING > 0.10' -Ou $InputBCF | bcftools norm --threads $NormCPU --fasta-ref $RefFasta --multiallelics -any -Ou | bcftools +mendelian -m d --trio-file $PED --rules-file $RulesFile -Ou | bcftools view -e 'ALT="." || ALT="*"' -Ou | bcftools view -e 'F_MISSING > 0.10' -Ob -o $NormOutputBCF

My $InputBCF is a typical bcf after GATK VQSR that has all variants, pass and otherwise. I also added the bcftools +mendelian step to remove any errors that are present for the trios in the data. (I really wish it did duos but apparently it doesn't) In the above command, I'm using 1 thread for everything because the +mendelian takes longer than everything upstream and ends up being the bottleneck, so it does no good to give the first view more than 1 thread nor anything after it. If you don't have the +mendelian step then you might evaluate giving the first and last view another thread for (de)compression. The other comment I'll make is that if you do any subsetting with bcftools view -S or -^S make sure you read the fine manual carefully about the order of operations. This is the second time this has bitten me and how I ended up with the commands above that appear to work properly, as opposed to combining operations in one step.

odelaneau commented 1 year ago

That's a good plan. I'll add soon a new tool that checks a few things on the data prior to phasing.

schnabelr commented 1 year ago

@odelaneau I was about to close this but this error has appeared again and I haven't been able to track down the reason for it. I did all the QC on the input data and everything has been working fine. What I am doing now is running multiple iterations (3 in the present case) of the same parameter sets to evaluate the variability between runs (among other things). I do the normal phase common, ligate, phase rare, cat rare but I do it 3 separate times. All of the phase common chunks/runs work fine and the ligate works fine. However, I am randomly getting one or more of the phase rare chunks that will fail.

I did some poking around to try to see if something was different between the two scaffold files. Below, $FailedInput and $SuccessInput are just paths to the scaffold bcf from two independent runs of the same parameters with the first failing at phase rare and the second completing phase rare successfully, for the same chunk.

# Command line "* Version       : 5.1.1 / commit = 990ed0d / release = 2023-05-08"
phase_rare_static --input $CleanFullBCF --scaffold $FailedInput --output $OutputBCF --log $Log --scaffold-region 25:36281024-42350435 --input-region 25:36823909-42350435 --map $Map --thread 32 --effective-size 1500 --pbwt-mac 2 --pbwt-modulo 0.1 --pbwt-depth-common 4 --pbwt-depth-rare 2
# ERROR
phase_rare_static: src/containers/genotype_set/genotype_set_phasing.cpp:66: void genotype_set::phaseLiAndStephens(unsigned int, unsigned int, aligned_vector32<float>&, aligned_vector32<float>&, std::vector<unsigned int>&, float): Assertion `!isnan(GRvar_genotypes[vr][tidx].prob)' failed.
(core dumped) 

# Q: How many variants in each file? A: same
bcftools view -HG $FailedInput | wc -l
1895887
bcftools view -HG $SuccessInput | wc -l
1895887

# Q: Are there any differences in the actual variants represented? A: same
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' $FailedInput >ScaffoldLoci.failed.txt
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' $SuccessInput >ScaffoldLoci.success.txt
cat ScaffoldLoci.failed.txt ScaffoldLoci.success.txt | sort | uniq -c | awk '{if($1==1) print $0}' | wc -l
0

# Q: Are there any monomorphic loci? A: No
bcftools view -HG -i 'MAF=0' $FailedInput | wc -l
0
bcftools view -HG -i 'MAF=0' $SuccessInput | wc -l
0

I also had a look at bcftools +smpl-stats -r 25:36281024-42350435 and I didn't see anything that looked out of the ordinary. At this point, I'm at a loss to figure out what is causing some of these to fail. There is clearly something in the input data that it doesn't like. Previously it was because I didn't do QC, however, I have fixed that. Now, given the exact same input data and run multiple times, there is something associated with the output from phase_common and/or ligate where some runs succeed and some fail. There seems to be some rare corner case that is triggering this error. In plain language, what exactly is causing this error? (I can't really read C++ well) If I am able to understood why this error gets triggered, then I will have a better chance of tracking down the culprit in the data and fixing and/or testing for it upstream. Thanks, Bob

odelaneau commented 1 year ago

Okay, I would like to investigate this one a bit further. Can you email me and share some data with me? SO that I can reproduce this error? My email is in the S5 verbose header.

schnabelr commented 1 year ago

I'm in the process of generating the data for you to test this. Since this appears to be a rare corner case, but it it reproducible once it occurs, I decided to try changing the seed. Interestingly, changing the seed allowed this to complete successfully. Hopefully that will help tracking down where/how this is occurring.

justinjj24 commented 7 months ago

I'm in the process of generating the data for you to test this. Since this appears to be a rare corner case, but it it reproducible once it occurs, I decided to try changing the seed. Interestingly, changing the seed allowed this to complete successfully. Hopefully that will help tracking down where/how this is occurring.

@schnabelr can you tell what's seed value (--seed arg (=15052011) Seed of the random number generator) and (--window-cm arg (=4) Minimal window size in cM, --window-mb arg (=4) Minimal window size in Mb) you tried to overcome the failure?

LindoNkambule commented 1 month ago

Hi @odelaneau

I'm re-opening this issue as I would like to get your thoughts on how to proceed when creating a reference panel and some of the recommended filters result in a large number of common variants beings dropped.

Background: we've jointly called HGDP+1kGP datasets and released the phased, using SHAPEIT5, haplotypes. We've however gotten a few inquiries about the number of common variants previously found in other GWAS that are not in our reference panel. Most of these variants were removed by the HWE filter (p<1e-30) that was recommended as we were getting Assertion GRvar_genotypes[vr][tidx].prob >= 0.0f failed error when phasing rare variants. We have tried the following HWE filtering strategies but none of them worked:

Do you think it is possible keep these variants that are getting filtered out by the HWE filter without running into the issue when phasing rare variants?

Thanks, Lindo, cc @taboltz