rwdavies / QUILT

GNU General Public License v3.0
45 stars 10 forks source link

Incorrect allele ("2") #3

Closed jasperlinthorst closed 2 years ago

jasperlinthorst commented 2 years ago

Hi Robert, I'm playing with QUILT to impute some low coverage samples using the HRC reference data.

With version 0.1.9 I run into the problem that some variant records contain incorrect GT fields, for example:

chr22 18173093 . G A . PASS EAF=1.411;INFO_SCORE=0.001;HWE=1;ERC=0.99992;EAC=3e-05;PAF=0.99997 GT:GP:DS:HD 2|0:0.169,-1.16,1.991:2.822:1.575,0.407

I guess, it should be 1|1, but it now makes a reference to allele '2' which doesn't exist.

With version 0.1.8 the output is not phased, but all GT fields are correct.

chr22 18173093 . G A . PASS EAF=1.422;INFO_SCORE=0;HWE=1;ERC=0.99992;EAC=3e-05;PAF=0.99997 GT:GP:DS:HD 1/1:0.169,-1.182,2.013:2.844:1.602,1.296

I hope it's just a small bug with respect to how the phased output is written, but if I'm misinterpreting things please let me know!

Kind regards, Jasper

rwdavies commented 2 years ago

Hi Jasper,

Thanks for the message. Somethings gone quite wrong here (on my end almost certainly), the GP is giving out of bound values (should be between 0 and 1 each and sum to 1). I suspect it’s something to do with the forward backward algorithm for the full run but I’m surprised I haven’t caught all those bugs yet through usage or tests. Beyond the GT values, looking at the GP value not summing to 1 and/or with negative values, is this issue prevalent throughout the file (other SNPs)? Is this just 1 sample with this problem or many? Does it reproduce if you rerun? What’s the sample coverage? Did you change any other settings beyond the obvious ones to give paths?

Thanks, Robbie

jasperlinthorst commented 2 years ago

Hi Robbie, thanks for your quick reply! Yes, I also saw this problem with other imputed SNPs and throughout different samples. I also noted the negative GP values, but I also noted that, while hard to interpret as a genotype probability, they do always add up to 1. Re-running yields slightly different results everytime of course, but throughout the entire file the problem remains. The sample coverages are mostly between 0.1x and 1x. I don't really use any special combination of options, but I dropped the genetic_map_file, posfile and phasefile options and added the output_filename. I do use the QUILT_prepare_reference setup, after which I impute all samples individually, so the bamlist just contains 1 sample everytime.

I'll do some more testing today, I'll let you know if I can figure out what goes wrong...

rwdavies commented 2 years ago

OK thanks for this, I'm going to dig into this, see if there are obvious things on my end, try running it through a lot of samples to see if I broke something recently, etc. Thanks for bringing this to my attention

rwdavies commented 2 years ago

One thing that might be useful, you can throw in a check right after here https://github.com/rwdavies/QUILT/blob/master/QUILT/R/functions.R#L1647 and then re-build and re-install using ./scripts/build-and-install/R that dosage has range between 0 and 1, and throw an error otherwise (something like the following)

        if ((min(dosage) < -1e-5) | (1 + 1e-5) < max(dosage)) {
            print(range(dosage))
            stop("Dosage out of bounds on forward-backward full iteration")
        }

I'll probably add that into master once I can figure out how to print better error messaging should that situation occur. Assuming that throws an error, something has indeed gone wrong with the full forward-backwards HMM

rwdavies commented 2 years ago

I think I have a more subtle version of this captured, I'll take a look and report back

jasperlinthorst commented 2 years ago

Thanks, I had to double check that the versions of STITCH and QUILT with which I prepared the reference etc. were all up-to-date before diving into the code. But that indeed seems to all be correct. Good that you managed to reproduce it... hope to hear from you soon.

On Thu, Jul 29, 2021 at 3:19 PM rwdavies @.***> wrote:

I think I have a more subtle version of this captured, I'll take a look and report back

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/rwdavies/QUILT/issues/3#issuecomment-889136974, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA7ZZXWXOK6AYLGAD3SDATT2FIMPANCNFSM5BCJFGNA .

rwdavies commented 2 years ago

I've found at least one problem that corrects my local problem. I'm hopeful that single bug explains it (it removes my bug), but I want to understand why my tests didn't catch this - something wrong or insufficient tests.

If you're curious, internally, I represent the reference haplotypes (0's and 1's) 32 as a time (so as an int). Because they're often the same across 50000 haplotypes, using nMaxDH (maximum number of distinct haplotypes, usually set around~250) most common and store those, in compressed and uncompressed format (the 250 of the,), and then the reference haplotypes for that 32 SNP are stored as an int of 1:nMaxDH. Now for some of those reference haplotypes, that won't work, as there are more than nMaxDH distinct reference haplotypes over a span of 32 SNPs, so those are analyzed separately in the code. For the paper, nMaxDH was set pretty high (the ~250), so there were very few haplotypes in the reference not captured, but I've since made it determined by QUILT and often use a lower value, so it runs faster. This error was in the HMM for the special case - I use a lot of speedups e.g. delaying multiplications in the C++ code to minimize the number of operations / performs operations more efficiently, but I forgot to port one of them over to the special cases. Hence, I think, why I didn't notice it during the paper but it's coming up pretty often for you now.

Anyway I hope that's it, I will investigate some more and get back once resolved. Thanks again for pointing this out.

rwdavies commented 2 years ago

OK I found another similar small bug in a similar vein, again only for special cases. These were not caught in the tests which weren't expansive enough (they catch it now). I'm going to re-run a bunch of samples and see if the issue comes up again. I've pushed what I think are the right changes with my last commit, so it hopefully should work as expected now!

rwdavies commented 2 years ago

OK I ran a bunch of jobs last night and they worked without issue. Are you able to pull the latest commits, install using ./scripts/build-and-install.R, and re-try your imputation jobs, and see if the issue is solved? You don't need to rebuild the reference

jasperlinthorst commented 2 years ago

Yes, i pulled the latest changes and am running it at the moment on our dataset. First 1000 no problem, so I think it indeed is fixed now, thanks a lot! I’ll let it run over the weekend and let you know.

On Fri, 30 Jul 2021 at 13:53, rwdavies @.***> wrote:

OK I ran a bunch of jobs last night and they worked without issue. Are you able to pull the latest commits, install using ./scripts/build-and-install.R, and re-try your imputation jobs, and see if the issue is solved? You don't need to rebuild the reference

— You are receiving this because you authored the thread.

Reply to this email directly, view it on GitHub https://github.com/rwdavies/QUILT/issues/3#issuecomment-889841337, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA7ZZS5EL4DZLAJKPAY35LT2KHCFANCNFSM5BCJFGNA .

rwdavies commented 2 years ago

OK thanks, I'll wait to hear from you, then if all goes well Monday I'll push a version change.

jasperlinthorst commented 2 years ago

All seems to work fine now, so I'll close the issue. A new release would be appreciated as well. Thanks a lot for your help! Jasper

rwdavies commented 2 years ago

Thanks I greatly appreciate it, I'll make a new release today or tomorrow