Closed nh13 closed 6 years ago
@nh13 Hi Nils,
Thanks for sharing these details. Looking at the parameters printed during each iteration, it looks like you've uncovered a bug in the EM procedure. Each of the population frequencies should be between 0 and 1, but there's some crazy numbers like 1.79769e+308 and 3071.9. In addition to this being an obvious bug, I suspect this issue is likely playing a role in the slow convergence.
Could you send me the required files to reproduce this issue? I'm optimistic that if I can debug on my end, it'll resolve the bug and the convergence issues. If the convergence issues remain after the bug fix, I think it makes sense to relax the stringency of the convergence criteria.
Thanks! Thomas
@tfwillems I have send you some tests cases for this issue via email. Please let me know if you did not receive them.
I also forgot to say that I saw a lot of very small probabilities, and so you may want to look at using log1p instead of log
and expm1
instead of exp
as they sometimes are more precise. A good reference for such implementations is in this R note. We found this helpful to implement in fgbio (see here).
Great @nh13, I'll dig into these when I get and chance and see whether the above suggestions are also helpful
Hi @nh13, I took a look at the examples you sent. I previously misinterpreted the logging output you sent my way. It turns out there was no bug, as each iteration two lines were logged and the second line doesn't pertain to the EM parameters and indicates the change in likelihood. So to make a long story short, it just looks like the likelihood at these loci changes very slowly, even though the stutter model parameters don't really change.
To address these cases, I've modified the EM procedure such that it triggers convergence if the parameters haven't changed since the last iteration. The resulting changes were added in commit df53cb03943c0dd77992055da9c08807b9f45f8b. This fixes the cases you sent and hopefully no future cases will arise.
Best, Thomas
I have in interesting case where the stutter model fails to converge. My goal is to force STR to genotype these sites, as I have a few false-negatives that have great coverage.
After increasing the values for
ABS_LL_CONVERGE
andFRAC_LL_CONVERGE
in genotyper_bam_processor, it correctly genotypes the locus.Would these be candidates for command line options or a configuration file?