DanJeffries opened 2 weeks ago
Hi @DanJeffries, two questions for you:
1. Have you already trained a model to the end, and completed a variant calling + hap.py evaluation? At the end of the day, you'll want to know the variant calling accuracy against a metric that makes the most sense to you. If your truth VCF and confident regions are high quality, I'd suggest evaluating directly with hap.py (see the sketch after this list). Or, if you want to use another metric such as Mendelian violation rate as a check, that might be a good idea too. The class0 (hom_ref) formulation is internal to DeepVariant, and might not be the best way to tell how the training has gone.
2. How good are your truth set and confident regions? One possibility: if there are many real variants in your confident regions that are missing from your truth VCF, DeepVariant might call them as variants, which could then contribute to a lower "recall" of homrefs. If you know your truth VCF and confident regions are high quality, then this should not be the issue.
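For reference, a minimal sketch of a hap.py run along these lines. The file names are placeholders (not the actual paths from this issue), and `--engine vcfeval` is optional (it needs rtg-tools installed):

```bash
# Compare the retrained model's calls against the truth set, restricted to the
# confident regions. All file names below are illustrative placeholders.
# Per-type precision/recall/F1 end up in happy_eval.summary.csv.
hap.py truth.vcf.gz deepvariant_calls.vcf.gz \
  -r reference.fasta \
  -f confident_regions.bed \
  -o happy_eval \
  --engine vcfeval   # optional: haplotype-aware matching via rtg vcfeval
```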
Hi @pichuan ,
Thanks for the quick response! Regarding your questions:
I have completed training and run some test calls, though this was mainly to check that the models were vaguely sensible. I ran hap.py, but didn't spend too much time evaluating the results because I am not yet finished optimising the training. I take your point, though, that real-world metrics will be more useful than the internal stats.
As we are not working in a model organism, our truth set is unlikely to be of the same quality as, say, a human one, though my hope is that it is good enough. We defined confident regions and truth variants via some relatively strict alignment-quality filters plus Mendelian segregation patterns (the training data are from 5 trios), but did no further validation, so I certainly think it is possible that there are real variants in the confident regions that are not in the truth VCF. I can see how this would lower hom_ref recall, so I will explore this and see how many such variants there might be.
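One way to get a rough count of calls that fall inside the confident regions but are absent from the truth VCF would be a bcftools isec comparison. This is just a sketch, assuming both VCFs are bgzipped and tabix-indexed; the file names are placeholders:

```bash
# Sites present in the DeepVariant calls but missing from the truth VCF,
# restricted to the confident regions. File names are placeholders.
bcftools isec -C \
  -T confident_regions.bed \
  -p isec_out \
  deepvariant_calls.vcf.gz truth.vcf.gz

# With -C, isec_out/0000.vcf holds the sites private to the first file; count them:
grep -vc '^#' isec_out/0000.vcf
```

Of course, many of these sites could be genuine false positives rather than missing truth variants, so this would only give a rough upper bound on the problem.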
I'll post back here once I have explored these points further.
Thanks!
Dan
Thank you for your update! I'm very curious to see how this goes. Let me know if there's anything else we can help with.
Dear Devs,
I am currently training a model (starting from wgs.1.6.1) for use in a fish species. The programs are running well, I have confident regions and truth variants defined, and I am now tuning hyperparameters to optimise the training.
However... I notice, when tracking the model eval stats (specifically f1, precision, and recall), that the hom_ref classifications are much less reliable than the hom_alt and het classes. My question is whether this is to be expected, or whether there might be something wrong with my training setup, or perhaps with the examples.
The test example set I am using to tune the hyperparams looks like this:
The training command looks like this:
During other tests I have run training jobs with several other example sets (several times larger), for tens of thousands of steps and multiple epochs, and also using different learning rates and batch sizes. While these things of course make a difference to learning performance, the lower recall for class 0 (hom_ref) remains consistent.
Here are some lines from the log file during one such training run:
Thanks in advance for your help!
Dan