DanJeffries opened 2 weeks ago
Hi @DanJeffries, two questions for you:
1. Have you already trained a model to the end, and completed a variant calling + hap.py evaluation? At the end of the day, you'll want to know the variant calling accuracy against a metric that makes the most sense to you. If your truth VCF and confident regions are high quality, I'd suggest evaluating directly with hap.py (see the sketch after this list). Or, if you want to use another metric such as Mendelian violation rate as a check, that might be a good idea too. The class0 (hom_ref) formulation is internal to DeepVariant, and might not be the best way to tell how the training has gone.
2. How good are your truth set and confident regions? One possibility: if there are many real variants in your confident regions that are missing from your truth VCF, DeepVariant might call them as variants, which could then contribute to a lower "recall" of homrefs. If you know your truth VCF and confident regions are high quality, then this should not be the issue.
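For reference, a minimal sketch of a hap.py run along these lines. The file names are placeholders (not the actual paths from this issue), and `--engine vcfeval` is optional (it needs rtg-tools installed):

```bash
# Compare the retrained model's calls against the truth set, restricted to the
# confident regions. All file names below are illustrative placeholders.
# Per-type precision/recall/F1 end up in happy_eval.summary.csv.
hap.py truth.vcf.gz deepvariant_calls.vcf.gz \
  -r reference.fasta \
  -f confident_regions.bed \
  -o happy_eval \
  --engine vcfeval   # optional: haplotype-aware matching via rtg vcfeval
```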
Hi @pichuan ,
Thanks for the quick response! Regarding your questions:
I have completed training and run some test calls, though this was mainly to check that the models were vaguely sensible. I ran hap.py, but didn't spend too much time evaluating the results because I am not yet finished optimising the training. I take your point, though, that real-world metrics will be more useful than the internal stats.
As we are not working in a model organism, our truth set is unlikely to be of the same quality as, say, a human one, though my hope is that it is good enough. We defined confident regions and truth variants via some relatively strict alignment-quality filters plus Mendelian segregation patterns (the training data are from 5 trios), but did no further validation, so I certainly think it is possible that there are real variants in the confident regions that are not in the truth VCF. I can see how this would lower hom_ref recall, so I will explore this and see how many such variants there might be.
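One way to get a rough count of calls that fall inside the confident regions but are absent from the truth VCF would be a bcftools isec comparison. This is just a sketch, assuming both VCFs are bgzipped and tabix-indexed; the file names are placeholders:

```bash
# Sites present in the DeepVariant calls but missing from the truth VCF,
# restricted to the confident regions. File names are placeholders.
bcftools isec -C \
  -T confident_regions.bed \
  -p isec_out \
  deepvariant_calls.vcf.gz truth.vcf.gz

# With -C, isec_out/0000.vcf holds the sites private to the first file; count them:
grep -vc '^#' isec_out/0000.vcf
```

Of course, many of these sites could be genuine false positives rather than missing truth variants, so this would only give a rough upper bound on the problem.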
I'll post back here once I have explored these points further.
Thanks!
Dan
Thank you for your update! I'm very curious to see how this goes. Let me know if there's anything else we can help with.
Dear Devs,
I am currently training a model (starting from wgs.1.6.1) for use in a fish species. The programs are running well, I have confident regions and truth variants defined, and I am now tuning hyperparameters to optimise the training.
However... I notice, when tracking the model eval stats (specifically f1, precision, and recall), that the hom_ref classifications are much less reliable than the hom_alt and het classes. My question is whether this is to be expected, or whether there might be something wrong with my training setup, or perhaps with the examples.
The test example set I am using to tune the hyperparams looks like this:
The training command looks like this:
During other tests I have run training jobs with several other example sets (several times larger), for tens of thousands of steps and multiple epochs, and also using different learning rates and batch sizes. While these things of course make a difference to learning performance, the lower recall for class 0 (hom_ref) remains consistent.
Here are some lines from the log file during one such training run:
Thanks in advance for your help!
Dan