google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

No evaluation statistics for het and homalt calls during model training #876

Closed DanJeffries closed 2 months ago

DanJeffries commented 2 months ago

Hello DV team, and thanks for creating such a great tool!

I am currently trying to retrain the WGS model for a new species (a fish). However, during training I see no evaluation statistics (precision, recall, F1) for either het or homalt calls; more specifically, they are all 0.0. Eval stats are reported for homref, though. I have now tried running the training several times with different hyperparameters, but so far there has been no change in the het or homalt eval stats.

My first, very simple question is therefore: are these eval stats truly 0 (i.e. the model is very bad), or is 0.0 a starting value reported when there are not yet enough data to calculate them? I am warm-starting from the 1.6.1 WGS model, so I can't imagine the model is really that bad at calling variants initially, even in a fish.

Setup

Running on a university computing cluster (https://hpc-unibe-ch.github.io/)
OS: Rocky 9.3 Blue Onyx
GPU: RTX 4090
Installation: running from the Docker image via Singularity
DV version: 1.6.1

Data

Training on examples from 5 individuals; Illumina NovaSeq data, ~20x coverage.
17/21 chromosomes used for training (~1.45M examples)
2/21 chromosomes used for tuning (~200k examples)
2/21 chromosomes reserved for testing
(Different chromosomes used for train/tune/test across samples - see below)

(Attached screenshot "Screenshot 2024-08-07 at 09 30 23": train/tune/test chromosome assignment per sample)

Shuffling

Examples were generated with downsampling=0.5 (in addition to full coverage) and shuffled globally across samples, chromosomes, and downsampling levels.
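
For context, the global shuffle used DeepVariant's shuffle_tfrecords_beam.py. A rough sketch of that kind of invocation (flag names follow the training case study and the comments the script writes into the pbtxt; the paths below are placeholders, not my exact command):

python3 shuffle_tfrecords_beam.py \
--input_pattern_list="/home/examples/train/*/training_examples.tfrecord-000*-of-00020.gz" \
--output_pattern_prefix="/home/examples_shuffled/train/All_samples_training_examples.shuffled" \
--output_dataset_config_pbtxt="/home/examples_shuffled/train/All_samples_training_examples.dataset_config.pbtxt" \
--output_dataset_name="Shuffle_global" \
--runner=DirectRunner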

Command

My latest training run was like so:

apptainer run \
--nv \
-B $WD:/home \
$DV_PATH \
/opt/deepvariant/bin/train \
--config=/home/dv_config.py:base \
--config.train_dataset_pbtxt="/home/examples_shuffled/train/All_samples_training_examples.dataset_config.pbtxt" \
--config.tune_dataset_pbtxt="/home/examples_shuffled/tune/All_samples_tune_examples.dataset_config.pbtxt" \
--config.num_epochs=1 \
--config.learning_rate=0.0001 \
--config.num_validation_examples=0 \
--config.tune_every_steps=2000 \
--experiment_dir=/home/${OUTDIR} \
--strategy=mirrored \
--config.batch_size=64 \
--config.init_checkpoint="/home/model_wgs_v1.6.1/deepvariant.wgs.ckpt"

Previous runs used a higher learning rate (0.01) and batch size (128). Training proceeds as follows:

Training examples: 1454377
Batch size: 64
Epochs: 1
Steps per epoch: 22724
Steps per tune: 3162
Num train steps: 22724
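
These step counts follow directly from the example counts and the batch size (integer division):

echo $(( 1454377 / 64 ))   # 22724 train steps per epoch
echo $(( 202421 / 64 ))    # 3162 tune steps (num_validation_examples=0, so the full tune set is used)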

Log file

Here is the top of the log file, including some warnings in case they are relevant:

/usr/local/lib/python3.8/dist-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning:

TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP).

For more information see: https://github.com/tensorflow/addons/issues/2807

  warnings.warn(
2024-08-28 10:40:42.588215: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_SYSTEM_DRIVER_MISMATCH: system has unsupported display driver / cuda driver combination
I0828 10:40:42.589054 140318776715072 train.py:92] Running with debug=False
I0828 10:40:42.589343 140318776715072 train.py:100] Use TPU at local
I0828 10:40:42.589422 140318776715072 train.py:103] experiment_dir: /home/training_outs/epoch1/
WARNING:tensorflow:There are non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
W0828 10:40:42.596594 140318776715072 cross_device_ops.py:1387] There are non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
I0828 10:40:42.658734 140318776715072 mirrored_strategy.py:374] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
/usr/local/lib/python3.8/dist-packages/keras/applications/inception_v3.py:138: UserWarning: This model usually expects 1 or 3 input channels. However, it was passed an input_shape with 7 input channels.
  input_shape = imagenet_utils.obtain_input_shape(
I0828 10:40:47.952382 140318776715072 keras_modeling.py:325] Number of l2 regularizers: 95.
I0828 10:40:48.007193 140318776715072 keras_modeling.py:362] inceptionv3: load_weights from checkpoint: /home/training_outs/epoch1//checkpoints/ckpt-5997
I0828 10:40:49.193293 140318776715072 train.py:191] Exponential Decay: initial_learning_rate=0.0001
 decay_steps=45448
 learning_rate_decay_rate=0.947
I0828 10:40:49.193522 140318776715072 train.py:203] Use LinearWarmup:
 warmup_steps=10000
 warmup_learning_rate=1e-05
I0828 10:40:49.401860 140318776715072 keras_modeling.py:472] Restored checkpoint ckpt-5997 at step=0. tune/f1_weighted=tf.Tensor(0.0, shape=(), dtype=float32)
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.
Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089
W0828 10:40:49.488072 140318776715072 deprecation.py:350] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be rem>
Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089
2024-08-28 10:40:49.893797: W tensorflow/core/framework/dataset.cc:769] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
I0828 10:40:49.947995 140318776715072 train.py:316]

And here is an excerpt from a later portion of the log file, including some training and tuning steps, where you can see the 0.0 values for the het and homalt eval stats.

I0829 07:13:59.098341 140305134778112 logging_writer.py:48] [13700] epoch=0, train/categorical_accuracy=1.0, train/categorical_crossentropy=0.552009105682373, train/f1_het=0.0, train/f1_homalt=0.0, train/f1_homref=1.0, train/f1_macro=0.3333333432674408, train/f1_micro=1.0, train/f1_
weighted=1.0, train/false_negatives=0.0, train/false_positives=0.0, train/learning_rate=9.999999747378752e-05, train/loss=0.5520098805427551, train/precision=1.0, train/precision_het=0.0, train/precision_homalt=0.0, train/precision_homref=1.0, train/recall=1.0, train/recall_het=0.0,
 train/recall_homalt=0.0, train/recall_homref=1.0, train/true_negatives=12800.0, train/true_positives=6400.0
I0829 07:18:37.609363 140318776715072 local.py:41] Setting work unit notes: 0.3 steps/s, 60.6% (13778/22724), ETA: 8h52m
I0829 07:18:37.611700 140305134778112 logging_writer.py:48] [13778] steps_per_sec=0.280118
I0829 07:18:37.611787 140305134778112 logging_writer.py:48] [13778] uptime=74267.6
I0829 07:19:56.116713 140305134778112 logging_writer.py:48] [13800] epoch=0, train/categorical_accuracy=1.0, train/categorical_crossentropy=0.5520044565200806, train/f1_het=0.0, train/f1_homalt=0.0, train/f1_homref=1.0, train/f1_macro=0.3333333432674408, train/f1_micro=1.0, train/f1_weighted=1.0, train/false_negatives=0.0, train/false_positives=0.0, train/learning_rate=9.999999747378752e-05, train/loss=0.5520052909851074, train/precision=1.0, train/precision_het=0.0, train/precision_homalt=0.0, train/precision_homref=1.0, train/recall=1.0, train/recall_het=0.0, train/recall_homalt=0.0, train/recall_homref=1.0, train/true_negatives=12800.0, train/true_positives=6400.0
I0829 07:23:41.076397 140318776715072 local.py:41] Setting work unit notes: 0.3 steps/s, 61.0% (13863/22724), ETA: 8h47m
I0829 07:23:41.078737 140305134778112 logging_writer.py:48] [13863] steps_per_sec=0.280096
I0829 07:23:41.078827 140305134778112 logging_writer.py:48] [13863] uptime=74571.1
I0829 07:25:53.180995 140305134778112 logging_writer.py:48] [13900] epoch=0, train/categorical_accuracy=1.0, train/categorical_crossentropy=0.5519936680793762, train/f1_het=0.0, train/f1_homalt=0.0, train/f1_homref=1.0, train/f1_macro=0.3333333432674408, train/f1_micro=1.0, train/f1_weighted=1.0, train/false_negatives=0.0, train/false_positives=0.0, train/learning_rate=9.999999747378752e-05, train/loss=0.5519945025444031, train/precision=1.0, train/precision_het=0.0, train/precision_homalt=0.0, train/precision_homref=1.0, train/recall=1.0, train/recall_het=0.0, train/recall_homalt=0.0, train/recall_homref=1.0, train/true_negatives=12800.0, train/true_positives=6400.0
I0829 07:28:44.566404 140318776715072 local.py:41] Setting work unit notes: 0.3 steps/s, 61.4% (13948/22724), ETA: 8h42m
I0829 07:28:44.568708 140305134778112 logging_writer.py:48] [13948] steps_per_sec=0.280075
I0829 07:28:44.568793 140305134778112 logging_writer.py:48] [13948] uptime=74874.6
I0829 07:31:25.151819 140318776715072 train.py:361] Running tune at step=13993 epoch=0
I0829 07:31:25.152109 140318776715072 train.py:366] Tune step 0 / 3162 (0.0%)
I0829 07:33:17.573163 140318776715072 train.py:366] Tune step 100 / 3162 (0.0%)
I0829 07:35:10.013494 140318776715072 train.py:366] Tune step 200 / 3162 (10.0%)
I0829 07:37:02.497336 140318776715072 train.py:366] Tune step 300 / 3162 (10.0%)
I0829 07:38:54.834164 140318776715072 train.py:366] Tune step 400 / 3162 (10.0%)
I0829 07:40:47.319165 140318776715072 train.py:366] Tune step 500 / 3162 (20.0%)
I0829 07:42:39.802007 140318776715072 train.py:366] Tune step 600 / 3162 (20.0%)
I0829 07:44:32.297624 140318776715072 train.py:366] Tune step 700 / 3162 (20.0%)
I0829 07:46:24.658610 140318776715072 train.py:366] Tune step 800 / 3162 (30.0%)
I0829 07:48:17.176530 140318776715072 train.py:366] Tune step 900 / 3162 (30.0%)
I0829 07:50:09.700463 140318776715072 train.py:366] Tune step 1000 / 3162 (30.0%)
I0829 07:52:02.226121 140318776715072 train.py:366] Tune step 1100 / 3162 (30.0%)
I0829 07:53:54.613348 140318776715072 train.py:366] Tune step 1200 / 3162 (40.0%)
I0829 07:55:47.134974 140318776715072 train.py:366] Tune step 1300 / 3162 (40.0%)
I0829 07:57:39.682815 140318776715072 train.py:366] Tune step 1400 / 3162 (40.0%)
I0829 07:59:32.215537 140318776715072 train.py:366] Tune step 1500 / 3162 (50.0%)
I0829 08:01:24.651632 140318776715072 train.py:366] Tune step 1600 / 3162 (50.0%)
I0829 08:03:17.188146 140318776715072 train.py:366] Tune step 1700 / 3162 (50.0%)
I0829 08:05:09.741266 140318776715072 train.py:366] Tune step 1800 / 3162 (60.0%)
I0829 08:07:02.262498 140318776715072 train.py:366] Tune step 1900 / 3162 (60.0%)
I0829 08:08:54.673932 140318776715072 train.py:366] Tune step 2000 / 3162 (60.0%)
I0829 08:10:47.221370 140318776715072 train.py:366] Tune step 2100 / 3162 (70.0%)
I0829 08:12:39.774174 140318776715072 train.py:366] Tune step 2200 / 3162 (70.0%)
I0829 08:14:32.322385 140318776715072 train.py:366] Tune step 2300 / 3162 (70.0%)
I0829 08:16:24.722720 140318776715072 train.py:366] Tune step 2400 / 3162 (80.0%)
I0829 08:18:17.252759 140318776715072 train.py:366] Tune step 2500 / 3162 (80.0%)
I0829 08:20:09.823046 140318776715072 train.py:366] Tune step 2600 / 3162 (80.0%)
I0829 08:22:02.367495 140318776715072 train.py:366] Tune step 2700 / 3162 (90.0%)
I0829 08:23:54.783612 140318776715072 train.py:366] Tune step 2800 / 3162 (90.0%)
I0829 08:25:47.336242 140318776715072 train.py:366] Tune step 2900 / 3162 (90.0%)
I0829 08:27:39.775715 140318776715072 train.py:366] Tune step 3000 / 3162 (90.0%)
I0829 08:29:29.592094 140318776715072 train.py:366] Tune step 3100 / 3162 (100.0%)
I0829 08:30:42.583051 140305134778112 logging_writer.py:48] [13993] tune/categorical_accuracy=0.9916982650756836, tune/categorical_crossentropy=0.560210645198822, tune/f1_het=0.0, tune/f1_homalt=0.0, tune/f1_homref=0.9958318471908569, tune/f1_macro=0.33194395899772644, tune/f1_micro=0.9916982650756836, tune/f1_weighted=0.9958318471908569, tune/false_negatives_1=1777.0, tune/false_positives_1=1544.0, tune/loss=0.5603554248809814, tune/precision_1=0.9923615455627441, tune/precision_het=0.0, tune/precision_homalt=0.0, tune/precision_homref=1.0, tune/recall_1=0.9912189841270447, tune/recall_het=0.0, tune/recall_homalt=0.0, tune/recall_homref=0.9912189841270447, tune/true_negatives_1=403192.0, tune/true_positives_1=200591.0
I0829 08:30:42.590469 140318776715072 train.py:471] Skipping checkpoint with tune/f1_weighted=0.99583185 < previous best tune/f1_weighted=0.99845344
I0829 08:30:42.595992 140305134778112 logging_writer.py:48] [13993] tune/early_stopping=7
I0829 08:30:46.123329 140318776715072 local.py:41] Setting work unit notes: 0.0 steps/s, 61.6% (13994/22724), ETA: 8d4h11m
I0829 08:30:46.125013 140305134778112 logging_writer.py:48] [13994] steps_per_sec=0.0123604
I0829 08:30:46.125087 140305134778112 logging_writer.py:48] [13994] uptime=78596.1
I0829 08:31:07.673585 140305134778112 logging_writer.py:48] [14000] epoch=0, train/categorical_accuracy=1.0, train/categorical_crossentropy=0.5519920587539673, train/f1_het=0.0, train/f1_homalt=0.0, train/f1_homref=1.0, train/f1_macro=0.3333333432674408, train/f1_micro=1.0, train/f1_weighted=1.0, train/false_negatives=0.0, train/false_positives=0.0, train/learning_rate=9.999999747378752e-05, train/loss=0.551992654800415, train/precision=1.0, train/precision_het=0.0, train/precision_homalt=0.0, train/precision_homref=1.0, train/recall=1.0, train/recall_het=0.0, train/recall_homalt=0.0, train/recall_homref=1.0, train/true_negatives=12800.0, train/true_positives=6400.0

I am new to deep learning and am struggling to decide whether something is wrong with my training approach/scripts or whether the model just needs more time / different hyperparameters. Given the number of examples, I can only run 1 epoch at a time before I hit the 24-hour cluster wall-time limit, so I have only trained for around 30,000 steps in total across 2 epochs so far (the second epoch starting from the last checkpoint of the first).

All advice much appreciated!

kishwarshafin commented 2 months ago

Hi @DanJeffries ,

Can you please paste the output for the following command here:

cat /home/examples_shuffled/train/All_samples_training_examples.dataset_config.pbtxt

and also separately run the following and paste the output:

cat /home/examples_shuffled/tune/All_samples_tune_examples.dataset_config.pbtxt
DanJeffries commented 2 months ago

Hi @kishwarshafin ,

Sure. Just a quick note first to explain the outputs, since it may be relevant to the problem. Given the number of examples, I couldn't easily perform the shuffling step on my local cluster (using DirectRunner) due to memory and wall-time limits, so I performed a two-step shuffle: I split the examples in half (parts 1 and 2), shuffled each half, then randomly re-split the outputs of those first shuffles into two new halves (parts 3 and 4) and ran a second round of shuffling.

I then edited the path in the pbtxt file so that it matches all of the output file names. All_samples_training_examples.dataset_config.pbtxt now contains the following:

 > cat /home/examples_shuffled/train/All_samples_training_examples.dataset_config.pbtxt

# Generated by shuffle_tfrecords_beam.py
# class0: 1454377

name: "Shuffle_global"
tfrecord_path: "/home/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt?-?????-of-?????.tfrecord.gz"
num_examples: 1454377

#name: "Shuffle_global"
#tfrecord_path: "/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-?????-of-?????.tfrecord.gz"
#num_examples: 727189

#
# --input_pattern_list=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_inputs/All_samples_all_training_examples_inc_downsampled_05_pt3.shuffled-000*-of-00020.tfrecord.gz
# --output_pattern_prefix=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3
#

# Generated by shuffle_tfrecords_beam.py

#name: "Shuffle_global"
#tfrecord_path: "/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-?????-of-?????.tfrecord.gz"
#num_examples: 727188
# class0: 727188
#
# --input_pattern_list=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_inputs/All_samples_all_training_examples_inc_downsampled_05_pt4.shuffled-000*-of-00020.tfrecord.gz
# --output_pattern_prefix=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4
#

I assumed that the commented-out lines are not read, so I added some extra comments to keep track of the various shuffling steps.

and FYI,

ls /home/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt?-?????-of-?????.tfrecord.gz 

gives the correct set of shuffled files:

./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00000-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00001-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00002-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00003-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00004-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00005-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00006-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00007-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00008-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00009-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00010-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00011-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00012-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00013-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00014-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00015-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00016-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00017-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00018-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00019-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00000-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00001-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00002-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00003-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00004-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00005-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00006-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00007-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00008-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00009-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00010-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00011-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00012-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00013-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00014-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00015-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00016-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00017-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00018-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00019-of-00020.tfrecord.gz

And for the tuning set (which was shuffled normally using just one step):

>cat ./examples_shuffled/tune/All_samples_tune_examples.dataset_config.pbtxt

# Generated by shuffle_tfrecords_beam.py

name: "Shuffle_global"
tfrecord_path: "/home/examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-?????-of-?????.tfrecord.gz"
num_examples: 202421
# class0: 202421
#
# --input_pattern_list=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training//examples/tune_all/*tune_examples*tfrecord-000*-of-00040
# --output_pattern_prefix=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training//examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled
#

and

ls /home/examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-?????-of-?????.tfrecord.gz

gives the desired file list again:

./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00000-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00001-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00002-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00003-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00004-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00005-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00006-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00007-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00008-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00009-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00010-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00011-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00012-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00013-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00014-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00015-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00016-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00017-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00018-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00019-of-00020.tfrecord.gz

Hope that's useful!

kishwarshafin commented 2 months ago

@DanJeffries ,

I don't think shuffling is the issue in your training set. You can see that this is the class distribution for your training data:

#name: "Shuffle_global"
#tfrecord_path: "/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-?????-of-?????.tfrecord.gz"
#num_examples: 727188
# class0: 727188
#
# --input_pattern_list=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_inputs/All_samples_all_training_examples_inc_downsampled_05_pt4.shuffled-000*-of-00020.tfrecord.gz
# --output_pattern_prefix=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4
#

This means all of the examples you created are of class 0 (homref). Can you check whether your truth VCF contains 1/1 and 0/1 variants?
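
For example, a quick way to tally genotypes directly from the truth VCF (truth.vcf.gz is a placeholder path here):

bcftools query -f '[%GT\n]' truth.vcf.gz | sort | uniq -c
# you should see non-zero counts for 0/1 (het) and 1/1 (homalt), not only 0/0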

kishwarshafin commented 2 months ago

For example, on a training set on our end, the pbtxt file's output looks like this:

# Classes:
#   class 0:    855059086
#   class 1:    1443187583
#   class 2:    947352461

# Indel or SNP:
#   Indel: 1227997460
#   SNP: 2017601670

Can you run bcftools stats on your truth set?
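
Something like this shows the per-sample breakdown (the PSC section of the output reports nRefHom, nNonRefHom and nHets; truth.vcf.gz is a placeholder):

bcftools stats -s - truth.vcf.gz | grep -E '^# PSC|^PSC'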

DanJeffries commented 2 months ago

Hi @kishwarshafin ,

Ah, OK, that explains it. Yes, I definitely have variants in my VCF - between 850k and 1.15M for each of the 5 samples. Here is the output of bcftools stats for one of the samples:

SR_male_1_alltruth.stats.gz

Perhaps it is my make_examples command? I made examples for each sample separately using the command below.

apptainer run \
-B $WD:/wd \
$DV_PATH  \
parallel -q --halt 2 --line-buffer \
/opt/deepvariant/bin/make_examples \
--mode training \
--ref $REF \
--reads /wd/bams/${SAMPLE}.fixmate.coordsorted.bam \
--truth_variants /wd/Filtered_variants/${SAMPLE}.ALL_TRUTH_VARS.CORRECTED.vcf.gz \
--confident_regions /wd/Confident_regions/${SAMPLE_BED_NAME}.conf.bed \
--examples /wd/examples/train/${SAMPLE}/training_examples.tfrecord@20 \
--regions /wd/training_regions/${CROSS}_train_partitions.bed \
--channels "insert_size" \
--task {} ::: `seq 0 19` #split the task into 20 jobs

And here's an excerpt from this sample's VCF:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SR_male_1
NC_053212.1_chromosome_1        1449    .       C       T,<NON_REF>     347.6   .       .       GT:AD:DP:GQ:PL:SB       0/1:11,10,0:21:99:355,0,328,388,358,746:4,7,3,7
NC_053212.1_chromosome_1        2214    .       T       TTGTTTGAC,<NON_REF>     750.06  .       .       GT:AD:DP:GQ:PL:SB       1/1:0,15,0:15:51:764,51,0,765,51,765:0,0,2,13
NC_053212.1_chromosome_1        4741    .       C       T,<NON_REF>     807.03  .       .       GT:AD:DP:GQ:PL:SB       1/1:0,21,0:21:63:821,63,0,821,63,821:0,0,9,12
NC_053212.1_chromosome_1        5560    .       G       A,<NON_REF>     1001.03 .       .       GT:AD:DP:GQ:PL:SB       1/1:0,27,0:27:81:1015,81,0,1015,81,1015:0,0,19,8
NC_053212.1_chromosome_1        5876    .       C       A,<NON_REF>     501.6   .       .       GT:AD:DP:GQ:PL:SB       0/1:15,16,0:31:99:509,0,495,554,543,1097:3,12,7,9
NC_053212.1_chromosome_1        6440    .       C       T,<NON_REF>     701.03  .       .       GT:AD:DP:GQ:PL:SB       1/1:0,19,0:19:57:715,57,0,715,57,715:0,0,9,10
NC_053212.1_chromosome_1        8670    .       A       G,<NON_REF>     1198.03 .       .       GT:AD:DP:GQ:PGT:PID:PL:PS:SB    1|1:0,26,0:26:81:1|1:8670_A_G:1212,81,0,1212,81,1212:8670:>
NC_053212.1_chromosome_1        8682    .       G       C,<NON_REF>     1281.03 .       .       GT:AD:DP:GQ:PGT:PID:PL:PS:SB    1|1:0,29,0:29:87:1|1:8670_A_G:1295,87,0,1295,87,1295:8670:>
NC_053212.1_chromosome_1        12561   .       C       G,<NON_REF>     699.03  .       .       GT:AD:DP:GQ:PL:SB       1/1:0,19,0:19:57:713,57,0,713,57,713:0,0,7,12
NC_053212.1_chromosome_1        14290   .       G       GA,<NON_REF>    1253.06 .       .       GT:AD:DP:GQ:PL:SB       1/1:0,36,0:36:99:1267,108,0,1267,108,1267:0,0,20,16
NC_053212.1_chromosome_1        24265   .       A       G,<NON_REF>     1344.03 .       .       GT:AD:DP:GQ:PL:SB       1/1:0,35,0:35:99:1358,105,0,1358,105,1358:0,0,14,21
NC_053212.1_chromosome_1        24294   .       G       C,<NON_REF>     1547.03 .       .       GT:AD:DP:GQ:PGT:PID:PL:PS:SB    1|1:0,34,0:34:99:1|1:24294_G_C:1561,105,0,1561,105,1561:24>
NC_053212.1_chromosome_1        24307   .       GC      G,<NON_REF>     1422.03 .       .       GT:AD:DP:GQ:PGT:PID:PL:PS:SB    1|1:0,32,0:32:96:1|1:24294_G_C:1436,96,0,1436,96,1436:2429>
NC_053212.1_chromosome_1        26157   .       A       C,<NON_REF>     950.03  .       .       GT:AD:DP:GQ:PL:SB       1/1:0,25,0:25:75:964,75,0,964,75,964:0,0,9,16
NC_053212.1_chromosome_1        26296   .       A       G,<NON_REF>     1166.03 .       .       GT:AD:DP:GQ:PL:SB       1/1:0,30,0:30:90:1180,90,0,1180,90,1180:0,0,15,15
NC_053212.1_chromosome_1        27181   .       C       T,<NON_REF>     834.03  .       .       GT:AD:DP:GQ:PL:SB       1/1:0,23,0:23:69:848,69,0,848,69,848:0,0,13,10
NC_053212.1_chromosome_1        29086   .       T       A,<NON_REF>     851.03  .       .       GT:AD:DP:GQ:PL:SB       1/1:0,21,0:21:63:865,63,0,865,63,865:0,0,11,10
NC_053212.1_chromosome_1        31158   .       A       C,<NON_REF>     588.03  .       .       GT:AD:DP:GQ:PL:SB       1/1:0,15,0:15:45:602,45,0,602,45,602:0,0,7,8
NC_053212.1_chromosome_1        37394   .       TCCGGGGGTCCGGGCCCCCCCCCCCCC     T,<NON_REF>     886.03  .       .       GT:AD:DP:GQ:PGT:PID:PL:PS:SB    1|1:0,20,0:20:60:1|1:37379_T_A:900>
NC_053212.1_chromosome_1        39747   .       C       A,<NON_REF>     660.03  .       .       GT:AD:DP:GQ:PL:SB       1/1:0,17,0:17:51:674,51,0,674,51,674:0,0,13,4
NC_053212.1_chromosome_1        42506   .       C       T,<NON_REF>     1121.03 .       .       GT:AD:DP:GQ:PL:SB       1/1:0,28,0:28:84:1135,84,0,1135,84,1135:0,0,12,16
NC_053212.1_chromosome_1        46081   .       A       G,<NON_REF>     620.03  .       .       GT:AD:DP:GQ:PL:SB       1/1:0,16,0:16:48:634,48,0,634,48,634:0,0,4,12
NC_053212.1_chromosome_1        47173   .       G       A,<NON_REF>     1059.03 .       .       GT:AD:DP:GQ:PL:SB       1/1:0,29,0:29:87:1073,87,0,1073,87,1073:0,0,5,24
NC_053212.1_chromosome_1        47399   .       TTG     T,<NON_REF>     675.03  .       .       GT:AD:DP:GQ:PL:SB       1/1:0,19,0:19:57:689,57,0,689,57,689:0,0,16,3
NC_053212.1_chromosome_1        47570   .       G       C,<NON_REF>     385.6   .       .       GT:AD:DP:GQ:PL:SB       0/1:11,11,0:22:99:393,0,381,426,414,841:6,5,5,6
NC_053212.1_chromosome_1        47768   .       ATG     A,<NON_REF>     812.03  .       .       GT:AD:DP:GQ:PL:SB       1/1:0,22,0:22:66:826,66,0,826,66,826:0,0,9,13
NC_053212.1_chromosome_1        48014   .       CA      C,<NON_REF>     876.03  .       .       GT:AD:DP:GQ:PL:SB       1/1:0,24,0:24:72:890,72,0,890,72,890:0,0,16,8
NC_053212.1_chromosome_1        48426   .       A       G,<NON_REF>     780.03  .       .       GT:AD:DP:GQ:PL:SB       1/1:0,19,0:19:57:794,57,0,794,57,794:0,0,9,10
NC_053212.1_chromosome_1        50624   .       T       C,<NON_REF>     1021.03 .       .       GT:AD:DP:GQ:PGT:PID:PL:PS:SB    1|1:0,22,0:22:69:1|1:50616_C_T:1035,69,0,1035,69,1035:5061>
NC_053212.1_chromosome_1        50765   .       TC      T,<NON_REF>     1005.03 .       .       GT:AD:DP:GQ:PL:SB       1/1:2,26,0:28:64:1019,64,0,1024,78,1038:1,1,14,12
NC_053212.1_chromosome_1        50887   .       G       T,<NON_REF>     856.03  .       .       GT:AD:DP:GQ:PL:SB       1/1:0,23,0:23:69:870,69,0,870,69,870:0,0,10,13
NC_053212.1_chromosome_1        50971   .       A       T,<NON_REF>     699.03  .       .       GT:AD:DP:GQ:PL:SB       1/1:0,19,0:19:57:713,57,0,713,57,713:0,0,9,10
NC_053212.1_chromosome_1        51160   .       C       T,<NON_REF>     1100.03 .       .       GT:AD:DP:GQ:PL:SB       1/1:0,31,0:31:92:1114,92,0,1114,92,1114:0,0,10,21
NC_053212.1_chromosome_1        53199   .       TCA     T,<NON_REF>     767.03  .       .       GT:AD:DP:GQ:PL:SB       1/1:0,20,0:20:60:781,60,0,781,60,781:0,0,7,13

And I have attached the log file from this make_examples run for this sample:

MAKE_EX_TRAIN_3148249-3.err.gz

Thanks

Dan

kishwarshafin commented 2 months ago

@DanJeffries, sorry for the late reply - I was traveling for a conference.

Looking at the log, most of it looks like this:

I0812 17:25:36.970339 140640080795456 make_examples_core.py:301] Task 18/20: 130000 candidates (5465 examples) [24.32s elapsed]
I0812 17:25:38.358597 140641138153280 make_examples_core.py:301] Task 17/20: 136011 candidates (5980 examples) [24.30s elapsed]
I0812 17:25:38.452311 140719761319744 haplotype_labeler.py:449] Not including more because genotype_options_product will be 157464.0, which exceeds max(=100000)
I0812 17:25:39.808730 139930050549568 make_examples_core.py:301] Task 19/20: 130009 candidates (5415 examples) [19.65s elapsed]
I0812 17:25:40.161706 140719761319744 make_examples_core.py:301] Task 1/20: 130009 candidates (5656 examples) [26.55s elapsed]
I0812 17:25:39.762312 140656897058624 haplotype_labeler.py:449] Not including more because genotype_options_product will be 118098.0, which exceeds max(=100000)
I0812 17:25:40.178942 139929751582528 haplotype_labeler.py:449] Not including more because genotype_options_product will be 118098.0, which exceeds max(=100000)
I0812 17:25:41.815151 139685487241024 make_examples_core.py:301] Task 8/20: 134010 candidates (5603 examples) [20.32s elapsed]
I0812 17:25:42.062644 140656897058624 make_examples_core.py:301] Task 14/20: 130007 candidates (5538 examples) [26.62s elapsed]
I0812 17:25:42.944558 139916975953728 haplotype_labeler.py:449] Not including more because genotype_options_product will be 157464.0, which exceeds max(=100000)
I0812 17:25:42.945389 139916975953728 haplotype_labeler.py:449] Not including more because genotype_options_product will be 944784.0, which exceeds max(=100000)
I0812 17:25:42.940323 140366879631168 make_examples_core.py:301] Task 6/20: 132005 candidates (5726 examples) [25.32s elapsed]
I0812 17:25:43.504171 139929751582528 make_examples_core.py:301] Task 4/20: 132003 candidates (5481 examples) [23.80s elapsed]
I0812 17:25:43.563577 140585679882048 haplotype_labeler.py:449] Not including more because genotype_options_product will be 131220.0, which exceeds max(=100000)
I0812 17:25:45.208675 140012542068544 make_examples_core.py:301] Task 12/20: 134003 candidates (5624 examples) [21.31s elapsed]
I0812 17:25:45.249171 140585679882048 haplotype_labeler.py:449] Not including more because genotype_options_product will be 275562.0, which exceeds max(=100000)
I0812 17:25:44.861580 140290187421504 make_examples_core.py:301] Task 0/20: 134001 candidates (5695 examples) [22.80s elapsed]
I0812 17:25:44.957572 140719761319744 haplotype_labeler.py:449] Not including more because genotype_options_product will be 157464.0, which exceeds max(=100000)
I0812 17:25:44.749801 140624495118144 make_examples_core.py:301] Task 9/20: 132004 candidates (5675 examples) [26.91s elapsed]
I0812 17:25:45.658525 139725838563136 haplotype_labeler.py:449] Not including more because genotype_options_product will be 157464.0, which exceeds max(=100000)
I0812 17:25:48.367301 140203250202432 make_examples_core.py:301] Task 3/20: 130018 candidates (5401 examples) [24.78s elapsed]
I0812 17:25:49.316477 140120687957824 haplotype_labeler.py:449] Not including more because genotype_options_product will be 118098.0, which exceeds max(=100000)
I0812 17:25:50.556129 140585679882048 make_examples_core.py:301] Task 16/20: 132012 candidates (5581 examples) [21.74s elapsed]
I0812 17:25:51.780121 140641138153280 make_examples_core.py:301] Task 17/20: 138019 candidates (6057 examples) [13.42s elapsed]

It really looks like you have a lot of variants - is this expected for your sample? What's happening here is that the haplotype_labeler is trying to label the candidates but failing because there are far too many combinations, so it gives up. What truth set are you using for this sample?
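
As a quick check, counting how often the labeler gave up in the attached log (using the exact message shown above) would look something like:

zcat MAKE_EX_TRAIN_3148249-3.err.gz | grep -c 'Not including more because genotype_options_product'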

DanJeffries commented 2 months ago

Hi @kishwarshafin ,

No worries, thanks for finding the time to get back to me! And thanks for the explanation - I had read that line in the log file ("Not including more ...") as meaning it was still including some candidates.

So the truth data is generated from the offspring of 5 trios. Variants for the parents and offspring were called (using GATK4) against a reference, Mendelian expectations were checked for each locus, and the loci that passed that check, as well as some hard filters (e.g. depth, GQ, etc.), were kept for the truth set.

The variants in the truth set are the combination of all the variants that passed these filters in all 5 trios, which amounts to around 450,000 variants (split into ~350k for training and ~100k for tuning). The reference is from a different population, which will probably result in more homalt SNPs against the ref, but other than that I don't think this number of SNPs is particularly high for a 400Mb genome of a wild fish with relatively large population sizes. The 0.5x downsampling of course doubles this number, resulting in ~900,000 truth variants in total.

So, could a solution be to increase the maximum for genotype_options_product? Or would you suggest subsampling the truth variants?

Thanks! Dan

kishwarshafin commented 2 months ago

Hi @DanJeffries ,

To start with, can you add this to your make_examples command and see if it helps:

--labeler_algorithm=positional_labeler \

This should switch the labeling from haplotype to positional and it should solve this issue. Let me know if this works.

DanJeffries commented 2 months ago

Hi @kishwarshafin ,

I made the change you suggested. It didn't fix the issue; however, you did help point me to the real cause. It turns out that when I created the confident-regions BED file, I failed to include the positions of the variants (only confident hom-ref positions). Having gone back and added those positions, the examples now contain all three classes of site.
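
In case it helps others, roughly how the confident-regions BED can be rebuilt to include the truth sites (just a sketch - the hom-ref BED name is a placeholder, and it assumes bcftools and bedtools are available):

# truth variant positions as 0-based BED intervals
bcftools query -f '%CHROM\t%POS0\t%END\n' ${SAMPLE}.ALL_TRUTH_VARS.CORRECTED.vcf.gz > ${SAMPLE}.truth_sites.bed

# merge them with the confident hom-ref intervals
cat ${SAMPLE}.confident_homref.bed ${SAMPLE}.truth_sites.bed \
| sort -k1,1 -k2,2n \
| bedtools merge > ${SAMPLE_BED_NAME}.conf.bed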

One last question: given that the labeler was not the issue, should I switch back to using the haplotype-aware labeler? I read somewhere that this is the better approach, though I am not sure why.

Once I have heard from you on this last point I'll close the issue.

Thanks

Dan

kishwarshafin commented 2 months ago

Glad you figured it out! So, the haplotype_labeler is very good for labeling indels, especially in regions where you observe global alignment jitter. On the other hand, SNPs usually get placed at the correct position almost every time, so with the positional_labeler you will see more SNPs correctly annotated. There are pros and cons to both, but given that you have so many SNP variants, I think I'd try the positional_labeler for this use case if I were you, given how sophisticated the haplotype_labeler is and that it will fail to annotate in many variant-dense regions. But you are welcome to try them both; usually the difference should be minimal.

Sorry for the vague answer, but there's no single correct answer here. Hope your training goes well.

DanJeffries commented 2 months ago

Hi @kishwarshafin,

Ok thanks for the explanation.

Unfortunately, I am still having trouble. For some reason, when I use positional_labeler my make_examples jobs fail, but they complete successfully using haplotype_labeler. I can't figure out what the issue is, as the error message isn't particularly informative, at least not to me. I attach two log files from the make_examples step. Both are for the same sample; the only difference is that one uses haplotype_labeler (job succeeded) and the other uses positional_labeler (job failed).

It would be great to get your opinion on what is going on. Note that I have tested the positional labeler a few times and it does seem to work for one sample, but there is no reason that sample should be different from the others.

haplotype_labeler: MAKE_EX_TRAIN_NEW_4926611-3.err.gz

positional_labeler: MAKE_EX_TRAIN_NEW_4930167-3.err.gz

Thanks

Dan

kishwarshafin commented 2 months ago

So it looks like the VCF parsing is failing here:

I0911 20:23:50.839811 140713511122752 positional_labeler.py:163] Multiple matches detected; no good match found. Fall back to first. variant: reference_bases: "G"
alternate_bases: "A"
info {
  key: "BAM_FNAME"
  value {
    values {
      string_value: "SR_male_1.fixmate.coordsorted.bam"
    }
  }
}
calls {
  info {
    key: "AD"
    value {
      values {
        int_value: 0
      }
      values {
        int_value: 22
      }
    }
  }
  info {
    key: "DP"
    value {
      values {
        int_value: 22
      }
    }
  }
  info {
    key: "VAF"
    value {
      values {
        number_value: 1.0
      }
    }
  }
  genotype: -1
  genotype: -1
  call_set_name: "SR_male_1"
}
end: 14057936
reference_name: "NC_053213.1_chromosome_2"
start: 14057935
: matches: [reference_bases: "G"
alternate_bases: "A"
alternate_bases: "<NON_REF>"

Looks like your truth contains alleles like "<NON_REF>", and when the positional labeler tries to match against them, it fails. Is your truth VCF consistent with the regular VCF standards? You may need to filter non-standard calls like "<NON_REF>" as an alt allele from the truth to fix this.
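
For example, one way to drop ALT alleles that are never called in the sample (which removes <NON_REF>) would be something like (truth.vcf.gz is a placeholder):

# --trim-alt-alleles removes ALT alleles not seen in any genotype at that site
bcftools view --trim-alt-alleles -Oz -o truth.trimmed.vcf.gz truth.vcf.gz
tabix -p vcf truth.trimmed.vcf.gz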

DanJeffries commented 2 months ago

Hi @kishwarshafin ,

Yeah, these are standard VCFs, output by GenotypeGVCFs (GATK 4.1.3) and filtered with BCFtools.

After having done some more tests, I don't think the existence of <NON_REF> alleles is the issue, for several reasons: 1) these alleles occur many times in the VCF well before the job fails; 2) the job often fails on loci that do not have such alleles; 3) after removing them (and all ALT alleles not found in the sample in each VCF), the jobs still fail.

I cannot see any pattern in the loci reported in the log file at the point at which the job fails, and still, some jobs succeed. Anyway, I will keep digging and will close this issue, since the original question was solved.

Thanks for the help.