Closed DanJeffries closed 2 months ago
Hi @DanJeffries ,
Can you please paste the output for the following command here:
cat /home/examples_shuffled/train/All_samples_training_examples.dataset_config.pbtxt
and also separately run the following and paste the output:
cat /home/examples_shuffled/tune/All_samples_tune_examples.dataset_config.pbtxt
Hi @kishwarshafin ,
Sure. Just a quick note first to explain the outputs, since it may be relevant to the problem. Given the number of examples, I couldn't easily perform the shuffling step on my local cluster (using DirectRunner) due to memory and wall-time limits, so I performed a 2-step shuffle: I split the examples in half (parts 1 and 2), shuffled each half, then randomly split the outputs of those first shuffles into two new halves (parts 3 and 4) and ran a second round of shuffling.
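The logic of the splits, as a toy Python sketch (not the actual Beam pipeline, just the shape of the two rounds):

```python
import random

def two_round_shuffle(examples, seed=0):
    """Toy model of the 2-step shuffle: split in half, shuffle each half,
    randomly re-split the round-1 outputs, then shuffle each new half."""
    rng = random.Random(seed)
    # Round 1: fixed split into parts 1 and 2, shuffle each part.
    mid = len(examples) // 2
    part1, part2 = examples[:mid], examples[mid:]
    rng.shuffle(part1)
    rng.shuffle(part2)
    # Round 2: randomly re-assign the round-1 outputs to parts 3 and 4,
    # then shuffle each of those again.
    pooled = part1 + part2
    rng.shuffle(pooled)
    part3, part4 = pooled[:mid], pooled[mid:]
    rng.shuffle(part3)
    rng.shuffle(part4)
    return part3 + part4
```

The output is a permutation of the input, so no examples are lost or duplicated by the two rounds.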
I then edited the path in the pbtxt file to accommodate all file names. So All_samples_training_examples.dataset_config.pbtxt now contains the following:
> cat /home/examples_shuffled/train/All_samples_training_examples.dataset_config.pbtxt
# Generated by shuffle_tfrecords_beam.py
# class0: 1454377
name: "Shuffle_global"
tfrecord_path: "/home/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt?-?????-of-?????.tfrecord.gz"
num_examples: 1454377
#name: "Shuffle_global"
#tfrecord_path: "/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-?????-of-?????.tfrecord.gz"
#num_examples: 727189
#
# --input_pattern_list=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_inputs/All_samples_all_training_examples_inc_downsampled_05_pt3.shuffled-000*-of-00020.tfrecord.gz
# --output_pattern_prefix=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3
#
# Generated by shuffle_tfrecords_beam.py
#name: "Shuffle_global"
#tfrecord_path: "/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-?????-of-?????.tfrecord.gz"
#num_examples: 727188
# class0: 727188
#
# --input_pattern_list=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_inputs/All_samples_all_training_examples_inc_downsampled_05_pt4.shuffled-000*-of-00020.tfrecord.gz
# --output_pattern_prefix=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4
#
I assumed that the commented-out lines are not read, so I added some extra comments to keep track of the various shuffling steps.
and FYI,
ls /home/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt?-?????-of-?????.tfrecord.gz
gives the correct set of shuffled files:
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00000-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00001-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00002-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00003-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00004-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00005-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00006-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00007-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00008-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00009-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00010-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00011-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00012-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00013-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00014-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00015-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00016-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00017-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00018-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt3-00019-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00000-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00001-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00002-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00003-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00004-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00005-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00006-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00007-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00008-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00009-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00010-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00011-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00012-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00013-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00014-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00015-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00016-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00017-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00018-of-00020.tfrecord.gz
./examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-00019-of-00020.tfrecord.gz
And for the tuning set (which was shuffled normally using just one step):
>cat ./examples_shuffled/tune/All_samples_tune_examples.dataset_config.pbtxt
# Generated by shuffle_tfrecords_beam.py
name: "Shuffle_global"
tfrecord_path: "/home/examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-?????-of-?????.tfrecord.gz"
num_examples: 202421
# class0: 202421
#
# --input_pattern_list=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training//examples/tune_all/*tune_examples*tfrecord-000*-of-00040
# --output_pattern_prefix=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training//examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled
#
and
ls /home/examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-?????-of-?????.tfrecord.gz
gives the desired file list again:
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00000-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00001-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00002-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00003-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00004-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00005-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00006-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00007-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00008-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00009-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00010-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00011-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00012-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00013-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00014-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00015-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00016-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00017-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00018-of-00020.tfrecord.gz
./examples_shuffled/tune/All_samples_all_tune_examples_inc_downsampled_05.shuffled-00019-of-00020.tfrecord.gz
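As a quick sanity check on the edited tfrecord_path patterns: the ? wildcards are shell-style single-character matches, which Python's fnmatch mimics. Here is a small check using two of the real shard names plus one deliberately malformed one:

```python
from fnmatch import fnmatch

# '?' matches exactly one character, so 'pt?-?????-of-?????' matches
# both the pt3 and pt4 shards with 5-digit shard indices.
pattern = ("All_samples_all_training_examples_inc_downsampled_05"
           ".shuffle_2_pt?-?????-of-?????.tfrecord.gz")

names = [
    "All_samples_all_training_examples_inc_downsampled_05"
    ".shuffle_2_pt3-00000-of-00020.tfrecord.gz",
    "All_samples_all_training_examples_inc_downsampled_05"
    ".shuffle_2_pt4-00019-of-00020.tfrecord.gz",
    # 4-digit shard index: should NOT match the five '?'s.
    "All_samples_all_training_examples_inc_downsampled_05"
    ".shuffle_2_pt3-0000-of-00020.tfrecord.gz",
]
matches = [n for n in names if fnmatch(n, pattern)]
print(len(matches))  # 2
```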
Hope that's useful!
@DanJeffries ,
I don't think shuffling is the issue in your training set. You can see that this is the class distribution for your training data:
#name: "Shuffle_global"
#tfrecord_path: "/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4-?????-of-?????.tfrecord.gz"
#num_examples: 727188
# class0: 727188
#
# --input_pattern_list=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_inputs/All_samples_all_training_examples_inc_downsampled_05_pt4.shuffled-000*-of-00020.tfrecord.gz
# --output_pattern_prefix=/storage/scratch/iee/dj20y461/Stickleback/G_aculeatus/FITNESS/DV_training/examples_shuffled/train/shuffle_2_outputs/All_samples_all_training_examples_inc_downsampled_05.shuffle_2_pt4
#
Meaning all of the examples you created are of class 0. Can you check if your truth VCF has 1/1 and 0/1 variants?
For example, on a training set on our end, the pbtxt file's output looks like this:
# Classes:
# class 0: 855059086
# class 1: 1443187583
# class 2: 947352461
# Indel or SNP:
# Indel: 1227997460
# SNP: 2017601670
Can you run bcftools stats on your truth set?
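If bcftools isn't handy, the class split can also be eyeballed straight from the genotypes. This is a rough Python sketch (not DeepVariant code), assuming a single-sample VCF with GT as the first FORMAT field (class 0 = hom-ref, class 1 = het, class 2 = hom-alt):

```python
def count_gt_classes(vcf_lines):
    """Rough GT tally for a single-sample VCF. Assumes GT is the first
    FORMAT field; skips header lines."""
    counts = {0: 0, 1: 0, 2: 0}
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        sample = line.rstrip("\n").split("\t")[9]       # first sample column
        gt = sample.split(":")[0].replace("|", "/")
        alleles = [a for a in gt.split("/") if a != "."]
        alt = sum(1 for a in alleles if a != "0")
        if alt == 0:
            counts[0] += 1          # hom-ref
        elif alt == len(alleles):
            counts[2] += 1          # hom-alt
        else:
            counts[1] += 1          # het
    return counts
```

If the truth VCF really has 0/1 and 1/1 records, counts for classes 1 and 2 should be non-zero.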
Hi @kishwarshafin ,
Ah ok, that explains it. Yes, I definitely have variants in my VCF, between 850k and 1.15M for each of 5 samples. Here is the output of bcftools stats for one of the samples:
Perhaps it is my make_examples command? I made examples for each sample separately using the command below.
apptainer run \
-B $WD:/wd \
$DV_PATH \
parallel -q --halt 2 --line-buffer \
/opt/deepvariant/bin/make_examples \
--mode training \
--ref $REF \
--reads /wd/bams/${SAMPLE}.fixmate.coordsorted.bam \
--truth_variants /wd/Filtered_variants/${SAMPLE}.ALL_TRUTH_VARS.CORRECTED.vcf.gz \
--confident_regions /wd/Confident_regions/${SAMPLE_BED_NAME}.conf.bed \
--examples /wd/examples/train/${SAMPLE}/training_examples.tfrecord@20 \
--regions /wd/training_regions/${CROSS}_train_partitions.bed \
--channels "insert_size" \
--task {} ::: `seq 0 19` #split the task into 20 jobs
And here's an excerpt from this sample's VCF:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SR_male_1
NC_053212.1_chromosome_1 1449 . C T,<NON_REF> 347.6 . . GT:AD:DP:GQ:PL:SB 0/1:11,10,0:21:99:355,0,328,388,358,746:4,7,3,7
NC_053212.1_chromosome_1 2214 . T TTGTTTGAC,<NON_REF> 750.06 . . GT:AD:DP:GQ:PL:SB 1/1:0,15,0:15:51:764,51,0,765,51,765:0,0,2,13
NC_053212.1_chromosome_1 4741 . C T,<NON_REF> 807.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,21,0:21:63:821,63,0,821,63,821:0,0,9,12
NC_053212.1_chromosome_1 5560 . G A,<NON_REF> 1001.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,27,0:27:81:1015,81,0,1015,81,1015:0,0,19,8
NC_053212.1_chromosome_1 5876 . C A,<NON_REF> 501.6 . . GT:AD:DP:GQ:PL:SB 0/1:15,16,0:31:99:509,0,495,554,543,1097:3,12,7,9
NC_053212.1_chromosome_1 6440 . C T,<NON_REF> 701.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,19,0:19:57:715,57,0,715,57,715:0,0,9,10
NC_053212.1_chromosome_1 8670 . A G,<NON_REF> 1198.03 . . GT:AD:DP:GQ:PGT:PID:PL:PS:SB 1|1:0,26,0:26:81:1|1:8670_A_G:1212,81,0,1212,81,1212:8670:>
NC_053212.1_chromosome_1 8682 . G C,<NON_REF> 1281.03 . . GT:AD:DP:GQ:PGT:PID:PL:PS:SB 1|1:0,29,0:29:87:1|1:8670_A_G:1295,87,0,1295,87,1295:8670:>
NC_053212.1_chromosome_1 12561 . C G,<NON_REF> 699.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,19,0:19:57:713,57,0,713,57,713:0,0,7,12
NC_053212.1_chromosome_1 14290 . G GA,<NON_REF> 1253.06 . . GT:AD:DP:GQ:PL:SB 1/1:0,36,0:36:99:1267,108,0,1267,108,1267:0,0,20,16
NC_053212.1_chromosome_1 24265 . A G,<NON_REF> 1344.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,35,0:35:99:1358,105,0,1358,105,1358:0,0,14,21
NC_053212.1_chromosome_1 24294 . G C,<NON_REF> 1547.03 . . GT:AD:DP:GQ:PGT:PID:PL:PS:SB 1|1:0,34,0:34:99:1|1:24294_G_C:1561,105,0,1561,105,1561:24>
NC_053212.1_chromosome_1 24307 . GC G,<NON_REF> 1422.03 . . GT:AD:DP:GQ:PGT:PID:PL:PS:SB 1|1:0,32,0:32:96:1|1:24294_G_C:1436,96,0,1436,96,1436:2429>
NC_053212.1_chromosome_1 26157 . A C,<NON_REF> 950.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,25,0:25:75:964,75,0,964,75,964:0,0,9,16
NC_053212.1_chromosome_1 26296 . A G,<NON_REF> 1166.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,30,0:30:90:1180,90,0,1180,90,1180:0,0,15,15
NC_053212.1_chromosome_1 27181 . C T,<NON_REF> 834.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,23,0:23:69:848,69,0,848,69,848:0,0,13,10
NC_053212.1_chromosome_1 29086 . T A,<NON_REF> 851.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,21,0:21:63:865,63,0,865,63,865:0,0,11,10
NC_053212.1_chromosome_1 31158 . A C,<NON_REF> 588.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,15,0:15:45:602,45,0,602,45,602:0,0,7,8
NC_053212.1_chromosome_1 37394 . TCCGGGGGTCCGGGCCCCCCCCCCCCC T,<NON_REF> 886.03 . . GT:AD:DP:GQ:PGT:PID:PL:PS:SB 1|1:0,20,0:20:60:1|1:37379_T_A:900>
NC_053212.1_chromosome_1 39747 . C A,<NON_REF> 660.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,17,0:17:51:674,51,0,674,51,674:0,0,13,4
NC_053212.1_chromosome_1 42506 . C T,<NON_REF> 1121.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,28,0:28:84:1135,84,0,1135,84,1135:0,0,12,16
NC_053212.1_chromosome_1 46081 . A G,<NON_REF> 620.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,16,0:16:48:634,48,0,634,48,634:0,0,4,12
NC_053212.1_chromosome_1 47173 . G A,<NON_REF> 1059.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,29,0:29:87:1073,87,0,1073,87,1073:0,0,5,24
NC_053212.1_chromosome_1 47399 . TTG T,<NON_REF> 675.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,19,0:19:57:689,57,0,689,57,689:0,0,16,3
NC_053212.1_chromosome_1 47570 . G C,<NON_REF> 385.6 . . GT:AD:DP:GQ:PL:SB 0/1:11,11,0:22:99:393,0,381,426,414,841:6,5,5,6
NC_053212.1_chromosome_1 47768 . ATG A,<NON_REF> 812.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,22,0:22:66:826,66,0,826,66,826:0,0,9,13
NC_053212.1_chromosome_1 48014 . CA C,<NON_REF> 876.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,24,0:24:72:890,72,0,890,72,890:0,0,16,8
NC_053212.1_chromosome_1 48426 . A G,<NON_REF> 780.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,19,0:19:57:794,57,0,794,57,794:0,0,9,10
NC_053212.1_chromosome_1 50624 . T C,<NON_REF> 1021.03 . . GT:AD:DP:GQ:PGT:PID:PL:PS:SB 1|1:0,22,0:22:69:1|1:50616_C_T:1035,69,0,1035,69,1035:5061>
NC_053212.1_chromosome_1 50765 . TC T,<NON_REF> 1005.03 . . GT:AD:DP:GQ:PL:SB 1/1:2,26,0:28:64:1019,64,0,1024,78,1038:1,1,14,12
NC_053212.1_chromosome_1 50887 . G T,<NON_REF> 856.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,23,0:23:69:870,69,0,870,69,870:0,0,10,13
NC_053212.1_chromosome_1 50971 . A T,<NON_REF> 699.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,19,0:19:57:713,57,0,713,57,713:0,0,9,10
NC_053212.1_chromosome_1 51160 . C T,<NON_REF> 1100.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,31,0:31:92:1114,92,0,1114,92,1114:0,0,10,21
NC_053212.1_chromosome_1 53199 . TCA T,<NON_REF> 767.03 . . GT:AD:DP:GQ:PL:SB 1/1:0,20,0:20:60:781,60,0,781,60,781:0,0,7,13
And I have attached the log file from make_examples for this sample:
MAKE_EX_TRAIN_3148249-3.err.gz
Thanks
Dan
@DanJeffries, sorry for the late reply I was traveling for a conference.
Looking at the log, most of it looks like this:
I0812 17:25:36.970339 140640080795456 make_examples_core.py:301] Task 18/20: 130000 candidates (5465 examples) [24.32s elapsed]
I0812 17:25:38.358597 140641138153280 make_examples_core.py:301] Task 17/20: 136011 candidates (5980 examples) [24.30s elapsed]
I0812 17:25:38.452311 140719761319744 haplotype_labeler.py:449] Not including more because genotype_options_product will be 157464.0, which exceeds max(=100000)
I0812 17:25:39.808730 139930050549568 make_examples_core.py:301] Task 19/20: 130009 candidates (5415 examples) [19.65s elapsed]
I0812 17:25:40.161706 140719761319744 make_examples_core.py:301] Task 1/20: 130009 candidates (5656 examples) [26.55s elapsed]
I0812 17:25:39.762312 140656897058624 haplotype_labeler.py:449] Not including more because genotype_options_product will be 118098.0, which exceeds max(=100000)
I0812 17:25:40.178942 139929751582528 haplotype_labeler.py:449] Not including more because genotype_options_product will be 118098.0, which exceeds max(=100000)
I0812 17:25:41.815151 139685487241024 make_examples_core.py:301] Task 8/20: 134010 candidates (5603 examples) [20.32s elapsed]
I0812 17:25:42.062644 140656897058624 make_examples_core.py:301] Task 14/20: 130007 candidates (5538 examples) [26.62s elapsed]
I0812 17:25:42.944558 139916975953728 haplotype_labeler.py:449] Not including more because genotype_options_product will be 157464.0, which exceeds max(=100000)
I0812 17:25:42.945389 139916975953728 haplotype_labeler.py:449] Not including more because genotype_options_product will be 944784.0, which exceeds max(=100000)
I0812 17:25:42.940323 140366879631168 make_examples_core.py:301] Task 6/20: 132005 candidates (5726 examples) [25.32s elapsed]
I0812 17:25:43.504171 139929751582528 make_examples_core.py:301] Task 4/20: 132003 candidates (5481 examples) [23.80s elapsed]
I0812 17:25:43.563577 140585679882048 haplotype_labeler.py:449] Not including more because genotype_options_product will be 131220.0, which exceeds max(=100000)
I0812 17:25:45.208675 140012542068544 make_examples_core.py:301] Task 12/20: 134003 candidates (5624 examples) [21.31s elapsed]
I0812 17:25:45.249171 140585679882048 haplotype_labeler.py:449] Not including more because genotype_options_product will be 275562.0, which exceeds max(=100000)
I0812 17:25:44.861580 140290187421504 make_examples_core.py:301] Task 0/20: 134001 candidates (5695 examples) [22.80s elapsed]
I0812 17:25:44.957572 140719761319744 haplotype_labeler.py:449] Not including more because genotype_options_product will be 157464.0, which exceeds max(=100000)
I0812 17:25:44.749801 140624495118144 make_examples_core.py:301] Task 9/20: 132004 candidates (5675 examples) [26.91s elapsed]
I0812 17:25:45.658525 139725838563136 haplotype_labeler.py:449] Not including more because genotype_options_product will be 157464.0, which exceeds max(=100000)
I0812 17:25:48.367301 140203250202432 make_examples_core.py:301] Task 3/20: 130018 candidates (5401 examples) [24.78s elapsed]
I0812 17:25:49.316477 140120687957824 haplotype_labeler.py:449] Not including more because genotype_options_product will be 118098.0, which exceeds max(=100000)
I0812 17:25:50.556129 140585679882048 make_examples_core.py:301] Task 16/20: 132012 candidates (5581 examples) [21.74s elapsed]
I0812 17:25:51.780121 140641138153280 make_examples_core.py:301] Task 17/20: 138019 candidates (6057 examples) [13.42s elapsed]
It really looks like you have a lot of variants, is this expected for your sample? What's happening here is that the haplotype_labler is trying to label the candidates but failing because there are way too many combinations and it's giving up. What is the truth that you are using for this sample?
Hi @kishwarshafin ,
No worries, thanks for finding the time to get back to me! And thanks for the explanation - I had read that "Not including more" line in the log file as meaning it was still including some.
So the truth data is generated from the offspring of 5 trios. Variants for the parents and offspring were called (using GATK4) against a reference, mendelian expectations were checked for each locus, and the loci that passed that check, as well as some hard filters (e.g. depth, GQ etc), were kept for the truth set.
The variants in the truth set are the combination of all variants that passed these filters across the 5 trios, which amounts to around 450,000 variants (split into ~350k for training and ~100k for tuning). The reference is from a different population, which will probably result in more hom-alt SNPs against the ref, but other than that, I don't think this number of SNPs is particularly high for a 400Mb genome of a wild fish with relatively large population sizes. The 0.5x downsampling of course doubles this number, resulting in ~900,000 truth vars in total.
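The mendelian check amounts to something like this (a toy Python sketch of the logic, not the actual pipeline code; diploid, unphased or phased GT strings assumed):

```python
def mendelian_consistent(child, mother, father):
    """True if each child allele can be attributed to one parent,
    one allele per parent (diploid GT strings like '0/1' or '1|1')."""
    a, b = child.replace("|", "/").split("/")
    m = set(mother.replace("|", "/").split("/"))
    f = set(father.replace("|", "/").split("/"))
    # One child allele from the mother, the other from the father,
    # in either order.
    return (a in m and b in f) or (a in f and b in m)
```

Loci failing this check (plus the hard filters) were dropped from the truth set.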
So, could a solution be to increase the maximum for genotype_options_product? Or would you suggest subsampling the truth variants?
Thanks! Dan
Hi @DanJeffries ,
To start with, can you add this to your make_examples command and see if it helps:
--labeler_algorithm=positional_labeler \
This should switch the labeling from haplotype to positional and it should solve this issue. Let me know if this works.
Hi @kishwarshafin ,
I made the change you suggested. It didn't fix the issue, however you did help point me to the real cause. It turns out that when I created the confident regions bed file I failed to include the positions of the variants (only confident hom_ref positions). Having gone back and added these positions, examples now contain all 3 classes of site.
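For anyone hitting the same thing: the fix boils down to unioning the truth-variant positions into the confident-regions intervals. A toy Python sketch of that union (made-up coordinates; in practice this was done on the BED file itself):

```python
def add_variants_to_confident_regions(regions, variant_positions):
    """Toy illustration of the fix: the confident regions originally held
    only confident hom-ref intervals, so each truth-variant position must
    be added and overlapping intervals merged.

    regions: list of (start, end) half-open intervals
    variant_positions: 0-based variant start positions
    """
    intervals = sorted(regions + [(p, p + 1) for p in variant_positions])
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps or abuts the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With the variant positions included, make_examples can emit class-1 and class-2 examples inside the confident regions instead of only class 0.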
One last question - given that the labeler was not the issue, should I switch back to using the haplotype aware labeller? I read somewhere that this is the better approach, though I am not sure why that is.
Once I have heard from you on this last point I'll close the issue.
Thanks
Dan
Glad you figured it out! So, haplotype_labeler is very good for labeling INDELs, especially in regions where you observe global alignment jitter. On the other hand, SNPs usually get placed at the correct position almost every time, so with positional_labeler you will see more SNPs correctly annotated. There are pros and cons to both, but given you have SO MANY SNP variants, I think I'd try positional_labeler for this use case, since haplotype_labeler is sophisticated enough that it will refuse to annotate many variant-dense regions. But you are welcome to try them both; usually the difference should be minimal.
Sorry for the vague answer, but there's no correct answer here. Hope your training goes well.
Hi @kishwarshafin,
Ok thanks for the explanation.
Unfortunately, I am still having trouble. For some reason when I use positional_labeler my make_examples jobs fail. But they complete successfully using haplotype_labeler. I can't figure out what the issue is as the error message isn't particularly informative, at least not to me. I attach two log files from the make_examples step. Both are for the same sample, the only difference is that one uses haplotype_labeler (job succeeded) and the other uses positional_labeler (job failed).
It would be great to get your opinion on what is going on. Note that I have tested the positional labeler a few times and it does seem to work for one sample, but there is no reason this sample should be distinct from the others.
haplotype_labeler: MAKE_EX_TRAIN_NEW_4926611-3.err.gz
positional_labeler: MAKE_EX_TRAIN_NEW_4930167-3.err.gz
Thanks
Dan
So it looks like the VCF parsing is failing here:
I0911 20:23:50.839811 140713511122752 positional_labeler.py:163] Multiple matches detected; no good match found. Fall back to first. variant: reference_bases: "G"
alternate_bases: "A"
info {
key: "BAM_FNAME"
value {
values {
string_value: "SR_male_1.fixmate.coordsorted.bam"
}
}
}
calls {
info {
key: "AD"
value {
values {
int_value: 0
}
values {
int_value: 22
}
}
}
info {
key: "DP"
value {
values {
int_value: 22
}
}
}
info {
key: "VAF"
value {
values {
number_value: 1.0
}
}
}
genotype: -1
genotype: -1
call_set_name: "SR_male_1"
}
end: 14057936
reference_name: "NC_053213.1_chromosome_2"
start: 14057935
: matches: [reference_bases: "G"
alternate_bases: "A"
alternate_bases: "<NON_REF>"
Looks like your truth contains alleles like "<NON_REF>".
Hi @kishwarshafin ,
Yeah, these are standard VCFs, output by GenotypeGVCFs (GATK 4.1.3) and filtered by BCFtools.
After having done some more tests, I don't think the existence of <NON_REF> alleles is the problem.
I cannot see any pattern in the loci reported in the log file at the point at which the job fails. And still, some jobs succeed. Anyway, I will keep digging and close this issue as the original question was solved.
Thanks for the help.
Hello DV team, and thanks for creating such a great tool!
I am currently trying to retrain the WGS model for a new species (a fish). However, during training I see no evaluation statistics (precision, recall, F1) for either het or homalt; more specifically, they are all 0.0. Eval stats are reported for homref, though. I have now tried running the training several times with different hyperparameters, but so far still no change in the het or homalt eval stats.
My first, very simple question is thus: are these eval stats truly 0 (i.e. the model is very bad), or is 0.0 some starting value and there are not enough data to calculate them initially? I am warm-starting from the 1.6.1 WGS model, so I can't imagine the model is really that bad at calling variants initially, even in a fish.
Setup: Running on a university computing cluster (https://hpc-unibe-ch.github.io/). OS: Rocky 9.3 Blue Onyx. GPU: RTX 4090. Installation: running from the Docker image via Singularity. DV version: 1.6.1.
Data: I am training on examples from 5 individuals; data from Illumina NovaSeq, ~20x coverage. 17/21 chromosomes used for training (~1.45M examples), 2/21 chromosomes used for tuning (~200k examples), 2/21 chromosomes reserved for testing. (Different chromosomes used for train/tune/test across samples - see below.)
Shuffling: Performed downsampling=0.5. Shuffled globally across samples, chromosomes and downsampling.
Command
My latest training run was like so:
Though previous runs had higher learning rates (0.01) and batch sizes (128). Training proceeds as follows:
Training Examples: 1454377 Batch Size: 64 Epochs: 1 Steps per epoch: 22724 Steps per tune: 3162 Num train steps: 22724
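(For what it's worth, the reported step count is just the example count floor-divided by the batch size:)

```python
# Sanity check on the reported training-run numbers (simple arithmetic,
# not DeepVariant code).
num_examples = 1454377
batch_size = 64
steps_per_epoch = num_examples // batch_size  # floor division
print(steps_per_epoch)  # 22724
```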
Log file
Here is the top of the log file, including some warnings in case they are relevant:
And here is an excerpt of from a later portion of the log file including some training and tuning steps, where you can see the 0.0 for het and homalt eval stats.
I am new to Deep Learning and am struggling to decide whether something is wrong with my training approach/scripts or whether the model just needs more time or different hyperparameters. Given the number of examples, I can only run 1 epoch at a time before I hit the 24-hour cluster wall-time limit, so I have only trained for around 30,000 steps in total across 2 epochs so far (restarting from the last checkpoint after the 1st epoch).
All advice much appreciated!