Closed: helizabeth1103 closed this issue 6 months ago
The issue stems from a mismatch between the set of channels the model was trained on and the channels in the examples generated during `run_deepvariant`.
The important bit in the logs you posted is:
From /90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/training_dir_test2/checkpoints/ckpt-58/example_info.json: Shape of input examples: [100, 221, 6], Channels of input examples: [1, 2, 3, 4, 5, 6].
I0327 22:12:15.248034 139725850806080 dv_utils.py:365] From /local/scratch/haley.arnold/14698718/tmpg5h0cte0/make_examples.tfrecord-00000-of-00001.gz.example_info.json: Shape of input examples: [100, 221, 7], Channels of input examples: [1, 2, 3, 4, 5, 6, 19].
W0327 22:12:15.248203 139725850806080 call_variants.py:541] Input shape [100, 221, 7] and model shape [100, 221, 6] does not match.
W0327 22:12:15.248327 139725850806080 call_variants.py:549] Input channels [1, 2, 3, 4, 5, 6, 19] and model channels [1, 2, 3, 4, 5, 6] do not match.
Your customized model was trained on `[1, 2, 3, 4, 5, 6]` (the `BASE_CHANNELS`), but the examples in `make_examples.tfrecord-00000-of-00001.gz` have an extra channel, 19 (`insert_size`), which the WGS model preset adds.
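The same comparison can be made before `call_variants` runs by reading the two `example_info.json` files named in the log. The following is a minimal sketch, not part of DeepVariant: the helper names `channels_of` and `check_channels` are hypothetical, and it assumes each file is a flat JSON object containing a `"channels": [...]` entry.

```shell
# Print the "channels" list of an example_info.json file as "1,2,3".
# Assumption: the file is a flat JSON object with a "channels": [...] entry.
channels_of() {
  tr -d '\n' < "$1" |
    sed -n 's/.*"channels"[[:space:]]*:[[:space:]]*\[\([^]]*\)].*/\1/p' |
    tr -d ' '
}

# Succeed only if both files list the same channels in the same order.
# $1: example_info.json next to the model checkpoint
# $2: example_info.json written next to the make_examples output
check_channels() {
  model_channels=$(channels_of "$1")
  example_channels=$(channels_of "$2")
  if [ "$model_channels" = "$example_channels" ]; then
    echo "OK: channels match ($model_channels)"
  else
    echo "MISMATCH: model [$model_channels] vs examples [$example_channels]"
    return 1
  fi
}
```

Calling `check_channels` with the checkpoint's `example_info.json` and the one beside the generated examples would report the mismatch seen in this thread and return a non-zero status.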
You can either:
a) include `--channels "insert_size"` when generating the training data, or
b) not set `--model_type WGS` when you call `run_deepvariant` (which you may not need to do regardless if you provide a `customized_model`).
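Option (a) might look roughly like the sketch below. It only prints a command and runs nothing. The helper name `make_examples_cmd` and its positional arguments are hypothetical, and the binary path is assumed by analogy with `/opt/deepvariant/bin/train` used elsewhere in this thread. Flags other than `--channels` follow the training case study and should be checked against your own `make_examples` invocation.

```shell
# Print (do not run) a make_examples command that adds the insert_size channel.
# $1: reference FASTA        $2: BAM with reads
# $3: truth VCF              $4: confident-regions BED
# $5: region to process      $6: output examples path
make_examples_cmd() {
  echo /opt/deepvariant/bin/make_examples \
    --mode training \
    --ref "$1" \
    --reads "$2" \
    --truth_variants "$3" \
    --confident_regions "$4" \
    --regions "$5" \
    --examples "$6" \
    --channels insert_size
}
```

The same `--channels` value would need to be used for both the training and the validation set so that every example has the same shape.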
The choice comes down to whether you want to include the channel. Experiments have shown it provides a slight accuracy boost for WGS, but it's not strictly necessary.
Hello, thank you for your reply!
Not setting `--model_type WGS` led to an error. To clarify, when you say generating the training data, you're referring to including `--channels "insert_size"` in the make_examples steps for the training and validation sets, correct? Or do you mean the step where the custom model is trained?
Thank you!
Haley
That's right, `channels` are set during `make_examples` when generating training and validation sets. The model will then use those during training.
Thank you! I re-ran the training and validation sets with that flag, and re-shuffled them. Now, however, when I go to train the model (using the same parameters as the example case study--I just want to test out the process) I'm not getting any checkpoints in the output training directory, just the event log and the json file. What does this mean? Is the training step failing, or do I simply need to adjust my parameters? Thank you!
1) Can you confirm that you have generated training/validation data? E.g., run `gsutil cat "${OUTPUT_BUCKET}/training_set.dataset_config.pbtxt"` and `cat "${OUTPUT_DIR}/validation_set.dataset_config.pbtxt"`.
2) What do you see in the `${LOG_DIR}/train.log` file?
Yes, I definitely got each pbtxt file. Attached below are the log files from the model train step. When I ran this step before (when I had not used the --channels flag, and could not test the model), the .err file for the model training step looked as though it reached a stopping point, whereas in this run it looks like it simply stopped and did not reach that same point. It's definitely not a timeout issue, but I'm not sure what's causing it.
The pbtxt file for the validation set (training set looks similar) looks like this:
# Generated by shuffle_tfrecords_beam.py
#
# --input_pattern_list=/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/validation_set.with_label_channlesize.tfrecord.gz
# --output_pattern_prefix=/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/validation_set.with_label_channelsize.shuffled
#
name: "Chromosome3"
tfrecord_path: "/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/validation_set.with_label_channelsize.shuffled-?????-of-?????.tfrecord.gz"
num_examples: 35759
# class1: 27257
# class0: 1777
# class2: 6725
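One quick sanity check on a config like this is that the per-class counts in the comments add up to `num_examples` (here 27257 + 1777 + 6725 = 35759). The sketch below assumes the layout shown above, with `# classN: count` comment lines and a `num_examples:` field. The function name `check_dataset_config` is hypothetical.

```shell
# Succeed if the "# classN: count" comments in a dataset config sum to num_examples.
# $1: path to a *.dataset_config.pbtxt file in the layout shown above
check_dataset_config() {
  awk '
    /^num_examples:/  { total = $2 }
    /^# class[0-9]+:/ { sum += $3 }
    END {
      printf "num_examples=%d class_sum=%d\n", total, sum
      exit !(total > 0 && total == sum)
    }
  ' "$1"
}
```

A non-zero exit status would suggest an empty or inconsistent dataset, which is worth ruling out before looking at the training step itself.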
And here are the log files from the attempted model training: deepvariant_modeltrain-14705863-Atlas-0031.err.txt deepvariant_modeltrain-14705863-Atlas-0031.out.txt
Thank you for your help!
Best, Haley
Not sure what changed, but I ran the same code again and it produced a checkpoint output file this time! Is the model_eval step no longer necessary? I see it in the 1.5 version of the case study documentation but not in the 1.6 version. Thank you!
Great! The logs you posted confirmed that the checkpoints were not being written, but it's not clear why that was the case. I will close this issue for now, but please don't hesitate to reopen if you encounter it again!
To your second question, that's correct! In 1.6, we migrated our training and inference platform from Slim to Keras, and as part of this effort we combined `model_train` and `model_eval` into a single executable, `train`, to make training easier.
@helizabeth1103 @lucasbrambrink I'll clear up some confusion real quick here: the updated training script only writes a checkpoint when tune performance beats the previous best.
If you look closely in the log file you can see this line:
I0401 03:09:48.932735 140045983049536 train.py:471] Skipping checkpoint with tune/f1_weighted=0.83932966 < previous best tune/f1_weighted=0.8400078
Which states that checkpointing is being skipped because the performance was worse.
So in general, if you aren't seeing checkpoints you likely need to adjust parameters or train for longer.
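To see how often this happens in a run, the skipped evaluations can be pulled out of the training log. This is a minimal sketch that assumes the log lines have the same wording as the one quoted above. The function name `skipped_checkpoints` is hypothetical.

```shell
# Print "metric current previous_best" for every skipped checkpoint in a log.
# $1: training log, e.g. the job's .err file or train.log
skipped_checkpoints() {
  sed -n 's/.*Skipping checkpoint with \([^=]*\)=\([0-9.]*\) < previous best [^=]*=\([0-9.]*\).*/\1 \2 \3/p' "$1"
}
```

For the line quoted above this prints `tune/f1_weighted 0.83932966 0.8400078`. If every evaluation in a run shows up here, no checkpoint after the initial best will have been written.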
Thanks for clearing that up! I appreciate it. I did use hap.py to compare the customized model to the WGS model and it appears to have performed slightly worse, so I'll keep this in mind for future tests.
Hello,
I have followed along with the advanced training case study, and I believe I was successful in training a model (at least, there were no errors thrown in that step that I could see). I am using one chromosome for the training set, one for validation, and one for testing the model. I am running this remotely on a cluster using apptainer and was able to specify a gpu node for the training step.
When I went to test the model, my script at first appeared to run fine, but when it hits the call_variants step it throws a warning, after which it does not fail but also does not progress; it just stays stagnant. The main issue seems to be that my "input shape and model shape do not match," but I'm not sure what that means functionally, what I need to fix, or where I went wrong. Any suggestions on how to resolve this would be very much appreciated! Below is the code I used to train the model and then to test it, as well as the error thrown when testing the model. I will also attach the output file as a whole so you can see exactly where it stops.
Thank you so much for any insight!
Best, Haley
deepvariant_modeltest-14698718-Atlas-0021.out.txt
Code to train the model:
```bash
#!/bin/bash
#SBATCH -p atlas
#SBATCH --time=48:00:00 # walltime limit (HH:MM:SS)
#SBATCH --nodes=1 # number of nodes
#SBATCH --gpus-per-node=1 # 20 processor core(s) per node X 2 threads per core
#SBATCH --partition=gpu # standard node(s)
#SBATCH --ntasks=48
#SBATCH --job-name="deepvariant_training"
#SBATCH --mail-user=haley.arnold@usda.gov # email address
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --output="deepvariant_modeltrain-%j-%N.out" # job standard output file (%j replaced by job id)
#SBATCH --error="deepvariant_modeltrain-%j-%N.err" # job standard error file (%j replaced by job id)
#SBATCH --account=ag100pest

# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE
export PATH=$PATH:/project/ag100pest/sratoolkit/sratoolkit.2.10.9-centos_linux64/bin
export PATH=$PATH:/project/ag100pest/sheina.sim/software/miniconda3/bin

export SINGULARITY_CACHEDIR=$TMPDIR
export SINGULARITY_TMPDIR=$TMPDIR

condapath=/project/ag100pest/sheina.sim/condaenvs
softwarepath=/project/ag100pest/sheina.sim/software
slurmpath=/project/ag100pest/sheina.sim/slurm_scripts

module load apptainer

apptainer exec deepvariant_1.6.0.sif /opt/deepvariant/bin/train \
  --config=/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/dv_config.py:base \
  --config.train_dataset_pbtxt="/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/training_set.pbtxt" \
  --config.tune_dataset_pbtxt="/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/validation_set.pbtxt" \
  --config.init_checkpoint=gs://deepvariant/models/DeepVariant/1.6.1/checkpoints/wgs/deepvariant.wgs.ckpt \
  --config.num_epochs=10 \
  --config.learning_rate=0.0001 \
  --config.num_validation_examples=0 \
  --experiment_dir="/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/training_dir_test2" \
  --strategy=mirrored \
  --config.batch_size=512
```
Code to test the custom model:
```bash
#!/bin/bash
#SBATCH -p atlas
#SBATCH --time=48:00:00 # walltime limit (HH:MM:SS)
#SBATCH --nodes=1 # number of nodes
#SBATCH --ntasks-per-node=1 # 20 processor core(s) per node X 2 threads per core
#SBATCH --partition=atlas # standard node(s)
#SBATCH --job-name="deepvariant_modeltest"
#SBATCH --mail-user=haley.arnold@usda.gov # email address
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --output="deepvariant_modeltest-%j-%N.out" # job standard output file (%j replaced by job id)
#SBATCH --error="deepvariant_modeltest-%j-%N.err" # job standard error file (%j replaced by job id)
#SBATCH --account=ag100pest

# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE
export PATH=$PATH:/project/ag100pest/sratoolkit/sratoolkit.2.10.9-centos_linux64/bin
export PATH=$PATH:/project/ag100pest/sheina.sim/software/miniconda3/bin

export SINGULARITY_CACHEDIR=$TMPDIR
export SINGULARITY_TMPDIR=$TMPDIR

condapath=/project/ag100pest/sheina.sim/condaenvs
softwarepath=/project/ag100pest/sheina.sim/software
slurmpath=/project/ag100pest/sheina.sim/slurm_scripts

module load apptainer

apptainer exec deepvariant_1.6.0.sif /opt/deepvariant/bin/run_deepvariant \
  --model_type WGS \
  --customized_model "/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/training_dir_test2/checkpoints/ckpt-58" \
  --ref "/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/idBacDors_rearing_male_chr_unpl_mt.fasta" \
  --reads "/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/DTWP-03_F1_M1_Chromosome4_sorted.bam" \
  --regions "Chromosome4" \
  --output_vcf "/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/training_dir_test2/modeltestset2_n.vcf.gz"
```
Warning/Error Code:
```
  warnings.warn(
I0327 22:12:06.039550 139725850806080 call_variants.py:471] Total 1 writing processes started.
I0327 22:12:06.051199 139725850806080 dv_utils.py:365] From /local/scratch/haley.arnold/14698718/tmpg5h0cte0/make_examples.tfrecord-00000-of-00001.gz.example_info.json: Shape of input examples: [100, 221, 7], Channels of input examples: [1, 2, 3, 4, 5, 6, 19].
I0327 22:12:06.052814 139725850806080 call_variants.py:506] Shape of input examples: [100, 221, 7]
I0327 22:12:06.053915 139725850806080 call_variants.py:510] Use saved model: True
I0327 22:12:15.247638 139725850806080 dv_utils.py:365] From /90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/training_dir_test2/checkpoints/ckpt-58/example_info.json: Shape of input examples: [100, 221, 6], Channels of input examples: [1, 2, 3, 4, 5, 6].
I0327 22:12:15.248034 139725850806080 dv_utils.py:365] From /local/scratch/haley.arnold/14698718/tmpg5h0cte0/make_examples.tfrecord-00000-of-00001.gz.example_info.json: Shape of input examples: [100, 221, 7], Channels of input examples: [1, 2, 3, 4, 5, 6, 19].
W0327 22:12:15.248203 139725850806080 call_variants.py:541] Input shape [100, 221, 7] and model shape [100, 221, 6] does not match.
W0327 22:12:15.248327 139725850806080 call_variants.py:549] Input channels [1, 2, 3, 4, 5, 6, 19] and model channels [1, 2, 3, 4, 5, 6] do not match.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 1483, in _call_impl
    return self._call_with_structured_signature(args, kwargs,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 1561, in _call_with_structured_signature
    self._structured_signature_check_missing_args(args, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 1581, in _structured_signature_check_missing_args
    raise TypeError(f"{self._structured_signature_summary()} missing "
TypeError: signature_wrapper(*, input_1) missing required arguments: input_1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/local/scratch/haley.arnold/14698718/Bazel.runfiles_xx0yuppt/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 633, in <module>
    app.run(main)
  File "/local/scratch/haley.arnold/14698718/Bazel.runfiles_xx0yuppt/runfiles/absl_py/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/local/scratch/haley.arnold/14698718/Bazel.runfiles_xx0yuppt/runfiles/absl_py/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/local/scratch/haley.arnold/14698718/Bazel.runfiles_xx0yuppt/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 618, in main
    call_variants(
  File "/local/scratch/haley.arnold/14698718/Bazel.runfiles_xx0yuppt/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 570, in call_variants
    predictions = model.signatures['serving_default'](
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 1474, in __call__
    return self._call_impl(args, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 1487, in _call_impl
    return self._call_with_flat_signature(args, kwargs,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 1541, in _call_with_flat_signature
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/saved_model/load.py", line 138, in _call_flat
    return super(_WrapperFunction, self)._call_flat(args, captured_inputs,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 1745, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 378, in call
    outputs = execute.execute(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

input depth must be evenly divisible by filter depth: 7 vs 6
  [[{{node StatefulPartitionedCall/inceptionv3/activation/Relu}}]] [Op:__inference_signature_wrapper_14413]
```