google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

Issue testing custom model #797

Closed: helizabeth1103 closed this issue 6 months ago

helizabeth1103 commented 6 months ago

Hello,

I have followed along with the advanced training case study, and I believe I was successful in training a model (at least, there were no errors thrown in that step that I could see). I am using one chromosome for the training set, one for validation, and one for testing the model. I am running this remotely on a cluster using apptainer and was able to specify a gpu node for the training step.

When I went to test the model, my script at first appears to run fine, but when it hits the call_variants step it throws a warning, after which it does not fail but also does not progress--it just stays stagnant. The main issue seems to be that my "input shape and model shape do not match," but I'm not sure functionally what that means I need to fix or where I went wrong. Any suggestions on how to resolve this would be very much appreciated! Below is the code I used to train the model and then to test it, as well as the warning/error output thrown during testing. I will also attach the output file as a whole so you can see exactly where it stops.

Thank you so much for any insight!

Best, Haley

deepvariant_modeltest-14698718-Atlas-0021.out.txt

Code to train the model:

```bash
#!/bin/bash
#SBATCH -p atlas
#SBATCH --time=48:00:00        # walltime limit (HH:MM:SS)
#SBATCH --nodes=1              # number of nodes
#SBATCH --gpus-per-node=1      # 20 processor core(s) per node X 2 threads per core
#SBATCH --partition=gpu        # standard node(s)
#SBATCH --ntasks=48
#SBATCH --job-name="deepvariant_training"
#SBATCH --mail-user=haley.arnold@usda.gov   # email address
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --output="deepvariant_modeltrain-%j-%N.out"   # job standard output file (%j replaced by job id)
#SBATCH --error="deepvariant_modeltrain-%j-%N.err"    # job standard error file (%j replaced by job id)
#SBATCH --account=ag100pest

# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE

export PATH=$PATH:/project/ag100pest/sratoolkit/sratoolkit.2.10.9-centos_linux64/bin
export PATH=$PATH:/project/ag100pest/sheina.sim/software/miniconda3/bin

export SINGULARITY_CACHEDIR=$TMPDIR
export SINGULARITY_TMPDIR=$TMPDIR

condapath=/project/ag100pest/sheina.sim/condaenvs
softwarepath=/project/ag100pest/sheina.sim/software
slurmpath=/project/ag100pest/sheina.sim/slurm_scripts

module load apptainer

apptainer exec deepvariant_1.6.0.sif /opt/deepvariant/bin/train \
  --config=/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/dv_config.py:base \
  --config.train_dataset_pbtxt="/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/training_set.pbtxt" \
  --config.tune_dataset_pbtxt="/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/validation_set.pbtxt" \
  --config.init_checkpoint=gs://deepvariant/models/DeepVariant/1.6.1/checkpoints/wgs/deepvariant.wgs.ckpt \
  --config.num_epochs=10 \
  --config.learning_rate=0.0001 \
  --config.num_validation_examples=0 \
  --experiment_dir="/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/training_dir_test2" \
  --strategy=mirrored \
  --config.batch_size=512
```

Code to test the custom model:

```bash
#!/bin/bash
#SBATCH -p atlas
#SBATCH --time=48:00:00        # walltime limit (HH:MM:SS)
#SBATCH --nodes=1              # number of nodes
#SBATCH --ntasks-per-node=1    # 20 processor core(s) per node X 2 threads per core
#SBATCH --partition=atlas      # standard node(s)
#SBATCH --job-name="deepvariant_modeltest"
#SBATCH --mail-user=haley.arnold@usda.gov   # email address
#SBATCH --mail-type=BEGIN
#SBATCH --mail-type=END
#SBATCH --mail-type=FAIL
#SBATCH --output="deepvariant_modeltest-%j-%N.out"   # job standard output file (%j replaced by job id)
#SBATCH --error="deepvariant_modeltest-%j-%N.err"    # job standard error file (%j replaced by job id)
#SBATCH --account=ag100pest

# LOAD MODULES, INSERT CODE, AND RUN YOUR PROGRAMS HERE

export PATH=$PATH:/project/ag100pest/sratoolkit/sratoolkit.2.10.9-centos_linux64/bin
export PATH=$PATH:/project/ag100pest/sheina.sim/software/miniconda3/bin

export SINGULARITY_CACHEDIR=$TMPDIR
export SINGULARITY_TMPDIR=$TMPDIR

condapath=/project/ag100pest/sheina.sim/condaenvs
softwarepath=/project/ag100pest/sheina.sim/software
slurmpath=/project/ag100pest/sheina.sim/slurm_scripts

module load apptainer

apptainer exec deepvariant_1.6.0.sif /opt/deepvariant/bin/run_deepvariant \
  --model_type WGS \
  --customized_model "/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/training_dir_test2/checkpoints/ckpt-58" \
  --ref "/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/idBacDors_rearing_male_chr_unpl_mt.fasta" \
  --reads "/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/DTWP-03_F1_M1_Chromosome4_sorted.bam" \
  --regions "Chromosome4" \
  --output_vcf "/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/training_dir_test2/modeltestset2_n.vcf.gz"
```

Warning/Error Code:

```
warnings.warn(
I0327 22:12:06.039550 139725850806080 call_variants.py:471] Total 1 writing processes started.
I0327 22:12:06.051199 139725850806080 dv_utils.py:365] From /local/scratch/haley.arnold/14698718/tmpg5h0cte0/make_examples.tfrecord-00000-of-00001.gz.example_info.json: Shape of input examples: [100, 221, 7], Channels of input examples: [1, 2, 3, 4, 5, 6, 19].
I0327 22:12:06.052814 139725850806080 call_variants.py:506] Shape of input examples: [100, 221, 7]
I0327 22:12:06.053915 139725850806080 call_variants.py:510] Use saved model: True
I0327 22:12:15.247638 139725850806080 dv_utils.py:365] From /90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/training_dir_test2/checkpoints/ckpt-58/example_info.json: Shape of input examples: [100, 221, 6], Channels of input examples: [1, 2, 3, 4, 5, 6].
I0327 22:12:15.248034 139725850806080 dv_utils.py:365] From /local/scratch/haley.arnold/14698718/tmpg5h0cte0/make_examples.tfrecord-00000-of-00001.gz.example_info.json: Shape of input examples: [100, 221, 7], Channels of input examples: [1, 2, 3, 4, 5, 6, 19].
W0327 22:12:15.248203 139725850806080 call_variants.py:541] Input shape [100, 221, 7] and model shape [100, 221, 6] does not match.
W0327 22:12:15.248327 139725850806080 call_variants.py:549] Input channels [1, 2, 3, 4, 5, 6, 19] and model channels [1, 2, 3, 4, 5, 6] do not match.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 1483, in _call_impl
    return self._call_with_structured_signature(args, kwargs,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 1561, in _call_with_structured_signature
    self._structured_signature_check_missing_args(args, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 1581, in _structured_signature_check_missing_args
    raise TypeError(f"{self._structured_signature_summary()} missing "
TypeError: signature_wrapper(*, input_1) missing required arguments: input_1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/local/scratch/haley.arnold/14698718/Bazel.runfiles_xx0yuppt/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 633, in <module>
    app.run(main)
  File "/local/scratch/haley.arnold/14698718/Bazel.runfiles_xx0yuppt/runfiles/absl_py/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/local/scratch/haley.arnold/14698718/Bazel.runfiles_xx0yuppt/runfiles/absl_py/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/local/scratch/haley.arnold/14698718/Bazel.runfiles_xx0yuppt/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 618, in main
    call_variants(
  File "/local/scratch/haley.arnold/14698718/Bazel.runfiles_xx0yuppt/runfiles/com_google_deepvariant/deepvariant/call_variants.py", line 570, in call_variants
    predictions = model.signatures['serving_default'](...)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 1474, in __call__
    return self._call_impl(args, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 1487, in _call_impl
    return self._call_with_flat_signature(args, kwargs,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 1541, in _call_with_flat_signature
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/saved_model/load.py", line 138, in _call_flat
    return super(_WrapperFunction, self)._call_flat(args, captured_inputs,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 1745, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/polymorphic_function/monomorphic_function.py", line 378, in call
    outputs = execute.execute(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

input depth must be evenly divisible by filter depth: 7 vs 6
   [[{{node StatefulPartitionedCall/inceptionv3/activation/Relu}}]] [Op:__inference_signature_wrapper_14413]
```

lucasbrambrink commented 6 months ago

The issue stems from a mismatch between the set of channels the model was trained on and the channels in the examples generated during run_deepvariant.

The important bit in the logs you posted is:

```
From /90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/training_dir_test2/checkpoints/ckpt-58/example_info.json: Shape of input examples: [100, 221, 6], Channels of input examples: [1, 2, 3, 4, 5, 6].
I0327 22:12:15.248034 139725850806080 dv_utils.py:365] From /local/scratch/haley.arnold/14698718/tmpg5h0cte0/make_examples.tfrecord-00000-of-00001.gz.example_info.json: Shape of input examples: [100, 221, 7], Channels of input examples: [1, 2, 3, 4, 5, 6, 19].
W0327 22:12:15.248203 139725850806080 call_variants.py:541] Input shape [100, 221, 7] and model shape [100, 221, 6] does not match.
W0327 22:12:15.248327 139725850806080 call_variants.py:549] Input channels [1, 2, 3, 4, 5, 6, 19] and model channels [1, 2, 3, 4, 5, 6] do not match.
```

Your customized model was trained on [1, 2, 3, 4, 5, 6] (the BASE_CHANNELS), but the examples in make_examples.tfrecord-00000-of-00001.gz have an extra channel, 19 (insert_size), which is added by the WGS model preset.

You can either:

a) include `--channels "insert_size"` when generating the training data, or
b) not set `--model_type WGS` when you call `run_deepvariant` (which you may not need to do regardless, if you provide a `customized_model`).

The choice comes down to whether you want to include the channel or not. Experiments have shown it provides a slight accuracy boost for WGS, but it's not strictly necessary.
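
For concreteness, option (a) means adding the channel flag to the `make_examples` runs that produce the training and validation examples. Below is a rough single-shard sketch; `REF`, `BAM`, `TRUTH_VCF`, `TRUTH_BED`, `OUTPUT_DIR`, and `TRAIN_REGION` are placeholders rather than paths from this thread:

```bash
# Sketch only: generate labeled training examples with the insert_size channel
# so they match what the --model_type WGS preset produces at call time.
# All variables below are placeholders for your own files and regions.
apptainer exec deepvariant_1.6.0.sif /opt/deepvariant/bin/make_examples \
  --mode training \
  --ref "${REF}" \
  --reads "${BAM}" \
  --truth_variants "${TRUTH_VCF}" \
  --confident_regions "${TRUTH_BED}" \
  --examples "${OUTPUT_DIR}/training_set.with_label.tfrecord.gz" \
  --channels "insert_size" \
  --regions "${TRAIN_REGION}"
```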

helizabeth1103 commented 6 months ago

Hello, thank you for your reply!

Not setting --model_type WGS led to an error. To clarify, when you say generating the training data, you're referring to including --channels "insert_size" in the make_examples steps for the training and validation sets, correct? Or do you mean the step where the custom model is trained?

Thank you!

Haley

lucasbrambrink commented 6 months ago

That's right, channels are set during make_examples when generating training and validation sets. The model will then use those during training.
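
If it's useful, one way to double-check that the flag took effect is to look at the example_info.json that make_examples writes next to each examples shard (the same file call_variants reads in the log above). A rough sketch, with an illustrative filename rather than your exact output path:

```bash
# Sketch: each examples shard has a sibling *.example_info.json recording the
# example shape and channel list; after adding insert_size, the channel list
# should include 19. The filename below is illustrative.
cat "${OUTPUT_DIR}/training_set.with_label.tfrecord-00000-of-00001.gz.example_info.json"
```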

helizabeth1103 commented 6 months ago

Thank you! I re-ran the training and validation sets with that flag, and re-shuffled them. Now, however, when I go to train the model (using the same parameters as the example case study--I just want to test out the process) I'm not getting any checkpoints in the output training directory, just the event log and the json file. What does this mean? Is the training step failing, or do I simply need to adjust my parameters? Thank you!

lucasbrambrink commented 6 months ago

1) Can you confirm that you have generated training/validation data? E.g., run

gsutil cat "${OUTPUT_BUCKET}/training_set.dataset_config.pbtxt"

and

cat "${OUTPUT_DIR}/validation_set.dataset_config.pbtxt"

2) What do you see in the ${LOG_DIR}/train.log file?

helizabeth1103 commented 6 months ago

Yes, I definitely got each pbtxt file. Attached below are the log files from the model train step. When I ran this step before (when I had not used the --channels flag, and could not test the model), the .err file for the model training step looked as though it reached a stopping point, whereas in this run it looks like it simply stopped and did not reach that same point. It's definitely not a timeout issue, but I'm not sure what's causing it.

The pbtxt file for the validation set (training set looks similar) looks like this:

```
# Generated by shuffle_tfrecords_beam.py
#
# --input_pattern_list=/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/validation_set.with_label_channlesize.tfrecord.gz
# --output_pattern_prefix=/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/validation_set.with_label_channelsize.shuffled
#

name: "Chromosome3"
tfrecord_path: "/90daydata/pbarc/haley.arnold/AI_Model_Training/Samples/deepvariant_output/validation_set.with_label_channelsize.shuffled-?????-of-?????.tfrecord.gz"
num_examples: 35759
# class1: 27257
# class0: 1777
# class2: 6725
```

And here are the log files from the attempted model training: deepvariant_modeltrain-14705863-Atlas-0031.err.txt deepvariant_modeltrain-14705863-Atlas-0031.out.txt

Thank you for your help!

Best, Haley

helizabeth1103 commented 6 months ago

Not sure what changed, but I ran the same code again and it produced a checkpoint output file this time! Is the model_eval step no longer necessary? I see it in the 1.5 version of the case study documentation but not in the 1.6 version. Thank you!

lucasbrambrink commented 6 months ago

Great! The logs you posted confirmed that the checkpoints were not being written, but it's not clear why that was the case. I will close this issue for now, but please don't hesitate to reopen if you encounter it again!

To your second question, that's correct! In 1.6, we migrated our training and inference platform from Slim to Keras, and as part of this effort we combined model_train and model_eval into a single executable, train, to make training easier.

danielecook commented 6 months ago

@helizabeth1103 @lucasbrambrink I'll clear up some confusion real quick here: the updated training script only writes a checkpoint when the tune performance beats the previous best.

If you look closely in the log file you can see this line:

```
I0401 03:09:48.932735 140045983049536 train.py:471] Skipping checkpoint with tune/f1_weighted=0.83932966 < previous best tune/f1_weighted=0.8400078
```

This line states that the checkpoint is being skipped because its tune performance was worse than the previous best.

So in general, if you aren't seeing checkpoints, you likely need to adjust parameters or train for longer.
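
One way to see every such decision is to search the training log for the tune metric, roughly as below (this assumes the train.py log lines ended up in the attached .err file; adjust to wherever your job's stderr lands):

```bash
# Sketch: list every checkpoint decision logged by train.py; skipped checkpoints
# appear as "Skipping checkpoint with tune/f1_weighted=... < previous best ...".
grep "tune/f1_weighted" deepvariant_modeltrain-14705863-Atlas-0031.err.txt
```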

helizabeth1103 commented 5 months ago

Thanks for clearing that up! I appreciate it. I did use hap.py to compare the customized model to the WGS model and it appears to have performed slightly worse, so I'll keep this in mind for future tests.
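
For anyone following along, a hap.py comparison of a DeepVariant call set against a truth set generally looks something like the sketch below; `TRUTH_VCF`, `QUERY_VCF`, `TRUTH_BED`, and `REF` are placeholders, not the exact files used in this thread:

```bash
# Sketch: benchmark a DeepVariant call set against a truth VCF with hap.py.
# All variables are placeholders for your own truth set, query VCF,
# confident-regions BED, and reference FASTA.
mkdir -p happy_output
hap.py \
  "${TRUTH_VCF}" \
  "${QUERY_VCF}" \
  -f "${TRUTH_BED}" \
  -r "${REF}" \
  -o happy_output/custom_model_vs_truth \
  --engine=vcfeval
```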