google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

error in training DeepVariant #802

Closed sophienguyen01 closed 2 months ago

sophienguyen01 commented 2 months ago

Hi,

I followed the guide to retrain DeepVariant in here: https://github.com/google/deepvariant/blob/r1.6.1/docs/deepvariant-training-case-study.md

This is my command to retrain using the default model in s3://deepvariant/deepvariant_training/model/1.6.1_wgs_model/:

  time sudo docker run --gpus 1 \
      -v /home/${USER}:/home/${USER} \
      -w /home/${USER} \
      ${DOCKER_IMAGE}-gpu \
      train \
      --config=s3-mount/deepvariant_training/script/dv_config.py:base \
      --config.train_dataset_pbtxt="${SHUFFLE_DIR}/training_set.dataset_config.pbtxt" \
      --config.tune_dataset_pbtxt="${SHUFFLE_DIR}/validation_set.dataset_config.pbtxt" \
      --config.init_checkpoint="${GCS_PRETRAINED_WGS_MODEL}" \
      --config.num_epochs=0 \
      --config.learning_rate=0.02 \
      --config.num_validation_examples=0 \
      --experiment_dir="model_train" \
      --strategy=mirrored \
      --config.batch_size=512 \
      --debug 'true'

I received an error regarding the checkpoint: No checkpoint found. I also attached my log for the training step here: train_040224_failed.log

I'm not clear on where I can get the checkpoint file. My understanding is that the contents of experiment_dir are created by running this training step, is that right?

lucasbrambrink commented 2 months ago

I'm seeing an OOM in the logs:

OP_REQUIRES failed at conv_ops.cc:698 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[16384,32,37,110]

It also shows your training params:

Training Examples: 8264746
Batch Size: 16384
Epochs: 1
Steps per epoch: 504
Steps per tune: 1500000
Num train steps: 504

It seems that the --config.batch_size=512 is not being picked up. It could be related to setting num_epochs=0; try changing that back to the original 10. If that doesn't work, you could edit the batch_size in dv_config.py directly.
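For reference, a hard-coded override in dv_config.py could look roughly like this. This is only a sketch: I'm assuming the config is an ml_collections.ConfigDict selected via "dv_config.py:base" (as in the training case study), and I haven't verified field names beyond batch_size and num_epochs.

    # Rough sketch of a dv_config.py-style override (not the actual DeepVariant file).
    import ml_collections

    def get_config(config_name: str) -> ml_collections.ConfigDict:
      config = ml_collections.ConfigDict()
      config.batch_size = 512      # hard-code here if the --config.batch_size flag is ignored
      config.num_epochs = 10       # keep the original 10 rather than 0
      config.learning_rate = 0.02
      return config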

Let me know if that helps!

sophienguyen01 commented 2 months ago

Hi @lucasbrambrink,

I actually tried different batch sizes (32 and 512), but the smaller batch size takes longer, so I switched to 512. I also tried epochs=10 but still encountered the same error. I just updated my error log file with the No checkpoint found error.

danielecook commented 2 months ago

@sophienguyen01 can you try to run this again without --debug=true? That flag runs TensorFlow in eager mode, which will be very inefficient.

The other issue is that you don't have a checkpoint file because you didn't train long enough, and no checkpoint outperformed the existing model's performance on your tune dataset.

Try re-running with --debug=false and --config.num_epochs=10 and see where that gets you. If you get an OOM error with batch_size=512, reduce it and try again.

If training produces a better model, it will be output in the experiment_dir.

sophienguyen01 commented 2 months ago

Hi @danielecook , I tried without --debug and set --config.num_epochs=10, but I still get the same error. I attached my log file here:

train_040324.log

This is the command I used:

BIN_VERSION="1.6.1"
DOCKER_IMAGE="google/deepvariant:${BIN_VERSION}"

time sudo docker run --gpus 1 \
    -v /home/${USER}:/home/${USER} \
    -w /home/${USER} \
    ${DOCKER_IMAGE}-gpu \
    train \
    --config=s3-mount/deepvariant_training/script/dv_config.py:base \
    --config.train_dataset_pbtxt="${SHUFFLE_DIR}/training_set.dataset_config.pbtxt" \
    --config.tune_dataset_pbtxt="${SHUFFLE_DIR}/validation_set.dataset_config.pbtxt" \
    --config.init_checkpoint="${GCS_PRETRAINED_WGS_MODEL}" \
    --config.num_epochs=10 \
    --config.learning_rate=0.02 \
    --config.num_validation_examples=0 \
    --experiment_dir="model_train" \
    --strategy=mirrored \
    --config.batch_size=512

Did I miss anything?

danielecook commented 2 months ago

@sophienguyen01 - from the log file it looks like everything worked.

Here are all the tune/categorical_accuracy values from your training log:

tune/categorical_accuracy=0.9944317936897278
tune/categorical_accuracy=0.9909400343894958
tune/categorical_accuracy=0.9915463924407959
tune/categorical_accuracy=0.9925118088722229
tune/categorical_accuracy=0.9921825528144836
tune/categorical_accuracy=0.9924613237380981
tune/categorical_accuracy=0.9926846623420715
tune/categorical_accuracy=0.9929667711257935
tune/categorical_accuracy=0.9925829172134399
tune/categorical_accuracy=0.9926416277885437
tune/categorical_accuracy=0.9923893213272095
tune/categorical_accuracy=0.9925225377082825

The first number is the accuracy directly from the pretrained model. Since none of the subsequent tuning evaluations outperformed it, no checkpoints were created.

One thing you could try: reduce the learning rate, and see if that helps.
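For context, the checkpointing described above behaves roughly like "only save when the tune metric improves on the best seen so far". Here is a minimal sketch of that pattern (not the actual train.py code, just the logic):

    import tensorflow as tf

    # checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
    # manager = tf.train.CheckpointManager(checkpoint, "model_train/checkpoints", max_to_keep=5)

    def maybe_save_checkpoint(manager, best_metric, tune_metric):
      """Save a checkpoint only if the tune metric beats the best seen so far."""
      if tune_metric > best_metric:
        manager.save()       # writes ckpt-<step> under experiment_dir/checkpoints
        return tune_metric   # new best
      return best_metric     # nothing written; the old best stands

    # The first evaluation comes from the pretrained init_checkpoint, so if no
    # later epoch beats it, the checkpoints/ directory stays empty and you end
    # up with "No checkpoint found".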

pichuan commented 2 months ago

Hi @sophienguyen01 , let me know if you've had a chance to try this and can share some updates here. Thanks!

sophienguyen01 commented 2 months ago

Hi Pichuan,

You can close this issue now. I tried lowering the learning rate, but the retrained model still does not exceed the performance of the default model.

I will have to train on different samples.

Thanks

sophienguyen01 commented 2 months ago

HI @pichuan,

I trained on a new dataset and ran into a similar issue. This time there are files created in the checkpoint directory, but I still get the same error. Only the first epoch has a low tune/categorical_accuracy; for the remaining epochs the accuracy is higher than 0.9. I attached the log file here: train_041924.log

Here are the parameters I used to train:

    --config.learning_rate=0.0001 \
    --config.num_validation_examples=0 \
    --experiment_dir="model_train" \
    --strategy=mirrored \
    --config.batch_size=32 \

Would you take a look and let me know what's going wrong? Thank you

pichuan commented 2 months ago

Hi @sophienguyen01 , is there a reason why you're setting --config.num_validation_examples=0? You'll need a reasonable number of validation examples for the model to evaluate against and pick a reasonable checkpoint.

sophienguyen01 commented 2 months ago

According to dv_config.py:

 # If set to 0, use full validation dataset.
  config.num_validation_examples = 0

Also, the training tutorial itself uses --config.num_validation_examples=0.
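My reading of that comment is that 0 just means "no cap", i.e. evaluate on the full tune dataset, roughly like this (my own sketch, not the actual code; the numbers are made up):

    def resolve_num_validation_examples(configured: int, dataset_size: int) -> int:
      # 0 means "no cap": evaluate on the full tune/validation dataset.
      return dataset_size if configured == 0 else min(configured, dataset_size)

    print(resolve_num_validation_examples(0, 50000))     # -> 50000 (full dataset)
    print(resolve_num_validation_examples(1500, 50000))  # -> 1500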

kishwarshafin commented 2 months ago

Hi @sophienguyen01 , can you point out the specific line of the error? All the lines in the logs are API warnings; you can safely ignore those.

danielecook commented 2 months ago

@sophienguyen01 the logs indicate that checkpoints are being written:

I0423 18:41:59.026870 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.9114237 step=3352 epoch=1 path=model_train/checkpoints/ckpt-3352
I0423 18:44:53.215049 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.91949123 step=6704 epoch=2 path=model_train/checkpoints/ckpt-6704
I0423 18:47:47.292658 139913113728832 train.py:456] Saved checkpoint tune/f1_weighted=0.92320794 step=10056 epoch=3 path=model_train/checkpoints/ckpt-10056

But as @kishwarshafin suggests, the warnings at the end are normal and can be ignored.
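If you want to double-check which checkpoint will be picked up, you can point tf.train.latest_checkpoint at the checkpoints directory (a quick sketch; the path is taken from your log above, adjust it if your experiment_dir is mounted elsewhere):

    import tensorflow as tf

    # Path from the log above; returns None if no checkpoint was ever saved.
    latest = tf.train.latest_checkpoint("model_train/checkpoints")
    print(latest)  # e.g. model_train/checkpoints/ckpt-10056

If I remember the training case study correctly, that checkpoint can then be passed to run_deepvariant via --customized_model to call variants with your retrained model.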

sophienguyen01 commented 2 months ago

Thank you for your input. I am able to find the checkpoints with this training run.