tensorflow / models

Models and examples built with TensorFlow

Retinanet evaluation spikes memory usage on TPUs, crashes training #10528

Closed jacob-zietek closed 1 year ago

jacob-zietek commented 2 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/r2.8.0/official/vision/beta/train.py
https://github.com/tensorflow/models/blob/r2.8.0/official/vision/beta/configs/experiments/retinanet/resnet50fpn_coco_tfds_tpu.yaml

2. Describe the bug

There are exponentially increasing memory spikes on TPUs during the training and evaluation of Retinanet, which eventually cause training to crash. This bug was found while working on the beta project yolov4-tiny. I observed that training needed to be restarted frequently, and there were thousands of lines of...

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/resource_variable_ops.py", line 290, in __del__
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_resource_variable_ops.py", line 257, in destroy_resource_op
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 7186, in raise_from_not_ok_status
tensorflow.python.framework.errors_impl.AbortedError: Unable to find a context_id matching the specified one (12469621923045235436). Perhaps the worker was restarted, or the context was GC'd? [Op:DestroyResourceOp]
Exception ignored in: <function EagerResourceDeleter.__del__ at 0x7f58d152dc80>

in every training log file. Retinanet has the same issue. This crashing was observed on both v3-8 and v2-256 TPUs.

This bug is apparent in both the training output file (stderr and stdout) and the GCP TPU Dashboard charts. The output file shows the crash happening during evaluation, and the spikes in memory on the GCP TPU Dashboard occur only during evaluation. I provide the logs and pictures of the TPU memory usage in the additional context section. This bug was observed in "train_and_eval" mode; it does not occur in "train" mode.

3. Steps to reproduce

Create a v3-8 TPU with version 2.8.0.

Load and SSH into a new GCP Compute Engine VM with the disk image Debian GNU/Linux 10 Buster + TF 2-8-0.

git clone https://github.com/tensorflow/models.git
cd models
git checkout r2.8.0
pip3 install -r official/requirements.txt

Install the COCO dataset; I used a GCP bucket to store mine.
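
If the dataset is stored as TFRecord shards in a bucket, a quick way to confirm the shards are visible and readable before launching training is a check like the one below. This is a minimal sketch using standard TensorFlow I/O APIs; the gs:// pattern is a placeholder, not the exact path I used.

# Sanity-check COCO TFRecord shards in a GCS bucket before training.
# The bucket pattern below is a placeholder; substitute your own.
import tensorflow as tf

train_pattern = "gs://YOUR_BUCKET/coco/train*"  # placeholder pattern

# Confirm the glob resolves to shard files.
shards = tf.io.gfile.glob(train_pattern)
print(f"Found {len(shards)} training shards")

# Read a single record to confirm the shards are parseable.
ds = tf.data.TFRecordDataset(shards[:1])
for raw_record in ds.take(1):
    print(f"First record: {len(raw_record.numpy())} bytes")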

Modify the official/vision/beta/configs/experiments/retinanet/resnet50fpn_coco_tfds_tpu.yaml config to use your COCO dataset instead of tfds (I did this because it was already installed in one of my buckets). It should look like...

runtime:
  distribution_strategy: 'tpu'
  mixed_precision_dtype: 'bfloat16'
task:
  annotation_file: ''  # Can't use annotation file when tfds is used.
  losses:
    l2_weight_decay: 0.0001
  model:
    num_classes: 91
    max_level: 7
    min_level: 3
    input_size: [640, 640, 3]
    norm_activation:
      activation: relu
      norm_epsilon: 0.001
      norm_momentum: 0.99
      use_sync_bn: true
  train_data:
    #tfds_name: 'coco/2017'
    #tfds_split: 'train'
    drop_remainder: true
    dtype: bfloat16
    global_batch_size: 256
    input_path: 'gs://cam2-datasets/coco/train*'
    is_training: true
    shuffle_buffer_size: 1000
  validation_data:
    #tfds_name: 'coco/2017'
    #tfds_split: 'validation'
    drop_remainder: true
    dtype: bfloat16
    global_batch_size: 8
    input_path: 'gs://cam2-datasets/coco/val*'
    is_training: false

In ~/models run the training script...

nohup python3 -m official.vision.beta.train --mode=train_and_eval --experiment=retinanet_resnetfpn_coco --model_dir={MODEL_DIR_HERE} --config_file=~/models/official/vision/beta/configs/experiments/retinanet/resnet50fpn_coco_tfds_tpu.yaml --tpu={TPU_NAME_HERE} > ../retinanet.txt &

Mine looked like...

nohup python3 -m official.vision.beta.train --mode=train_and_eval --experiment=retinanet_resnetfpn_coco --model_dir=gs://cam2-models/new-yolov4-tiny/retinanet/ --config_file=/home/cam2tensorflow/working/models/official/vision/beta/configs/experiments/retinanet/resnet50fpn_coco_tfds_tpu.yaml --tpu=tf-yolo-1 > ../retinanet.txt &

You will see in ../retinanet.txt that the training crashes with errors, and on the TPU monitoring dashboard you should see exponentially increasing spikes in memory usage during evaluation.
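
To quantify how often the run crashed, the log can be scanned for the error signature shown above. This is a small illustrative Python sketch; the ../retinanet.txt path and the marker strings come from the output in this issue.

# Count lines in the nohup training log that match the crash signature.
# The log path matches the command above; adjust it if you redirected elsewhere.
error_markers = (
    "Unable to find a context_id matching the specified one",
    "errors_impl.AbortedError",
)

crash_lines = 0
with open("../retinanet.txt") as log:
    for line in log:
        if any(marker in line for marker in error_markers):
            crash_lines += 1

print(f"Lines matching crash signatures: {crash_lines}")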

4. Expected behavior

The training should run all the way through without crashing due to memory issues on the TPU.

5. Additional context

TPU Memory Usage in dashboard (screenshot)
Retinanet training log after one crash (attached log)

6. System information

saberkun commented 2 years ago

@allenwang28 @gagika this might be interesting to Google Cloud TPU team.

allenwang28 commented 2 years ago

Thanks for raising this. I actually don't think this is due to OOM - I tried to query this node for any OOMs but I couldn't find anything. On the other hand, I do see some TPU node repairs happened in the backend. We've historically seen that this can affect the way that the UI reports the memory (e.g. reporting multiple times).

If there's another instance where this happens (the logs you provided are perfect) I'm glad to take another look!

jacob-zietek commented 2 years ago

Hi Allen, thanks for looking into it.

That is strange; this has been a recurring issue for me since February, so I don't think TPU repairs are the cause. I've attached all of my training logs that include "tensorflow.python.framework.errors_impl.UnavailableError: Unable to find a context_id matching the specified one (18219622181471228192). Perhaps the worker was restarted, or the context was GC'd? [Op:DestroyResourceOp]" below in a zip file.

I am unable to train on TPUs without consistent crashing every 5-6 evaluation cycles. I believe this issue lies somewhere within TensorFlow and not within the YOLO beta project itself because this problem is reproducible on RetinaNet using the steps above. It's possible this has something to do with coco_evaluator and annotation files (although I'm stumped as to why this would lead to seemingly exponentially increasing memory allocation to that degree).
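
To illustrate the kind of accumulation I suspect (this is purely hypothetical pseudocode, not the actual coco_evaluator implementation): if an evaluator keeps appending per-batch predictions to an in-memory list and is never reset between evaluation cycles in "train_and_eval" mode, host memory grows with every cycle, which would look like the spikes above.

# Hypothetical illustration only; not the real coco_evaluator code.
class AccumulatingEvaluator:
    def __init__(self):
        self.predictions = []  # grows without bound unless reset

    def update_state(self, batch_predictions):
        self.predictions.extend(batch_predictions)

    def reset_state(self):
        self.predictions.clear()

evaluator = AccumulatingEvaluator()
for eval_cycle in range(3):
    for _ in range(10):  # 10 eval batches per cycle
        evaluator.update_state([object()] * 8)  # stand-in for detections
    print(f"cycle {eval_cycle}: {len(evaluator.predictions)} stored predictions")
    # evaluator.reset_state()  # without this call, memory keeps climbing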

Every time the crashes happen, the memory usage charts (TPU Memory Usage in dashboard) match the charts I showed previously. It's also worth mentioning that every time this happens I need to stop and start the TPU to get it to work again. Perhaps it's not tripping an OOM somewhere?

Please let me know if there's anything I can do to track down or help fix this issue 😃

https://drive.google.com/file/d/12SeuC_FckSLkcl3cxInQRMScgtt0TNZM/view?usp=sharing

allenwang28 commented 2 years ago

Thanks for sharing the other logs; they are very helpful! We've done some digging and found at least one OOM, so it's likely that your errors were a mixture of OOMs and repairs.

I have a way to view what's on the heap on my side, but unfortunately you might have been using an earlier release candidate, so the tool isn't working for me. Could you please try this again on TF 2.8.0 and reply with newer logs when you see this again? Meanwhile, we'll try to reproduce this on our end as well.

Reopening this as we've confirmed an OOM.

jacob-zietek commented 2 years ago

Hi Allen. Thanks for looking into it further! I will run more experiments on TF 2.8.0 soon to collect logs (all of the YOLO experiments were on TF 2.7.0, and the RetinaNet run was on TF 2.8.0). The steps I mentioned above should reproduce the issue.
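
For anyone reproducing this, a quick way to confirm the VM is really on TF 2.8.0 (and not an earlier release candidate) and can reach the TPU is a check like the following; "tf-yolo-1" is just the TPU name from the earlier command, so substitute your own.

# Verify the TensorFlow version and TPU connectivity before re-running.
import tensorflow as tf

print("TensorFlow version:", tf.__version__)  # expect 2.8.0

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="tf-yolo-1")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
print("TPU devices:", tf.config.list_logical_devices("TPU"))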

jacob-zietek commented 2 years ago

Hi Allen, here is a new log with 2 crashes. This was done using the steps in the original issue, on a fresh VM with TF 2.8.0. Please let me know if you need more logs; I will keep training this model.

https://drive.google.com/file/d/1xl1-2DfW3NIPY6gcYfQuXKwJq5e2xCL3/view?usp=sharing

laxmareddyp commented 1 year ago

Hi @jacob-zietek ,

Thanks for sharing the logs. Is this issue still reproducible?

Thanks.

jacob-zietek commented 1 year ago

Hi @laxmareddyp,

This issue was resolved sometime in late April 2022. I believe the issue was with an experimental AUTOTUNE feature that blew up the buffer size. Unfortunately, I did not save the commit that fixed it. @arashwan might have it saved somewhere.
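
For anyone hitting something similar: a possible mitigation, based on the buffer-size explanation above and not on the actual fix commit, is to cap tf.data autotuning so its buffers cannot grow unbounded. A minimal sketch:

# Sketch of capping tf.data autotuning; an assumption based on the explanation
# above, not the actual commit that fixed this issue.
import tensorflow as tf

def bound_autotune(dataset: tf.data.Dataset) -> tf.data.Dataset:
    options = tf.data.Options()
    options.autotune.enabled = True
    options.autotune.ram_budget = 1 * 1024 * 1024 * 1024  # cap autotuned buffers at ~1 GiB
    return dataset.with_options(options)

# Example: a fixed prefetch buffer instead of tf.data.AUTOTUNE.
dataset = tf.data.Dataset.range(1000).batch(8)
dataset = bound_autotune(dataset).prefetch(2)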

laxmareddyp commented 1 year ago

Hi @jacob-zietek,

Thanks for the response. Can we close the issue here?

Thanks.

jacob-zietek commented 1 year ago

Sure thing, thank you.
