google-research / pix2seq

Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)
Apache License 2.0

training gets stuck #1

Closed: hust-nj closed this issue 2 years ago

hust-nj commented 2 years ago

Hello, I am training the model with the following command:

python run.py --mode=train --model_dir=/tmp/model_dir --config=configs/config_det_finetune.py --config.dataset.coco_annotations_dir=/home/t-liuze/coco --config.train.batch_size=32 --config.train.epochs=20 --config.optimization.learning_rate=3e-5

It prints the following log:

I0328 17:36:25.534775 139824410076992 logging_logger.py:44] Constructing tf.data.Dataset coco for split train, from /home/t-liuze/tensorflow_datasets/coco/2017/1.1.0
I0328 17:36:25.781254 139824410076992 coco.py:174] Loading annotations from /home/t-liuze/coco/instances_train2017.json
I0328 17:36:51.126714 139824410076992 utils.py:189] Restoring from latest checkpoint: gs://pix2seq/obj365_pretrain/vit_b_640x640_b256_s400k/ckpt-400000
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0328 17:36:58.118989 139824410076992 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0328 17:36:58.120561 139824410076992 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0328 17:36:58.123781 139824410076992 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0328 17:36:58.124701 139824410076992 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0328 17:36:58.128313 139824410076992 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0328 17:36:58.129202 139824410076992 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0328 17:36:58.131696 139824410076992 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0328 17:36:58.132594 139824410076992 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0328 17:36:58.136107 139824410076992 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0328 17:36:58.136971 139824410076992 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2022-03-28 17:36:58.212910: W tensorflow/core/grappler/optimizers/data/slack.cc:103] Could not find a final `prefetch` in the input pipeline to which to introduce slack.

Then the program gets stuck.

I have tried a single GPU, and I have also tried uncommenting the alternatives in this block from utils.py:

    cross_device_ops = None  # tf.distribute.NcclAllReduce() by default
    # if the default cross_device_ops fails, try either of the following two
    # by uncommenting it.
    # cross_device_ops = tf.distribute.HierarchicalCopyAllReduce()
    # cross_device_ops = tf.distribute.ReductionToOneDevice()

The program still gets stuck.

Here are my TensorFlow package versions:

tf-docker ~ > pip list | grep tensorflow
tensorflow                   2.7.1
tensorflow-addons            0.16.1
tensorflow-datasets          4.5.2
tensorflow-estimator         2.7.0
tensorflow-io-gcs-filesystem 0.23.1
tensorflow-metadata          1.7.0
chentingpc commented 2 years ago

Have you monitored the GPU utilization (e.g., using the nvidia-smi command)? It may look stuck but actually be running: logging is done once per epoch by default (you can change the frequency), so it may take a while to show progress. You could also set up TensorBoard, which should update status a bit more frequently.
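For example, something like the sketch below surfaces both signals (not from the repo; it assumes nvidia-smi is on PATH and the tensorboard Python package is installed):

    # Sketch (not repo code): serve TensorBoard for the model_dir used above and
    # print GPU utilization once, so slow-but-alive training is easy to spot.
    import subprocess
    from tensorboard import program

    tb = program.TensorBoard()
    tb.configure(argv=[None, '--logdir', '/tmp/model_dir'])  # same dir as --model_dir
    print('TensorBoard running at', tb.launch())  # keep this process alive while browsing

    # One-off utilization query; loop it, or just run `watch -n 1 nvidia-smi`.
    subprocess.run(['nvidia-smi', '--query-gpu=utilization.gpu,memory.used', '--format=csv'])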


hust-nj commented 2 years ago

GPU utilization is always 0% (see the attached nvidia-smi screenshot); I am using 4 V100 GPUs.

chentingpc commented 2 years ago

It may be that reading the pretrained checkpoint from the cloud bucket is blocking it. You can download it manually with gsutil cp -r gs://cloud_folder local_folder, and update pretrained_ckpt in the config file accordingly.
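If gsutil is not handy, a rough pure-Python equivalent using tf.io.gfile is sketched below (paths are only examples, it assumes the checkpoint directory is flat, and if TF's own GCS access is what hangs, gsutil remains the safer route):

    # Rough sketch: copy a public checkpoint folder to local disk with tf.io.gfile
    # (GCS support comes from tensorflow-io-gcs-filesystem). Adjust paths as needed.
    import os
    import tensorflow as tf

    src = 'gs://pix2seq/obj365_pretrain/vit_b_640x640_b256_s400k'  # from the log above
    dst = '/tmp/ckpts/vit_b_640x640_b256_s400k'                    # example local dir

    tf.io.gfile.makedirs(dst)
    for name in tf.io.gfile.listdir(src):  # assumes only files, no subdirectories
        tf.io.gfile.copy(os.path.join(src, name), os.path.join(dst, name), overwrite=True)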

hust-nj commented 2 years ago

Solved by downloading the checkpoint manually. Thank you!

hust-nj commented 2 years ago

Hi, I can train the model normally without the --run_eagerly option, but if I append --run_eagerly to my command, training again gets stuck at the start. It hangs after logging:

I0330 14:15:32.961766 139646374647552 dataset_info.py:439] Load dataset info from /mnt/unsup/tensorflow_datasets/coco/2017/1.1.0
I0330 14:15:32.964545 139646374647552 dataset_builder.py:369] Reusing dataset coco (/mnt/unsup/tensorflow_datasets/coco/2017/1.1.0)
I0330 14:15:32.964972 139646374647552 logging_logger.py:44] Constructing tf.data.Dataset coco for split train, from /mnt/unsup/tensorflow_datasets/coco/2017/1.1.0
/mnt/unsup/miniconda3/envs/pix2seq-tf/lib/python3.8/site-packages/tensorflow/python/data/ops/structured_function.py:264: UserWarning: Even though the `tf.config.experimental_run_functions_eagerly` option is set, this option does not apply to tf.data functions. To force eager execution of tf.data functions, please use `tf.data.experimental.enable_debug_mode()`.
  warnings.warn(
I0330 14:15:33.334566 139646374647552 coco.py:174] Loading annotations from /mnt/unsup/data/coco/annotations/instances_train2017.json
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0330 14:15:51.731360 139646374647552 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0330 14:15:51.733140 139646374647552 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0330 14:15:51.737524 139646374647552 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0330 14:15:51.738681 139646374647552 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0330 14:15:51.743675 139646374647552 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0330 14:15:51.744870 139646374647552 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0330 14:15:51.748851 139646374647552 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0330 14:15:51.750024 139646374647552 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0330 14:15:51.755101 139646374647552 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0330 14:15:51.756291 139646374647552 cross_device_ops.py:616] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2022-03-30 14:15:51.856368: W tensorflow/core/grappler/optimizers/data/slack.cc:103] Could not find a final `prefetch` in the input pipeline to which to introduce slack.
WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `run` inside a tf.function to get the best performance.
W0330 14:15:52.862900 139646374647552 mirrored_run.py:85] Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `run` inside a tf.function to get the best performance.
2022-03-30 14:15:53.472710: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8201
2022-03-30 14:15:54.365601: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] PERMISSION_DENIED: /tmp/tempfile-x1-612ee5d2-30539-5db697a5aae33; Permission denied
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
2022-03-30 14:15:55.341206: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8201
2022-03-30 14:15:56.489568: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8201
2022-03-30 14:15:57.725288: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8201
INFO:tensorflow:batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
I0330 14:15:58.575618 139646374647552 cross_device_ops.py:897] batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1

I want to debug in eager mode. How can I solve this? Thank you! GPU utilization is 0% (see the attached screenshot).

chentingpc commented 2 years ago

This doesn't happen to me when training on GPUs. Have you tried the suggestion on the readme page?

(Optional) If training fails at the start (due to NcclAllReduce error), try a different cross_device_ops for tf.distribute.MirroredStrategy in utils.py:build_strategy function.
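Concretely, the swap looks roughly like this (an illustrative sketch; the real utils.py:build_strategy in the repo does more than this):

    # Illustrative sketch of choosing a different cross_device_ops for
    # MirroredStrategy when the default NCCL all-reduce hangs at startup.
    import tensorflow as tf

    cross_device_ops = tf.distribute.HierarchicalCopyAllReduce()
    # cross_device_ops = tf.distribute.ReductionToOneDevice()

    strategy = tf.distribute.MirroredStrategy(cross_device_ops=cross_device_ops)
    print('Replicas in sync:', strategy.num_replicas_in_sync)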

hust-nj commented 2 years ago

This doesn't happen to me when training on GPUs. Have you tried the suggestion on the readme page?

(Optional) If training fails at the start (due to NcclAllReduce error), try a different cross_device_ops for tf.distribute.MirroredStrategy in utils.py:build_strategy function.

Using cross_device_ops = tf.distribute.HierarchicalCopyAllReduce() or cross_device_ops = tf.distribute.ReductionToOneDevice() with eager mode, it continuously prints the log shown in the attached screenshot, and the GPU utilization is very low. Is this normal?

chentingpc commented 2 years ago

We mainly use eager mode for debugging and non-eager mode for actual training. So in your case it is running (and printing parameters every step), but you may want to remove --run_eagerly for better efficiency once you are sure there is no bug. It may be possible to make --run_eagerly more efficient for actual training, but we did not try it.
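For reference, the switches behind eager-mode debugging in TF 2.x boil down to roughly the following sketch (per the warning in your log, --run_eagerly corresponds to the first call, and the second is what that warning suggests for tf.data pipelines):

    # Sketch of the TF 2.x eager-debugging switches; call both before building
    # the dataset and model if you want everything to run eagerly.
    import tensorflow as tf

    tf.config.run_functions_eagerly(True)      # execute tf.function bodies eagerly
    tf.data.experimental.enable_debug_mode()   # also run tf.data transformations eagerly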

Haiyang-W commented 1 year ago

This doesn't happen to me when training on GPUs. Have you tried the suggestion on the readme page?

(Optional) If training fails at the start (due to NcclAllReduce error), try a different cross_device_ops for tf.distribute.MirroredStrategy in utils.py:build_strategy function.

Using cross_device_ops = tf.distribute.HierarchicalCopyAllReduce() or cross_device_ops = tf.distribute.ReductionToOneDevice() with eager mode, it continuously prints the log shown in the attached screenshot, and the GPU utilization is very low. Is this normal?

I had the same problem. Did you succeed?