google / automl

Google Brain AutoML
Apache License 2.0

Child Process causes a CUDA Problem #855

Closed staeff777 closed 3 years ago

staeff777 commented 3 years ago

I'm trying multi-GPU training with the legacy model, using child processes to avoid the memory leak.

python automl/efficientdet/main.py \
--model_dir=$TRAIN_DIR \
--train_batch_size=$TRAIN_BATCH \
--eval_batch_size=$TRAIN_BATCH \
--training_file_pattern=$TRAIN_FILE \
--validation_file_pattern=$VAL_FILE \
--num_examples_per_epoch=$EXAMPLES_PER_EPOCH \
--num_epochs=$EPOCHS \
--mode=train_and_eval \
--ckpt=/input/basemodel/efficientdet-d$D \
--model_name=efficientdet-d$D \
--hparams=$yamlconf \
--eval_samples=602 \
--save_checkpoints_steps=$EXAMPLES_PER_EPOCH \
--run_epoch_in_child_process=True \
--strategy=gpus

However, running the training in a separate process causes the following error:

INFO:tensorflow:Done calling model_fn.
I1106 10:52:29.651449 139927383684864 api.py:340] Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
I1106 10:52:29.653028 139927392077568 api.py:340] Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
I1106 10:52:29.654298 139927400470272 api.py:340] Done calling model_fn.
INFO:tensorflow:Done calling model_fn.
I1106 10:52:29.655233 139927408862976 api.py:340] Done calling model_fn.
INFO:tensorflow:Reduce to /replica:0/task:0/device:CPU:0 then broadcast to ('/replica:0/task:0/device:CPU:0',).
I1106 10:52:29.656762 139936124057408 cross_device_ops.py:443] Reduce to /replica:0/task:0/device:CPU:0 then broadcast to ('/replica:0/task:0/device:CPU:0',).
WARNING:tensorflow:AutoGraph could not transform <function _combine_distributed_scaffold.<locals>.<lambda> at 0x7f42e12b0620> and will run it as-is.
Cause: could not parse the source code:
      lambda scaffold: scaffold.ready_op, args=(grouped_scaffold,))
This error may be avoided by creating the lambda in a standalone statement.
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
W1106 10:52:29.667324 139927408862976 ag_logging.py:146] AutoGraph could not transform <function _combine_distributed_scaffold.<locals>.<lambda> at 0x7f42e12b0620> and will run it as-is.
Cause: could not parse the source code:
      lambda scaffold: scaffold.ready_op, args=(grouped_scaffold,))
This error may be avoided by creating the lambda in a standalone statement.
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
INFO:tensorflow:Create CheckpointSaverHook.
I1106 10:52:34.129860 139936124057408 basic_session_run_hooks.py:546] Create CheckpointSaverHook.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/util.py:96: DistributedIteratorV1.initialize (from tensorflow.python.distribute.input_lib) is deprecated and will be removed in a future version.
Instructions for updating:
Use the iterator's `initializer` property instead.
W1106 10:52:59.915615 139936124057408 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/util.py:96: DistributedIteratorV1.initialize (from tensorflow.python.distribute.input_lib) is deprecated and will be removed in a future version.
Instructions for updating:
Use the iterator's `initializer` property instead.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1106 10:53:05.935137 139936124057408 cross_device_ops.py:443] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1106 10:53:05.938297 139936124057408 cross_device_ops.py:443] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1106 10:53:05.945053 139936124057408 cross_device_ops.py:443] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1106 10:53:05.948360 139936124057408 cross_device_ops.py:443] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1106 10:53:05.955206 139936124057408 cross_device_ops.py:443] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1106 10:53:05.958209 139936124057408 cross_device_ops.py:443] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1106 10:53:05.965017 139936124057408 cross_device_ops.py:443] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1106 10:53:05.968060 139936124057408 cross_device_ops.py:443] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Graph was finalized.
I1106 10:53:19.853717 139936124057408 monitored_session.py:246] Graph was finalized.
2020-11-06 10:53:19.854275: E tensorflow/stream_executor/cuda/cuda_driver.cc:1128] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED: initialization error
2020-11-06 10:53:19.854323: E tensorflow/stream_executor/cuda/cuda_driver.cc:1128] could not retrieve CUDA device count: CUDA_ERROR_NOT_INITIALIZED: initialization error
Process Process-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _run_fn
    self._extend_graph()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1388, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by {{node cond_1/then/_16/NcclAllReduce}} with these attrs: [shared_name="c0", T=DT_FLOAT, num_devices=4, reduction="sum"]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  device='GPU'
     [[cond_1/then/_16/NcclAllReduce]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "automl/efficientdet/main.py", line 346, in run_train_and_eval
    max_steps=e * FLAGS.num_examples_per_epoch // FLAGS.train_batch_size)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1173, in _train_model
    return self._train_model_distributed(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1235, in _train_model_distributed
    self._config._train_distribute, input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1349, in _actual_train_model_distributed
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1507, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 604, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1038, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 749, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1231, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1236, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 902, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 669, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/session_manager.py", line 301, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 958, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1181, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by {{node cond_1/then/_16/NcclAllReduce}} with these attrs: [shared_name="c0", T=DT_FLOAT, num_devices=4, reduction="sum"]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  device='GPU'
     [[cond_1/then/_16/NcclAllReduce]]

The problem could be that TF is already initialized in the parent process (according to an old source). However, it seems that I'm the only one facing this problem, so maybe there is something I'm doing wrong?
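For illustration, here is a minimal, self-contained probe of that hypothesis (a sketch, not code from this repo): the parent touches the GPU device list and runs a tiny op, which initializes CUDA, and then a forked child and a spawned child each report what they see. The exact outcome depends on the TF and driver versions, but a spawned child starts with fresh CUDA state.

import multiprocessing

import tensorflow as tf


def probe(label):
    # List visible GPUs and run a tiny op, which forces this process to use CUDA.
    print(label, tf.config.list_physical_devices('GPU'))
    print(label, tf.reduce_sum(tf.ones([2, 2])).numpy())


if __name__ == '__main__':
    probe('parent:')  # initializes CUDA in the parent before any child starts
    for method in ('fork', 'spawn'):
        # A forked child inherits the parent's CUDA state; a spawned child
        # starts a fresh interpreter and initializes CUDA from scratch.
        ctx = multiprocessing.get_context(method)
        p = ctx.Process(target=probe, args=(method + ' child:',))
        p.start()
        p.join()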

fsx950223 commented 3 years ago

NcclAllReduce only supports GPU.

staeff777 commented 3 years ago

Yes. This is due to the missing GPU information in the subprocess. But this does not answer the question.

fsx950223 commented 3 years ago

You should specify another GPU for the subprocess, or set GPU memory growth to True.
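As a rough sketch of these two suggestions (assuming TF 2.x; the function name and GPU id are made up for illustration): pin each subprocess to its own GPU via CUDA_VISIBLE_DEVICES and enable memory growth before the first GPU op runs in that process.

import os


def configure_child_gpu(gpu_id):
    # Pin this subprocess to a single GPU; must happen before TF initializes CUDA.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)

    import tensorflow as tf  # imported here so CUDA is first touched in the child
    for gpu in tf.config.experimental.list_physical_devices('GPU'):
        # Allocate GPU memory on demand instead of grabbing all of it up front.
        tf.config.experimental.set_memory_growth(gpu, True)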

fsx950223 commented 3 years ago

Not importing TensorFlow in the main process also works. Set CUDA_VISIBLE_DEVICES= in the main process.
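A minimal sketch of that idea (hypothetical, not the repo's code): the parent hides the GPUs and never touches TensorFlow, while the child restores CUDA_VISIBLE_DEVICES and imports TensorFlow only after the fork, so CUDA is first initialized inside the child.

import multiprocessing
import os


def train_one_epoch(visible_gpus):
    # Restore GPU visibility and import TF only inside the child process.
    os.environ['CUDA_VISIBLE_DEVICES'] = visible_gpus
    import tensorflow as tf
    print(tf.config.list_physical_devices('GPU'))
    # ... build the estimator and run one epoch here ...


if __name__ == '__main__':
    visible = os.environ.get('CUDA_VISIBLE_DEVICES', '0,1,2,3')  # assumes four GPUs
    os.environ['CUDA_VISIBLE_DEVICES'] = ''  # hide GPUs from the parent process
    p = multiprocessing.Process(target=train_one_epoch, args=(visible,))
    p.start()
    p.join()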

staeff777 commented 3 years ago

I tried both.

fsx950223 commented 3 years ago

https://github.com/tensorflow/tensorflow/issues/8220#issuecomment-383462576

staeff777 commented 3 years ago

Thank you. I thought about this too. Wouldn't this be, in this case, equivalent to restarting the training script each epoch, since TF has to be initialized in each subprocess?

That's why I was asking whether I'm really the only one facing this problem. At the moment I'm training with commit 66995a6 ("Handle the case of grad is None"), where the memory leak did not yet appear.

I'll also try your Keras model again. It's probably time to change. ;-)

fsx950223 commented 3 years ago
import multiprocessing

import tensorflow

tf = tensorflow.compat.v1

config = tf.ConfigProto()

# Build the graph once in the parent, then run it from two child processes in turn.
with tf.Graph().as_default() as g:
  dataset = tf.data.Dataset.from_tensors([1, 2, 3]).map(lambda x: x * x)
  iterator = tf.data.make_one_shot_iterator(dataset)
  next_element = iterator.get_next()

  def consumer():
    # Each child opens its own session on the shared graph and checks GPU access.
    with tf.Session(config=config, graph=g) as sess:
      print(tf.test.is_gpu_available())
      try:
        print(sess.run(next_element))
      except tf.errors.OutOfRangeError:
        print('error')  # the one-shot iterator has already been consumed

  p = multiprocessing.Process(target=consumer, args=())
  p.daemon = False
  p.start()
  p.join()

  p2 = multiprocessing.Process(target=consumer, args=())
  p2.daemon = False
  p2.start()
  p2.join()

Maybe this one is better.

staeff777 commented 3 years ago

Thank you. I could run your example, but I could not adapt it to the TPUEstimator. I assume it's because the TPUEstimator creates its own session.

majnas commented 3 years ago

Hi, I have the same problem when using --strategy=gpus. Does anybody know the solution?

fitoule commented 2 years ago

I have the same issue, even with only one GPU :( It works without the child-process option, so my options are limited.

It was working with TensorFlow up to 2.5.2, but now efficientdet requires TF > 2.8, so I am totally stuck.