nginyc / rafiki

Rafiki is a distributed system that supports training and deployment of machine learning models using AutoML, built with ease-of-use in mind.
Apache License 2.0

Error 'cudnn PoolForward launch failed' when doing average pooling #154

Open vivansxu opened 5 years ago

vivansxu commented 5 years ago

When implementing the PGGANs model on the dev branch, I ran into a 'cudnn PoolForward launch failed' error during average pooling.

I suspect this error is related to GPU memory allocation. I got the same error when implementing PGGANs on the master branch, where it was solved by decreasing the minibatch size and adding a sleep before the pooling operation. However, these workarounds do not work on the dev branch.

The model works fine with test_model_class, and there is also no error when I run test_model_class manually inside the worker image, so I don't think the error is caused by an environment problem.

Besides, I noticed that on the dev branch, whether the model runs in Rafiki or just through test_model_class, all GPU memory is allocated at the start of a trial, even though I have already set tf.ConfigProto().gpu_options.allow_growth=True, which should make TensorFlow allocate only as much GPU memory as it needs. My model on the master branch does not have this problem, so I am not sure whether this is the cause of the error.
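
For reference, a minimal sketch of the allow_growth setup mentioned above (TF 1.x; the session and graph details are placeholders, not the actual model code):

import tensorflow as tf  # TF 1.x, matching the environment in the traceback below

# Sketch only: the model graph itself is omitted.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grab GPU memory only as it is actually needed

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... training steps would run here ...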

Thank you so much for your help!

Here is the error trace:

Traceback (most recent call last):
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: cudnn PoolForward launch failed
  [[{{node GPU0/D_loss/D/cond/Downscale2D/AvgPool}} = AvgPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 8, 8], padding="VALID", strides=[1, 1, 8, 8], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
  [[{{node TrainD/ApplyGrads0/UpdateWeights/cond/pred_id/_921}} = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14500_TrainD/ApplyGrads0/UpdateWeights/cond/pred_id", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/rafiki/worker/train.py", line 111, in _perform_trial
    self._train_model(model_inst, proposal, shared_params)
  File "/root/rafiki/worker/train.py", line 167, in _train_model
    model_inst.train(train_dataset_path, shared_params=shared_params, **(train_args or {}))
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 599, in train
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 910, in _train_progressive_gan
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: cudnn PoolForward launch failed
  [[node GPU0/D_loss/D/cond/Downscale2D/AvgPool (defined at /root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py:534) = AvgPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 8, 8], padding="VALID", strides=[1, 1, 8, 8], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
  [[{{node TrainD/ApplyGrads0/UpdateWeights/cond/pred_id/_921}} = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14500_TrainD/ApplyGrads0/UpdateWeights/cond/pred_id", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'GPU0/D_loss/D/cond/Downscale2D/AvgPool', defined at:
  File "scripts/start_worker.py", line 58, in
    run_worker(meta_store, start_worker, stop_worker)
  File "/root/rafiki/utils/service.py", line 50, in run_worker
    start_worker(service_id, service_type, container_id)
  File "scripts/start_worker.py", line 40, in start_worker
    worker.start()
  File "/root/rafiki/worker/train.py", line 68, in start
    result = self._perform_trial(proposal)
  File "/root/rafiki/worker/train.py", line 111, in _perform_trial
    self._train_model(model_inst, proposal, shared_params)
  File "/root/rafiki/worker/train.py", line 167, in _train_model
    model_inst.train(train_dataset_path, shared_params=shared_params, **(train_args or {}))
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 599, in train
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 875, in _train_progressive_gan
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 1382, in _D_wgangp_acgan
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 233, in get_output_for
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 440, in D_paper
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 436, in grow
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 555, in
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2097, in cond
    orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1930, in BuildCondBranch
    original_result = fn()
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 433, in
  File "/root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py", line 534, in _downscale2d
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/ops/nn_ops.py", line 2110, in avg_pool
    name=name)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 72, in avg_pool
    data_format=data_format, name=name)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/envs/rafiki/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): cudnn PoolForward launch failed
  [[node GPU0/D_loss/D/cond/Downscale2D/AvgPool (defined at /root/PG_GANs-b250f22d-0c98-4dac-8e19-8c39ae7af345.py:534) = AvgPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 8, 8], padding="VALID", strides=[1, 1, 8, 8], _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
  [[{{node TrainD/ApplyGrads0/UpdateWeights/cond/pred_id/_921}} = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_14500_TrainD/ApplyGrads0/UpdateWeights/cond/pred_id", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

nginyc commented 5 years ago

Thanks for the bug report @vivansxu!

An environment discrepancy I can think of is that on the master branch and in your local development environment, the CUDA_VISIBLE_DEVICES environment variable isn't set, whereas on the dev branch, as part of the worker deployment logic, each running train worker instance is exclusively assigned a single GPU by its GPU number, e.g. CUDA_VISIBLE_DEVICES=1. Maybe you can test whether model training works locally when this environment variable is set (e.g. to a non-zero GPU number). As mentioned in the model developer guide, your model should be sensitive to that environment variable.
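
For example, a quick local check could look something like this (an illustrative sketch only, not the actual Rafiki worker code; the script name is hypothetical):

import os
from tensorflow.python.client import device_lib  # TF 1.x

# Run with the variable set, e.g.: CUDA_VISIBLE_DEVICES=1 python check_gpu.py
print('CUDA_VISIBLE_DEVICES =', os.environ.get('CUDA_VISIBLE_DEVICES'))

# With a single GPU number assigned, TensorFlow should report exactly one GPU,
# which is always addressed as '/device:GPU:0' inside the process.
gpus = [d.name for d in device_lib.list_local_devices() if d.device_type == 'GPU']
print('GPUs visible to TensorFlow:', gpus)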

vivansxu commented 5 years ago

@nginyc Thank you for your reply! Actually, every time I run test_model_class, I do export CUDA_VISIBLE_DEVICES. Today I also noticed that right before the error occurred, GPU memory usage reached 11014/11019 MB, so I am not sure whether this is actually an out-of-memory problem. I am now trying to deploy the dev branch on AWS with a V100. However, I am not sure this will solve the problem, because Rafiki tries to allocate almost all GPU memory at the beginning, and the remaining GPU memory is less than 1 GB, which is not enough for the subsequent pooling operation.
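
For reference, a minimal way to log GPU memory usage alongside training (a sketch using nvidia-smi; the helper is illustrative and not part of the model code):

import subprocess

def log_gpu_memory():
    # Print used/total memory per GPU as reported by nvidia-smi.
    out = subprocess.check_output([
        'nvidia-smi',
        '--query-gpu=index,memory.used,memory.total',
        '--format=csv,noheader',
    ])
    print(out.decode().strip())

log_gpu_memory()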

Thank you!

ShiraStarL commented 3 years ago

I had the same problem. I solved it like this:

Check the CUDA version with nvidia-smi: the result was 11.0.

Check the CUDA version with nvcc --version: the result was 10.0.

cd /usr/local
rm cuda
ln -s cuda-11.0 cuda

It turned out that the CUDA symbolic link wasn't correct: it pointed to CUDA 10.0 instead of CUDA 11.0.

multicuda-multiple-versions-of-cuda-on-one-machine
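
A minimal sketch to confirm where the symlink resolves after the change (assuming the standard /usr/local/cuda layout):

import os
import subprocess

# Where does /usr/local/cuda actually resolve to? It should end in 'cuda-11.0' after the fix.
print(os.path.realpath('/usr/local/cuda'))

# The toolkit version reported by nvcc should now match the link target.
print(subprocess.check_output(['nvcc', '--version']).decode())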