renqianluo / NAO

Neural Architecture Optimization
GNU General Public License v3.0

train_search.sh 'Caused by op 'child_1/stem_conv/Conv2D'' #14

zihaozhang9 closed this issue 5 years ago

zihaozhang9 commented 5 years ago

tensorflow 1.13.1 pytorch 0.4.1

I am running the code:

cd NAO-WS/cnn
bash train_final.sh

log:

`2019-06-19 07:42:32.452404: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-06-19 07:42:32.479993: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
  [[{{node child_1/stem_conv/Conv2D}}]]
  [[{{node child_2/gradients/concat_14}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_search.py", line 382, in <module>
    tf.app.run(argv=[sys.argv[0]] + unparsed)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train_search.py", line 376, in main
    train()
  File "train_search.py", line 214, in train
    child_epoch = child_train(child_params)
  File "/NAO/NAO-WS/cnn/model_search.py", line 1142, in train
    loss, lr, gn, tr_acc, _ = sess.run(run_ops)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/opt/conda/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
    run_metadata=run_metadata)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
    return self._sess.run(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
  [[node child_1/stem_conv/Conv2D (defined at /NAO/NAO-WS/cnn/model_search.py:529) ]]
  [[node child_2/gradients/concat_14 (defined at /NAO/NAO-WS/cnn/utils.py:62) ]]

Caused by op 'child_1/stem_conv/Conv2D', defined at:
  File "train_search.py", line 382, in <module>
    tf.app.run(argv=[sys.argv[0]] + unparsed)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "train_search.py", line 376, in main
    train()
  File "train_search.py", line 214, in train
    child_epoch = child_train(child_params)
  File "/NAO/NAO-WS/cnn/model_search.py", line 1118, in train
    child_ops = get_ops(images, labels, params)
  File "/NAO/NAO-WS/cnn/model_search.py", line 1097, in get_ops
    child_model.connect_controller(params['arch_pool'], params['arch_pool_prob'])
  File "/NAO/NAO-WS/cnn/model_search.py", line 1059, in connect_controller
    self._build_train()
  File "/NAO/NAO-WS/cnn/model_search.py", line 980, in _build_train
    logits = self._model(self.x_train, is_training=True, reuse=tf.AUTO_REUSE)
  File "/NAO/NAO-WS/cnn/model_search.py", line 529, in _model
    images, w, [1, 1, 1, 1], "SAME", data_format=self.data_format)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
  [[node child_1/stem_conv/Conv2D (defined at /NAO/NAO-WS/cnn/model_search.py:529) ]]
  [[node child_2/gradients/concat_14 (defined at /NAO/NAO-WS/cnn/utils.py:62) ]]`
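
The error message asks the reader to look for a warning printed above the traceback. Here that warning is the pair of `Could not create cudnn handle` lines at the top of the log. As a minimal sketch (not part of the NAO repository; the helper name `find_cudnn_handle_errors` and the regular expression are assumptions based only on the log format shown above), such lines could be pulled out of a long log like this:

```python
import re

# Assumed line format, taken from the log above:
# "<timestamp>: E <source file>:<line>] Could not create cudnn handle: <status>"
CUDNN_HANDLE_ERROR = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): E \S+\] "
    r"Could not create cudnn handle: (?P<status>CUDNN_STATUS_\w+)"
)


def find_cudnn_handle_errors(log_text):
    """Return a (timestamp, status) pair for each failed cuDNN handle creation."""
    return [
        (match.group("timestamp"), match.group("status"))
        for match in CUDNN_HANDLE_ERROR.finditer(log_text)
    ]


# The first two lines of the log above, followed by the start of the traceback.
sample_log = (
    "2019-06-19 07:42:32.452404: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] "
    "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR\n"
    "2019-06-19 07:42:32.479993: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] "
    "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR\n"
    "Traceback (most recent call last):\n"
)

for timestamp, status in find_cudnn_handle_errors(sample_log):
    print(timestamp, status)
```

The status code alone does not identify the root cause. It only confirms that cuDNN failed to initialize before the `Conv2D` op ran, which is what the `UnknownError` text suggests.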

renqianluo commented 5 years ago

@zihaozhang9 Hi, maybe you can try the PyTorch version.