uber-research / deep-neuroevolution

Deep Neuroevolution

how to run gpu_implementation on GPU #26

Open ZaneH1992 opened 5 years ago

ZaneH1992 commented 5 years ago

With an NVIDIA 1080 Ti, I was able to compile gym_tensorflow.so and run the experiment. The environment was tensorflow-gpu 1.8.0 with CUDA 9.0. After switching to a 2080 Ti, the experiment fails as follows:

2019-06-02 18:03:10.727082: E tensorflow/stream_executor/cuda/cuda_blas.cc:654] failed to run cuBLAS routine cublasSgemmBatched: CUBLAS_STATUS_EXECUTION_FAILED
2019-06-02 18:03:10.727109: E tensorflow/stream_executor/cuda/cuda_blas.cc:2413] Internal: failed BLAS call, see log for details
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[1,441,256], b.shape=[64,256,16], idx.shape=[1], m=441, n=16, k=256, batch_size=1
  [[Node: model/Model/conv1/IndexedBatchMatMul = IndexedBatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/Model/conv1/Reshape_2, model/Model/conv1/Reshape, _arg_model/Placeholder_0_0/_37)]]
  [[Node: model/Identity_1/_59 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_210_model/Identity_1", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/root/hz/deep-neuroevolution/gpu_implementation/neuroevolution/concurrent_worker.py", line 94, in _loop
    rews, isdone, = self.sess.run([self.rew_op, self.done_op, self.incr_counter], {self.placeholder_indices: indices})
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)

I tried upgrading tensorflow-gpu to 1.12 and 1.13, but in that environment gym_tensorflow.so could not be compiled. I also found a similar issue at https://github.com/qqwweee/keras-yolo3/issues/332, but still had no luck after installing the patches for CUDA 9. Is there a solution?

fps7806 commented 5 years ago

I think 2080ti requires CUDA 10, and we haven't tested this code with CUDA 10 yet so there might be some problems. I'll see if we can fix it and I'll let you know if we find a solution.
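The version mismatch described above can be sketched as a small lookup. This is only an illustration, not part of the repository: the function names `min_cuda_for` and `is_natively_supported` are hypothetical, and the table covers only a few example cards, using NVIDIA's published compute capabilities (6.1 for the 1080 Ti, 7.5 for Turing cards such as the 2080 Ti and Tesla T4) and the first CUDA toolkit release that targets each one natively.

```python
# Compute capability of a few example cards (values published by NVIDIA).
COMPUTE_CAPABILITY = {
    "GTX 1080 Ti": (6, 1),   # Pascal
    "RTX 2080 Ti": (7, 5),   # Turing
    "Tesla T4": (7, 5),      # Turing
}

# First CUDA toolkit release that natively targets each compute capability.
# Only the capabilities needed for the cards above are listed.
MIN_CUDA_FOR_CAPABILITY = {
    (6, 1): (8, 0),
    (7, 0): (9, 0),
    (7, 5): (10, 0),
}


def min_cuda_for(gpu):
    """Return the earliest CUDA version (major, minor) that targets `gpu`."""
    return MIN_CUDA_FOR_CAPABILITY[COMPUTE_CAPABILITY[gpu]]


def is_natively_supported(gpu, cuda_version):
    """True if `cuda_version` (major, minor) natively supports `gpu`."""
    return cuda_version >= min_cuda_for(gpu)


if __name__ == "__main__":
    for gpu in COMPUTE_CAPABILITY:
        for cuda in [(9, 0), (10, 0)]:
            status = "ok" if is_natively_supported(gpu, cuda) else "too old"
            print(f"{gpu} with CUDA {cuda[0]}.{cuda[1]}: {status}")
```

Under these assumptions, a 1080 Ti is covered by CUDA 9.0 while a Turing card is not, which would be consistent with the cuBLAS failure reported here. It does not address the separate problem of compiling gym_tensorflow.so against a newer TensorFlow.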

ssrs5566 commented 5 years ago

Great, thank you. With CUDA 10.0 and tensorflow 1.13.0, gym_tensorflow could not be compiled.

deyandyankov commented 5 years ago

Any news on this? I am hitting the same issue with a Tesla T4 and various versions of tensorflow-gpu. So far I have only tried CUDA 9.0, though.