mrahtz / learning-from-human-preferences

Reproduction of OpenAI and DeepMind's "Deep Reinforcement Learning from Human Preferences"
MIT License

GRPC error #15

Open errorer-max opened 1 year ago

errorer-max commented 1 year ago

Hi @mrahtz, thanks for making this repo! I think this algorithm is a milestone in deep reinforcement learning. We installed all components according to the Pipfile and Pipfile.lock files, and a gRPC error occurred while training the reward predictor network after preference collection had finished. The first round of training went fine, but the error appeared on the second round.

Hardware resources: multi-core CPU, two GTX 1080 Ti GPUs

Running environment: Python 3.7, TensorFlow 1.15

Pipenv command: python3 run.py train_policy_with_preferences EnduroNoFrameskip-v4 --n_envs 16 --render_episodes --n_initial_prefs 10

Process Process-21:
Traceback (most recent call last):
  File "/home/mxm/.local/share/virtualenvs/learning-from-human-preferences-master-b0FE2Hdz/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/mxm/.local/share/virtualenvs/learning-from-human-preferences-master-b0FE2Hdz/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/mxm/.local/share/virtualenvs/learning-from-human-preferences-master-b0FE2Hdz/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.AbortedError: From /job:train/replica:0/task:0:
The same RecvTensor (GrpcWorker) request was received twice. step_id: 57349849272042118 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;c324b44e509a7e70;/job:train/replica:0/task:0/device:GPU:0;edge_106_pred_0/c2/bias/read;0:0" request_id: -5783289113748899051
Additional GRPC error information: {"created":"@1685585397.670607084","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"The same RecvTensor (GrpcWorker) request was received twice. step_id: 57349849272042118 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;c324b44e509a7e70;/job:train/replica:0/task:0/device:GPU:0;edge_106_pred_0/c2/bias/read;0:0" request_id: -5783289113748899051","grpc_status":10}
  [[{{node pred_0/c2/bias/read}}]]
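For context: the /job:ps and /job:train names in the traceback come from TensorFlow's distributed parameter-server setup, which (as far as I understand) is how the reward predictor's variables get shared between processes over gRPC even on a single machine. A minimal sketch of that kind of setup, assuming plain TF 1.15 APIs and a hypothetical script name and local ports (this is not the repo's actual code):

```python
# ps_train_sketch.py (hypothetical): minimal TF 1.x parameter-server setup
# of the kind the traceback's job names imply. Run in two terminals:
#   python ps_train_sketch.py ps
#   python ps_train_sketch.py train
import sys
import tensorflow as tf  # TF 1.15

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],      # hypothetical local ports
    "train": ["localhost:2223"],
})
job = sys.argv[1]  # "ps" or "train"
server = tf.train.Server(cluster, job_name=job, task_index=0)

if job == "ps":
    # The ps job just hosts the shared variables and serves them over gRPC.
    server.join()
else:
    # Variables placed on the ps job are read from the train job via
    # RecvTensor RPCs - the same kind of request that fails in the error above.
    with tf.device("/job:ps/task:0"):
        bias = tf.get_variable("bias", shape=[16])
    with tf.Session(server.target) as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(bias))
```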

mrahtz commented 1 year ago

Hmm, I'm sorry, I've never seen an error like that before, and I'm not sure what it means. It looks like it's coming from within TensorFlow, so my best guess is that it's something to do with your TensorFlow, CUDA, and cuDNN installations, or your GPU drivers. The only suggestion that comes to mind is to try installing NVIDIA's build of TensorFlow, https://github.com/NVIDIA/tensorflow, which seems to have better compatibility with newer GPU drivers.

errorer-max commented 1 year ago

Thank you very much for the suggestion. I tried installing https://github.com/NVIDIA/tensorflow/tree/r1.15.2 (the nv20.06 build); unfortunately, the same error still occurred. Before installing this version of TensorFlow, I removed all CUDA versions from the server to avoid conflicts between CUDA versions.

2023-06-02 17:01:36.278538: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 73633454253398810
Process Process-22:
Traceback (most recent call last):
  File "/home/mxm/.local/share/virtualenvs/rlhp-py38-ie9AYam1/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/mxm/.local/share/virtualenvs/rlhp-py38-ie9AYam1/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1349, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/mxm/.local/share/virtualenvs/rlhp-py38-ie9AYam1/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1441, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.AbortedError: From /job:train/replica:0/task:0:
The same RecvTensor (GrpcWorker) request was received twice. step_id: 73633454253398810 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;972bc9779fbc3f86;/job:train/replica:0/task:0/device:GPU:0;edge_216_pred_0/d2/bias/read;0:0" request_id: -8191416709793270049
Additional GRPC error information: {"created":"@1685696496.277802028","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"The same RecvTensor (GrpcWorker) request was received twice. step_id: 73633454253398810 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;972bc9779fbc3f86;/job:train/replica:0/task:0/device:GPU:0;edge_216_pred_0/d2/bias/read;0:0" request_id: -8191416709793270049","grpc_status":10}
  [[{{node pred_0/d2/bias/read}}]]
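As an aside, a quick check along these lines (standard TF 1.15 APIs, nothing specific to this repo) can confirm which TensorFlow build and which GPUs the virtualenv is actually picking up:

```python
# Sanity check: which TensorFlow build and GPUs the virtualenv is using.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)                # expect the 1.15.x NVIDIA build
print(tf.test.is_built_with_cuda())  # expect True
gpus = [d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"]
print(gpus)                          # expect two entries for the two 1080 Tis
```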

Before running the algorithm, in order to get through all the steps as quickly as possible, I used 'python3 run.py train_policy_with_preferences EnduroNoFrameskip-v4 --n_envs 16 --render_episodes --n_initial_prefs 15'. That is to say, I only entered 15 preferences before training of the reward prediction network started. Could this setting be causing the problem?

mrahtz commented 1 year ago

Hey, sorry for the slow reply - busy week.

If that didn't work, sorry, I'm out of ideas. I don't think it should make any difference which order you run the commands in - this really sounds like some weird error in TensorFlow itself. My impression is that TensorFlow 1.x is really pretty unsupported these days, so it might just be that it's too old.