google-deepmind / deepmind-research

This repository contains implementations and illustrative code to accompany DeepMind publications
Apache License 2.0

learn to simulate #395

Open soheilsh7 opened 1 year ago

soheilsh7 commented 1 year ago

I am trying to train a model using the following command: python3.6 -m learning_to_simulate.train --data_path=./learning_to_simulate/tmp/datasets/WaterDrop/ --model_path=./learning_to_simulate/tmp/models/WaterDrop

and I get the following error:

2022-11-10 13:53:37.015640: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 37778160 exceeds 10% of system memory.
2022-11-10 13:53:37.084779: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/home/soheilsh/anaconda3/envs/simulate/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/soheilsh/anaconda3/envs/simulate/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/soheilsh/anaconda3/envs/simulate/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(1356, 30), b.shape=(30, 128), m=1356, n=128, k=30
     [[{{node EncodeProcessDecode/graph_independent/node_model/sequential/mlp/linear_0/MatMul}}]]
  (1) Internal: Blas GEMM launch failed : a.shape=(1356, 30), b.shape=(30, 128), m=1356, n=128, k=30
     [[{{node EncodeProcessDecode/graph_independent/node_model/sequential/mlp/linear_0/MatMul}}]]
     [[truediv_4/_4847]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "anaconda3/envs/simulate/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
...

GPU : NVIDIA GeForce RTX 3060 Laptop

How can I solve this problem?

Many thanks in advance for your response :)

soheilsh7 commented 1 year ago

So, apparently there is a problem with TensorFlow and NVIDIA 30-series GPUs. I'm training the model with the same parameters in another environment with tensorflow-cpu==1.15, and it works fine. I still don't know how to solve the problem with tensorflow-gpu, though.
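For anyone who wants the same CPU fallback without creating a second environment, here is a minimal sketch. It assumes the tensorflow-gpu install is otherwise working and relies on the `CUDA_VISIBLE_DEVICES` environment variable, which hides GPUs from CUDA applications when set to `-1`. The helper names `cpu_only_env` and `run_training_on_cpu` are made up for this example and are not part of the repository.

```python
import os
import subprocess

# The training command from the original post, split into arguments.
TRAIN_CMD = [
    "python3.6", "-m", "learning_to_simulate.train",
    "--data_path=./learning_to_simulate/tmp/datasets/WaterDrop/",
    "--model_path=./learning_to_simulate/tmp/models/WaterDrop",
]


def cpu_only_env(base_env=None):
    """Return a copy of the environment with every CUDA device hidden.

    Hypothetical helper: with CUDA_VISIBLE_DEVICES set to "-1", TensorFlow
    finds no GPU and places all ops on the CPU.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


def run_training_on_cpu():
    """Launch the training command in a child process that cannot see the GPU."""
    return subprocess.run(TRAIN_CMD, env=cpu_only_env(), check=True)
```

Calling `run_training_on_cpu()` from the repository root should behave like the tensorflow-cpu run above. It only sidesteps the cuBLAS failure and does not fix GPU training.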

lukegreen2000 commented 1 year ago

https://www.pugetsystems.com/labs/hpc/How-To-Install-TensorFlow-1-15-for-NVIDIA-RTX30-GPUs-without-docker-or-CUDA-install-2005/#Installing_NVIDIAs_build_of_TensorFlow_115_in_a_conda_env
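After switching to a different TensorFlow 1.15 build such as the one the link describes, a quick way to check whether the GPU path works is to run a single matrix multiplication with the same shapes as the failing `MatMul` node. This is only a sketch: the function name `gemm_smoke_test` is made up for illustration, and it assumes a TensorFlow version that provides `tf.compat.v1` (1.15 does).

```python
def gemm_smoke_test(m=1356, k=30, n=128):
    """Multiply an (m, k) by a (k, n) matrix on the first GPU.

    The default shapes are copied from the "Blas GEMM launch failed" message.
    Raises a TensorFlow error if no GPU is usable or the cuBLAS call fails.
    """
    import tensorflow as tf  # imported lazily so this file loads without TF

    tf1 = tf.compat.v1
    config = tf1.ConfigProto()
    # Allocate GPU memory on demand instead of reserving it all up front.
    config.gpu_options.allow_growth = True

    with tf.Graph().as_default():
        # Pin the ops to the GPU so a silent CPU fallback cannot hide a problem.
        with tf.device("/GPU:0"):
            a = tf.ones((m, k), dtype=tf.float32)
            b = tf.ones((k, n), dtype=tf.float32)
            product = tf.matmul(a, b)
        with tf1.Session(config=config) as sess:
            result = sess.run(product)
    return result.shape
```

If `gemm_smoke_test()` returns without an error, the basic cuBLAS path is working, and the full training command is worth retrying.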