spoonsso / dannce


OOM error #145

Closed yuan0821 closed 1 year ago

yuan0821 commented 1 year ago

Hello! I get an OOM error when I run dannce-train with 160 labeled frames out of 201,600 total frames. I would appreciate any suggestions. @spoonsso @davidhildebrand Thank you very much.

Here is the error output:

```
2023-04-15 01:30:16.395703: I tensorflow/core/common_runtime/bfc_allocator.cc:1042] total_region_allocated_bytes_: 8848282624 memory_limit_: 8848282752 available bytes: 128 curr_region_allocation_bytes_: 17696565760
2023-04-15 01:30:16.395877: I tensorflow/core/common_runtime/bfc_allocator.cc:1048] Stats:
Limit:              8848282752
InUse:              8848217856
MaxInUse:           8848218112
NumAllocs:                 541
MaxAllocSize:       1846123264
Reserved:                    0
PeakReserved:                0
LargestFreeBlock:            0

2023-04-15 01:30:16.396440: W tensorflow/core/common_runtime/bfc_allocator.cc:441] ****
2023-04-15 01:30:16.396590: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops_3d.cc:327 : Resource exhausted: OOM when allocating tensor with shape[1,512,18,18,18] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "E:\anaconda\envs\tfnew_25\Scripts\dannce-train-script.py", line 33, in <module>
    sys.exit(load_entry_point('dannce', 'console_scripts', 'dannce-train')())
  File "f:\dannce\dannce\cli.py", line 66, in dannce_train_cli
    dannce_train(params)
  File "f:\dannce\dannce\interface.py", line 1272, in dannce_train
    workers=6,
  File "E:\anaconda\envs\tfnew_25\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1100, in fit
    tmp_logs = self.train_function(iterator)
  File "E:\anaconda\envs\tfnew_25\lib\site-packages\tensorflow\python\eager\def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "E:\anaconda\envs\tfnew_25\lib\site-packages\tensorflow\python\eager\def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "E:\anaconda\envs\tfnew_25\lib\site-packages\tensorflow\python\eager\function.py", line 2943, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "E:\anaconda\envs\tfnew_25\lib\site-packages\tensorflow\python\eager\function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "E:\anaconda\envs\tfnew_25\lib\site-packages\tensorflow\python\eager\function.py", line 560, in call
    ctx=ctx)
  File "E:\anaconda\envs\tfnew_25\lib\site-packages\tensorflow\python\eager\execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1,512,18,18,18] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node model_1/model/conv3d_7/Conv3D (defined at \threading.py:926) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[div_no_nan/ReadVariableOp_1/_78]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[1,512,18,18,18] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node model_1/model/conv3d_7/Conv3D (defined at \threading.py:926) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_7873]

Function call stack:
train_function -> train_function

2023-04-15 01:30:16.613096: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
	 [[{{node PyFunc}}]]
```
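For context, the allocator stats above show the GPU is essentially full before the failing allocation: the requested tensor is modest, but there is almost no headroom left. A minimal sketch of the arithmetic, assuming float32 (4 bytes per element):

```python
# Rough arithmetic from the BFC allocator stats in the log above.
limit_bytes = 8_848_282_752    # "Limit": total GPU memory the allocator may use (~8.2 GiB)
in_use_bytes = 8_848_217_856   # "InUse" at the moment of the failure

headroom = limit_bytes - in_use_bytes
print(f"headroom: {headroom} bytes")            # ~64 KB left

# Size of the tensor that failed to allocate: shape [1, 512, 18, 18, 18], float32.
requested = 1 * 512 * 18 * 18 * 18 * 4
print(f"requested: {requested / 1e6:.1f} MB")   # ~11.9 MB, far more than the headroom
```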

yuan0821 commented 1 year ago

Hi, I happened to change vol_size from 300 to 240, and the error no longer occurs. Thank you.
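For anyone hitting the same error: this change is made in the DANNCE training config file. A minimal sketch of the relevant entry, assuming a standard YAML config (only the key mentioned in the comment is shown; all other settings stay as they were):

```yaml
# DANNCE training config (excerpt) -- only the key mentioned above
# vol_size: 300   # original value that ran out of GPU memory here
vol_size: 240     # reduced value reported to avoid the OOM in this case
```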

data-hound commented 1 year ago

Thanks for posting the resolution. Closing this issue.