Hello! I get an OOM error when I run dannce-train with 160 labeled frames and 201600 frames. I would very much appreciate any suggestions. @spoonsso @davidhildebrand Thank you very much.
Here is the error information:
`2023-04-15 01:30:16.395703: I tensorflow/core/common_runtime/bfc_allocator.cc:1042] total_region_allocatedbytes: 8848282624 memorylimit: 8848282752 available bytes: 128 curr_region_allocationbytes: 17696565760
2023-04-15 01:30:16.395877: I tensorflow/core/common_runtime/bfc_allocator.cc:1048] Stats:
Limit: 8848282752
InUse: 8848217856
MaxInUse: 8848218112
NumAllocs: 541
MaxAllocSize: 1846123264
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2023-04-15 01:30:16.396440: W tensorflow/core/common_runtime/bfc_allocator.cc:441] ****
2023-04-15 01:30:16.396590: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops_3d.cc:327 : Resource exhausted: OOM when allocating tensor with shape[1,512,18,18,18] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "E:\anaconda\envs\tfnew_25\Scripts\dannce-train-script.py", line 33, in <module>
sys.exit(load_entry_point('dannce', 'console_scripts', 'dannce-train')())
File "f:\dannce\dannce\cli.py", line 66, in dannce_train_cli
dannce_train(params)
File "f:\dannce\dannce\interface.py", line 1272, in dannce_train
workers=6,
File "E:\anaconda\envs\tfnew_25\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1100, in fit
tmp_logs = self.train_function(iterator)
File "E:\anaconda\envs\tfnew_25\lib\site-packages\tensorflow\python\eager\def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "E:\anaconda\envs\tfnew_25\lib\site-packages\tensorflow\python\eager\def_function.py", line 888, in _call
return self._stateless_fn(*args, **kwds)
File "E:\anaconda\envs\tfnew_25\lib\site-packages\tensorflow\python\eager\function.py", line 2943, in __call__
filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
File "E:\anaconda\envs\tfnew_25\lib\site-packages\tensorflow\python\eager\function.py", line 1919, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "E:\anaconda\envs\tfnew_25\lib\site-packages\tensorflow\python\eager\function.py", line 560, in call
ctx=ctx)
File "E:\anaconda\envs\tfnew_25\lib\site-packages\tensorflow\python\eager\execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[1,512,18,18,18] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model_1/model/conv3d_7/Conv3D (defined at \threading.py:926) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[div_no_nan/ReadVariableOp_1/_78]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[1,512,18,18,18] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node model_1/model/conv3d_7/Conv3D (defined at \threading.py:926) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Function call stack:
train_function -> train_function
2023-04-15 01:30:16.613096: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]
0 successful operations. 0 derived errors ignored. [Op:__inference_train_function_7873]
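For scale, the numbers in the log can be put side by side. This is only a back-of-the-envelope sketch: it assumes "type float" in the OOM message means 4-byte float32, and the variable and function names are my own, not part of DANNCE or TensorFlow.

```python
# Values copied from the allocator log in this post.
limit_bytes = 8848282752            # "Limit"
in_use_bytes = 8848217856           # "InUse"
oom_shape = (1, 512, 18, 18, 18)    # tensor that failed to allocate
bytes_per_float32 = 4               # assumption: "type float" is float32


def tensor_bytes(shape, itemsize):
    """Return the size in bytes of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * itemsize


requested = tensor_bytes(oom_shape, bytes_per_float32)
free = limit_bytes - in_use_bytes

print(f"requested: {requested} bytes ({requested / 2**20:.1f} MiB)")
print(f"free:      {free} bytes ({free / 2**10:.1f} KiB)")
print(f"limit:     {limit_bytes / 2**30:.2f} GiB")
```

If that assumption holds, the failing request is only about 11.4 MiB, while less than 64 KiB of the roughly 8.24 GiB limit was still free. That suggests the GPU memory was already filled by earlier allocations rather than by this one tensor.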