Closed: nedo99 closed this issue 1 year ago
@nedo99 From your log, it ran out of memory. The Arc 770 only has 16GB of memory; could you reduce the batch size in your model? Thanks.
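For reference, this is a minimal sketch of what lowering the batch size looks like in a Keras-style training call; the model and data below are placeholders, not from the original script:

import tensorflow as tf

# Placeholder model and data -- the real ones come from nedo99's training script.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
model.compile(optimizer="adam", loss="mse")

x = tf.random.normal([1024, 100])
y = tf.random.normal([1024, 10])

# Lowering batch_size (e.g. from 128 to 32) shrinks the per-step activation
# and workspace allocations on the 16GB Arc 770.
model.fit(x, y, batch_size=32, epochs=1)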
Yes, it has 16GB, but from the logs it only tries to allocate around 10 GB. Also, the batch size is not the issue in this case: whatever value I set, I get the same error with the same number of bytes. A batch size of 32 is relatively small anyway; the initial value was 128.
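(For reference, the allocation size quoted from the log is 10529419660 bytes, and 10529419660 / 1024^3 ≈ 9.81 GiB, which is where the "around 10 GB" figure comes from.)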
10529419660
Do you mean you have a tensor that needs 10 GB of memory?
Whatever batch size or model size I use, I still get the same allocation size and the same error. There is this note about the 4GB maximum allocation: https://github.com/intel/compute-runtime/blob/master/programmers-guide/ALLOCATIONS_GREATER_THAN_4GB.md. I tried to override it with the environment variable ITEX_LIMIT_MEMORY_SIZE_IN_MB, but then I get this error:
W itex/core/utils/op_kernel.cc:355] ./itex/core/kernels/common/matmul_op.h:385 Invalid argument: Matrix size-incompatible: In[0]: [3,0], In[1]: [100,400]
terminate called after throwing an instance of 'dnnl::error'
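For context, this is roughly how such an override is applied; a minimal sketch, assuming the variable is read when TensorFlow/ITEX is imported and takes a value in MB (the 16384 below is only an illustrative value):

import os

# Set before importing TensorFlow so the ITEX plugin can pick it up.
# The value is interpreted in MB per the variable name; 16384 is only an example.
os.environ["ITEX_LIMIT_MEMORY_SIZE_IN_MB"] = "16384"

import tensorflow as tf  # noqa: E402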
According to this invalid-argument error, the matmul op failed its input shape check during initialization. May I ask whether the shapes of In[0] and In[1] are reasonable? Or did they only become abnormal after you set ITEX_LIMIT_MEMORY_SIZE_IN_MB?
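One quick way to sanity-check this is to log the operand shapes just before the failing matmul; a minimal sketch with placeholder tensor names:

import tensorflow as tf

def checked_matmul(a, b):
    # In[0]: [3, 0] vs In[1]: [100, 400] fails because the inner dimensions
    # (0 and 100) do not match; a zero-sized dimension usually points at an
    # empty batch or a slicing bug upstream.
    tf.print("In[0] shape:", tf.shape(a), "In[1] shape:", tf.shape(b))
    return tf.matmul(a, b)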
@nedo99
We found that your model works normally in graph mode, while eager mode appears to have a bug in the bfc_allocator, which is still under investigation. To use graph mode, simply add tf.compat.v1.disable_eager_execution() to your model code. We will update the status of this issue once the eager-mode bug is resolved.
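Concretely, the workaround looks like this; a minimal sketch in which the model is a placeholder, and the only requirement is that the call runs before any model or graph is built:

import tensorflow as tf

# Fall back to TF1-style graph mode to avoid the eager-mode bfc_allocator bug.
tf.compat.v1.disable_eager_execution()

# Placeholder model; the rest of the training script stays unchanged.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
model.compile(optimizer="adam", loss="mse")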
tf.compat.v1.disable_eager_execution() fixed the issue for now. Thanks!
Hi,
I am trying to run an example with model training, but I am getting the following issue:
Environment: Intel Arc 770 16GB, Ubuntu 22.04, oneAPI 2023.2, Intel AI Analytics Toolkit 2023.2
Any idea?
Regards, Nedim