Intel® Extension for TensorFlow*

Dst tensor is not initialized #46

Closed · nedo99 closed this issue 1 year ago

nedo99 commented 1 year ago

Hi,

I'm trying to run an example that trains a model, but I'm getting the following error:

```
E itex/core/devices/bfc_allocator.cc:101] Allocator ran out of memory trying to allocate 10529419660 Bytes (rounded to 10529419776 Bytes)
If you need help, create an issue at https://github.com/intel/intel-extension-for-tensorflow/issues

---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
Cell In[36], line 1
----> 1 model.fit(x,y, batch_size=32, epochs=10)

File ~/.conda/envs/oneapi_tensorflow/lib/python3.9/site-packages/keras/src/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)
     67     filtered_tb = _process_traceback_frames(e.__traceback__)
     68     # To get the full stack trace, call:
     69     # `tf.debugging.disable_traceback_filtering()`
---> 70     raise e.with_traceback(filtered_tb) from None
     71 finally:
     72     del filtered_tb

File ~/.conda/envs/oneapi_tensorflow/lib/python3.9/site-packages/tensorflow/python/framework/constant_op.py:98, in convert_to_eager_tensor(value, ctx, dtype)
     96     dtype = dtypes.as_dtype(dtype).as_datatype_enum
     97 ctx.ensure_initialized()
---> 98 return ops.EagerTensor(value, ctx.device_name, dtype)

InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:XPU:0 in order to run _EagerConst: Dst tensor is not initialized.
```

Environment:

- Intel Arc A770 16GB
- Ubuntu 22.04
- oneAPI 2023.2
- Intel AI Analytics Toolkit 2023.2

Any idea?

Regards, Nedim

guizili0 commented 1 year ago

@nedo99 From your log, it ran out of memory; the Arc A770 only has 16GB of memory. Can you try reducing the batch size in your model? Thanks.
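For example, a minimal sketch of this suggestion, reusing the `model`, `x`, and `y` names from the snippet in the report above:

```python
# Minimal sketch of the suggestion above: lower batch_size so each training
# step's activations fit into the Arc A770's 16GB of device memory.
# `model`, `x`, and `y` are assumed to be defined as in the original report.
model.fit(x, y, batch_size=8, epochs=10)  # was batch_size=32 (initially 128)
```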

nedo99 commented 1 year ago

> @nedo99 From your log, it ran out of memory; the Arc A770 only has 16GB of memory. Can you try reducing the batch size in your model? Thanks.

Yes, it has 16GB, but per the log it is only trying to allocate about 10 GB. Also, the batch size is not the issue here: whatever value I set, I get the same error with the same number of bytes. A batch size of 32 is relatively small anyway; the initial value was 128.
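One possible reading of the fixed ~10 GB allocation: the traceback above goes through `convert_to_eager_tensor`, which suggests the whole `x`/`y` arrays are being materialized as a single device constant regardless of `batch_size`. A hedged sketch, assuming `x` and `y` are large in-memory NumPy arrays, of feeding them through `tf.data` so only individual batches are transferred to the device:

```python
import tensorflow as tf

# Sketch under the assumption that x and y are large in-memory arrays:
# slice them host-side with tf.data instead of letting model.fit copy the
# whole arrays to the XPU as one ~10 GB constant.
dataset = (
    tf.data.Dataset.from_tensor_slices((x, y))
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
model.fit(dataset, epochs=10)
```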

guizili0 commented 1 year ago

> 10529419660

Do you mean you have a tensor that needs 10GB of memory?

nedo99 commented 1 year ago

> > 10529419660
>
> Do you mean you have a tensor that needs 10GB of memory?

Whatever batch size or model size I use, I still get the same allocation size and the same error. There is a note about a 4GB maximum allocation here: https://github.com/intel/compute-runtime/blob/master/programmers-guide/ALLOCATIONS_GREATER_THAN_4GB.md. I tried to override it with the ITEX_LIMIT_MEMORY_SIZE_IN_MB environment variable, but then I get this error:

```
W itex/core/utils/op_kernel.cc:355] ./itex/core/kernels/common/matmul_op.h: 385Invalid argument: Matrix size-incompatible: In[0]: [3,0], In[1]: [100,400]
terminate called after throwing an instance of 'dnnl::error
```
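For reference, a minimal sketch of how the variable was set; that it must be exported before TensorFlow/ITEX is imported is my assumption, and the 20480 value is only an example:

```python
import os

# Assumption: ITEX reads this variable at initialization, so set it before
# tensorflow / intel-extension-for-tensorflow are imported.
# ITEX_LIMIT_MEMORY_SIZE_IN_MB comes from this thread; 20480 MB is an example.
os.environ["ITEX_LIMIT_MEMORY_SIZE_IN_MB"] = "20480"

import tensorflow as tf  # noqa: E402
```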
cboss6 commented 1 year ago

> > > 10529419660
> >
> > Do you mean you have a tensor that needs 10GB of memory?
>
> Whatever batch size or model size I use, I still get the same allocation size and the same error. There is a note about a 4GB maximum allocation here: https://github.com/intel/compute-runtime/blob/master/programmers-guide/ALLOCATIONS_GREATER_THAN_4GB.md. I tried to override it with the ITEX_LIMIT_MEMORY_SIZE_IN_MB environment variable, but then I get this error:
>
> ```
> W itex/core/utils/op_kernel.cc:355] ./itex/core/kernels/common/matmul_op.h: 385Invalid argument: Matrix size-incompatible: In[0]: [3,0], In[1]: [100,400]
> terminate called after throwing an instance of 'dnnl::error
> ```

According to this invalid-argument error, the matmul op failed its input-shape check during initialization. May I ask whether the shapes of In[0] and In[1] are reasonable? Or did they become abnormal after you set ITEX_LIMIT_MEMORY_SIZE_IN_MB?

cboss6 commented 1 year ago

@nedo99 We found that your model works normally in graph mode, while eager mode seems to have a bug in bfc_allocator, which is still under investigation. To use graph mode, simply add `tf.compat.v1.disable_eager_execution()` to your model code. We will update this issue once the eager-mode bug is resolved.
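A minimal sketch of where the call goes; the model-building code below is a placeholder:

```python
import tensorflow as tf

# Workaround from this thread: fall back to graph mode so the eager-mode
# bfc_allocator bug is not hit. Call this before building or training the model.
tf.compat.v1.disable_eager_execution()

# ... build `model` and prepare `x`, `y` as before (placeholders) ...
# model.fit(x, y, batch_size=32, epochs=10)
```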

nedo99 commented 1 year ago

`tf.compat.v1.disable_eager_execution()` fixed the issue for now. Thanks!