intel / torch-xpu-ops


Proposal: Switch to safer data_ptr API #654

Closed · Stonepia closed this 1 month ago

Stonepia commented 2 months ago

🐛 Describe the bug

TL;DR

  1. We should use the safer data_ptr accessor APIs. The newer template APIs perform additional checks. Where possible, use tensor.const_data_ptr<T>() and tensor.mutable_data_ptr<T>(), and stop using (scalar_t*)tensor.data_ptr().
  2. Stop using char* for scratch buffers. Either use int*, or use the allocator directly for buffer access.

1. Use the template data_ptr API for tensors

1.1 Existing code usage and its problem

Existing code uses raw pointers like this:

// non-template usage. Avoid using this.
kernel_call(
    (scalar_t*)tensor.data_ptr());

The above usage doesn't check whether the tensor is initialized. When the tensor's storage is not initialized (for example, with a FakeTensor), data_ptr() returns null, and the kernel may then write to uninitialized memory, causing a page fault.
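To make the failure mode concrete, here is a minimal sketch (write_ones and unsafe_launch are hypothetical stand-ins for a real kernel and its launch wrapper, not code from this repository):

#include <ATen/ATen.h>

// Hypothetical kernel body: writes through a raw pointer.
void write_ones(float* dst, int64_t n) {
  for (int64_t i = 0; i < n; ++i) {
    dst[i] = 1.0f; // faults if dst is null
  }
}

void unsafe_launch(at::Tensor& out) {
  // No storage check here: if out's storage is uninitialized, ptr is
  // null, and the failure only surfaces inside the kernel, far from
  // the call site that produced the bad pointer.
  auto* ptr = (float*)out.data_ptr();
  write_ones(ptr, out.numel());
}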

1.2 template data_ptr

The template data_ptr performs additional checks, such as has_storage() and storage_initialized(). Please refer to PyTorch's TensorImpl.h for details.

In view of that, old call sites should migrate to the safer template form:

// template usage. Safer than the raw cast, but see 1.3 for the preferred APIs.
input_.data_ptr<scalar_t>();

1.3 New API: mutable_data_ptr() and const_data_ptr()

PyTorch has introduced new APIs to replace tensor.data_ptr(). Their usage is the same as the template data_ptr, and their meaning is just what the names suggest: const_data_ptr<T>() for read-only access, mutable_data_ptr<T>() when the tensor will be written.

// Use this
input_.const_data_ptr<scalar_t>();
input_.mutable_data_ptr<scalar_t>();
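As a fuller migration sketch, consider a hypothetical dispatch-and-launch helper (copy_kernel, launch_copy, and the tensor names are illustrative, not from torch-xpu-ops). Reads go through const_data_ptr and writes through mutable_data_ptr, so an uninitialized storage fails loudly instead of handing the kernel a null pointer:

#include <ATen/ATen.h>
#include <ATen/Dispatch.h>

// Hypothetical element-wise kernel.
template <typename scalar_t>
void copy_kernel(const scalar_t* src, scalar_t* dst, int64_t n) {
  for (int64_t i = 0; i < n; ++i) {
    dst[i] = src[i];
  }
}

void launch_copy(const at::Tensor& input, at::Tensor& output) {
  AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "copy_kernel", [&] {
    // Old: copy_kernel((scalar_t*)input.data_ptr(), (scalar_t*)output.data_ptr(), n);
    copy_kernel<scalar_t>(
        input.const_data_ptr<scalar_t>(),
        output.mutable_data_ptr<scalar_t>(),
        input.numel());
  });
}

A side benefit is that const_data_ptr documents intent: the compiler will reject accidental writes through the input pointer.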

2. Avoid using char* data pointers for buffers

The existing code has the following pattern, for example in this snippet from torch-xpu-ops/...Reduce.h:

Tensor buffer;
Tensor semaphores;

buffer = at::empty(
    config.global_memory_size(),
    at::TensorOptions().dtype(kChar).device(kXPU));
semaphores = at::empty(
    config.semaphore_size(),
    at::TensorOptions().dtype(kChar).device(kXPU));

at::detail::Array<char*, 1> data;
data[0] = (char*)semaphores.data_ptr();
buf_ptr = buffer.defined() ? (void*)buffer.data_ptr() : nullptr;

The template tensor.data_ptr<T>() does not support the char type, so we should avoid this pattern. We can align with PyTorch's own usage in either of the following ways:

2.1 Use int* instead of char*

For example, PyTorch has the following code in Normalization.cuh:

at::Tensor semaphores;
semaphores = at::zeros({grid.x}, input.options().dtype(at::kInt));

int* semaphores_ptr = grid.y > 1 ? semaphores.mutable_data_ptr<int>() : nullptr;

However, this is still not best practice: it constructs a whole tensor just to use the memory inside it. We could instead use an allocator directly; see section 2.2.

2.2 Use the allocator directly.

For example, consider the code snippet in PyTorch's Reduce.cuh:

at::DataPtr buffer;
at::DataPtr semaphores;

if (config.should_global_reduce()) {
    auto& allocator = *c10::cuda::CUDACachingAllocator::get();
    buffer = allocator.allocate(config.global_memory_size());
    semaphores = allocator.allocate(config.semaphore_size());
    // ...
}

auto reduce_kernel = ReduceOp<scalar_t, ops_t, uint32_t, out_scalar_t, vt0>(
    // ... other arguments elided ...
    buffer.get(),
    (int*)semaphores.get());

It uses the allocator directly, rather than constructing a tensor just to hold the buffer; I believe this usage could also bring some performance gain.
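For torch-xpu-ops, here is a sketch of the same pattern on XPU, assuming the device allocator is reachable through the generic c10::GetAllocator registry (the repository may expose a more specific caching-allocator handle instead; allocate_reduce_scratch is an illustrative name):

#include <cstddef>
#include <c10/core/Allocator.h>
#include <c10/core/DeviceType.h>

void allocate_reduce_scratch(size_t global_memory_size, size_t semaphore_size) {
  c10::Allocator* allocator = c10::GetAllocator(c10::DeviceType::XPU);
  // DataPtr owns the allocation and frees it when it goes out of scope.
  c10::DataPtr buffer = allocator->allocate(global_memory_size);
  c10::DataPtr semaphores = allocator->allocate(semaphore_size);
  // buffer.get() / (int*)semaphores.get() yield raw device pointers for
  // the kernel, as in the Reduce.cuh snippet above.
}

This mirrors the CUDA snippet while skipping the kChar tensor construction entirely.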

Stonepia commented 2 months ago

For the existing code, I changed the first part in PR https://github.com/intel/torch-xpu-ops/pull/655. The second part is not solved yet.