taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

Add dlpack as the backend for copyless tensor operations between Taichi and different frameworks #4534

Open HighCWu opened 2 years ago

HighCWu commented 2 years ago

Concisely describe the proposed feature

I hope that Taichi can add a DLPack backend, so that other deep learning frameworks can operate on their own tensors through Taichi kernels, and Taichi can call other frameworks' operators to change the values of a Taichi field.

Describe the solution you'd like (if any)

I noticed that Taichi already provides functions like to_torch and from_torch to interoperate with torch. At first I assumed the implementation converted to a CPU numpy ndarray first, which would obviously not be OK. After checking again, I found that in Taichi's C++ implementation the pointer of a torch tensor is accessed directly through its data_ptr() function. However, this implementation seems to require the torch tensor's memory to be contiguous.
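For reference, here is a minimal sketch of what that direct-pointer path looks like from the user side (assuming a recent Taichi with ti.types.ndarray kernel arguments and a CUDA build of torch; the exact annotation name may differ between Taichi versions):

import taichi as ti
import torch

ti.init(arch=ti.cuda)

@ti.kernel
def fill(x: ti.types.ndarray()):
    # Writes go straight into the torch tensor's device memory, no copy.
    for i, j in x:
        x[i, j] = 1.0

t = torch.zeros(8, 4, device='cuda')  # contiguous CUDA tensor
fill(t)
print(t)  # all ones: the kernel modified the tensor in place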

In addition, the PaddlePaddle Hackathon that Taichi is participating in also hopes to enable tensor interoperability between Taichi and PaddlePaddle.

In fact, both PyTorch and PaddlePaddle have implemented from_dlpack and to_dlpack, as have MXNet and CuPy. It seems that TensorFlow's DLPack support is still experimental.

Implementing DLPack support in Taichi should not be complicated. I searched for DLPack-related code in PaddlePaddle's repo, and the number of files and lines involved is small.

I checked the PyTorch documentation about DLPack and found that the newer DLPack specification no longer recommends using the to_dlpack function: objects that implement the __dlpack__ member function can be passed to from_dlpack directly.
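For example, in PyTorch (assuming version 1.10 or later, which exposes torch.from_dlpack) both paths exist, and the protocol-based one no longer needs an explicit capsule:

import torch
from torch.utils import dlpack

x = torch.ones(2, 3)

# Legacy path: create an explicit "dltensor" capsule, then consume it.
cap = dlpack.to_dlpack(x)
y = dlpack.from_dlpack(cap)

# Protocol path: x implements __dlpack__/__dlpack_device__, so it can be
# handed to from_dlpack directly; no intermediate capsule is exposed.
z = torch.from_dlpack(x)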

I recommend that Taichi implement these DLPack interfaces (a from_dlpack entry point, plus the protocol for the reverse direction); see the usage sketch under Additional comments below.

Additional comments

It should work like this:

import paddle
import taichi as ti

x = paddle.ones([8, 4])
print(x)
dlpack = paddle.utils.dlpack.to_dlpack(x)
print(dlpack)

x_taichi = ti.utils.dlpack.from_dlpack(dlpack)
@ti.kernel
def paint(t: float):
    for i, j in x_taichi:  # Parallelized over all elements
        x_taichi[i, j] = t
paint(0.0)

print(x)

It should output:

Tensor(shape=[8, 4], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
       [[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])
<capsule object "dltensor" at 0x7f6103c681b0>
Tensor(shape=[8, 4], dtype=float32, place=CUDAPlace(0), stop_gradient=True,
       [[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])

It seems that, with the help of DLPack, we could modify the values of a Paddle tensor from within Taichi.

It should be the same in reverse.
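For instance, the export direction might look like this (purely a hypothetical sketch: ti.utils.dlpack.to_dlpack is only the name used in this proposal and does not exist in Taichi today):

import paddle
import taichi as ti

ti.init(arch=ti.cuda)
x_taichi = ti.ndarray(ti.f32, shape=(8, 4))

# Hypothetical export: wrap the Taichi ndarray's device buffer in a
# DLPack capsule without copying, then hand it to Paddle as a view.
dlpack = ti.utils.dlpack.to_dlpack(x_taichi)
x_paddle = paddle.utils.dlpack.from_dlpack(dlpack)
print(x_paddle)  # should alias the same GPU memory as x_taichi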

Although PaddlePaddle's hackathon offers bounties for implementing similar functionality, I am not good enough at this kind of programming to participate in the event, so I will only put forward my requests and ideas here.

strongoier commented 2 years ago

cc: @bobcao3 @qiao-bo

k-ye commented 2 years ago

Thanks for suggesting this; it looks like a very interesting solution! I also noticed that TVM adopts DLPack.

Just for my own understanding:

this implementation seems to require the torch tensor's memory to be contiguous.

Does this mean that DLPack tensors are not limited by this?

Also, what happens if torch and Taichi are not using the same device, e.g. CPU torch + CUDA Taichi? Would DLPack handle the data synchronization automatically? (You're right that Taichi doesn't really do a D2H copy for torch tensors, provided that they are on the same device.)

HighCWu commented 2 years ago

In the documentation of dlpack, they mention:

Semantics

DLPack describes the memory layout of strided, n-dimensional arrays. When a user calls y = from_dlpack(x), the library implementing x (the “producer”) will provide access to the data from x to the library containing from_dlpack (the “consumer”). If possible, this must be zero-copy (i.e. y will be a view on x). If not possible, that library may make a copy of the data. ...

I haven't tried it, but according to the description, it seems that zero copy is not possible in some cases.

But when I run some simple code in PyTorch, it seems DLPack supports different memory layouts:

>>> import torch
>>> a = torch.randn(2,3,3).cuda()
>>> b = a.permute(1,2,0)
>>> torch.utils.dlpack.to_dlpack(b)
<capsule object "dltensor" at 0x7fc9130b14b0>
>>> c = torch.utils.dlpack.to_dlpack(b)
>>> d = torch.utils.dlpack.from_dlpack(c)
>>> a
tensor([[[ 0.8432,  0.9891, -0.0379],
         [ 0.2012,  0.8790,  0.6043],
         [ 2.3488,  0.3679, -0.6114]],

        [[ 0.2548,  0.6476, -1.0119],
         [-0.5102, -2.1939, -0.6964],
         [-0.0561, -1.5649,  0.1348]]], device='cuda:0')
>>> d[:] = 0
>>> a
tensor([[[0., 0., 0.],
         [0., 0., 0.],
         [0., 0., 0.]],

        [[0., 0., 0.],
         [0., 0., 0.],
         [0., 0., 0.]]], device='cuda:0')

The changes to tensor d are synced to tensor a.

It's not clear to me whether this example is really a case where the memory layouts differ, but at least it appears to work.
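One could double-check in the same session that b is indeed non-contiguous and that the round trip returns a view rather than a copy (I have not run these exact lines, so take the outputs as an assumption):

>>> b.is_contiguous()             # permute() produces a strided view
False
>>> d.stride()                    # strides survive the DLPack round trip
(3, 1, 9)
>>> d.data_ptr() == b.data_ptr()  # same underlying storage, zero copy
True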

On DLPack's page, they also mention:

  • Designed for cross hardware: CPU, CUDA, OpenCL, Vulkan, Metal, VPI, ROCm, WebGPU, Hexagon
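The protocol itself also reports which device the memory lives on. For instance, with the CUDA tensor b from the earlier snippet (again untested on my side, just how I understand the spec):

>>> dev_type, dev_id = b.__dlpack_device__()
>>> int(dev_type), dev_id   # 2 is DLPack's CUDA device type
(2, 0)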

I believe they handle cross-device conversion internally, but we need to experiment, e.g. data-manipulation experiments between CPU MXNet and CUDA PyTorch. I see that DLPack-related code has also been merged into NumPy's repo; I'm not sure whether DLPack is already available in the latest NumPy.
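If the NumPy support has indeed landed (I believe it was added around NumPy 1.22, but that is an assumption), a minimal CPU-side check could look like:

import numpy as np
import torch

a = np.zeros((2, 3), dtype=np.float32)
t = torch.from_dlpack(a)  # NumPy arrays implement __dlpack__ on CPU
t[0, 0] = 1.0
print(a[0, 0])            # 1.0 -- the torch tensor is a view, not a copy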

I don't have much time to configure the environment, so I hope someone can test these DLPack features.