taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

Irregular Multi-dimensional Array Support #3087

Open ifsheldon opened 2 years ago

ifsheldon commented 2 years ago

Concisely describe the proposed feature
I would like a feature that allows irregular multi-dimensional arrays, i.e. arrays with shape (f0, f1, f2, ..., fn, v1, v2, ..., vm), where the f dimensions are fixed and the v dimensions are variable. For example, if I need an array of matrices of different shapes, I need a field of shape (n, v1, v2), where n is the number of matrices and v1, v2 differ for each matrix in the array.

Describe the solution you'd like (if any)
I think one possible solution is to take advantage of sparse fields. I'm not very familiar with sparse fields in Taichi, but an ad-hoc workaround would look something like this:

# suppose we have 2 matrices, one of shape (15, 16) and another of shape (17, 18)
import taichi as ti
import torch

ti.init(arch=ti.cpu)

a = torch.randn(15, 16)
b = torch.randn(17, 18)
max_dim = 18
block_size = 4
block_num = (max_dim + block_size - 1) // block_size  # ceiling division: blocks needed to cover max_dim
matrix_array = ti.field(ti.f32)
# one dense axis for the matrix index, then pointer-indexed blocks of block_size x block_size entries
matrix_array_block = ti.root.dense(ti.i, 2).pointer(ti.axes(1, 2), (block_num, block_num)).dense(ti.axes(1, 2), (block_size, block_size))
matrix_array_block.place(matrix_array)
# then how can we load a, b into this sparse field?
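
One possible way to load the data, given the layout above, is to pass the torch tensors into a kernel as external array arguments and copy element-wise; writing to an element of a pointer-based field activates the containing block on demand. A minimal sketch (the kernel name and the ti.types.ndarray argument style are my additions, not from the original post):

# hedged sketch: copy torch tensor `mat` into slice i of the sparse field;
# writing matrix_array[i, r, c] activates the pointer block containing (r, c)
@ti.kernel
def load_matrix(i: ti.i32, mat: ti.types.ndarray()):
    for r, c in ti.ndrange(mat.shape[0], mat.shape[1]):
        matrix_array[i, r, c] = mat[r, c]

load_matrix(0, a)
load_matrix(1, b)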

Additional comments
I need this because in my particle simulations I have thousands of particles, each of which carries a matrix of varying shape. The shapes can be computed in advance, so I can allocate enough memory to store them, but unfortunately the shapes differ from particle to particle.

bobcao3 commented 2 years ago

How vastly different will the matrices be in terms of size?

ifsheldon commented 2 years ago

Very different, I would say, because the sizes depend on simulation configurations and user inputs, so I would rather not pad everything to the largest case just to cover the extremes.

bobcao3 commented 2 years ago

We do have sparse matrix support planned, but I don't think that would be the best choice here. A reasonable choice is to use pointers, but there is a complexity problem: each SNode has to be translated into a memory layout, and if the sizes vary a lot, we would need a dynamic layout behind the pointer to avoid compiling many different memory layouts.
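
For the one-dimensional case, Taichi's existing dynamic SNode already provides a variable-length list per fixed index, which hints at what a dynamic layout behind a pointer could look like. A minimal sketch for illustration only (the field names and sizes are mine):

# illustrative only: a variable-length list of values per row via the `dynamic` SNode
import taichi as ti

ti.init(arch=ti.cpu)

x = ti.field(ti.i32)
lists = ti.root.dense(ti.i, 4).dynamic(ti.j, 64, chunk_size=8)  # up to 64 entries per row
lists.place(x)

@ti.kernel
def fill():
    for i in range(4):
        for j in range(i + 1):           # row i gets i + 1 entries
            ti.append(x.parent(), i, j)  # append activates storage on demand

fill()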

bobcao3 commented 2 years ago

Do you have a specific paper or an algorithm so we can look at the potential ways to support these cases?

ifsheldon commented 2 years ago

I'm not sure whether pointers will lead to a loss of locality, though I did use them as a workaround to save memory compared to padding. By loss of locality I mean the following: suppose, for example, I have two sequences of different lengths.

import taichi as ti
import torch

ti.init(arch=ti.cpu)

a = torch.randn(4)
b = torch.randn(5)
max_dim = 5
block_size = 2
block_num = (max_dim + block_size - 1) // block_size  # 3 blocks cover the longest sequence
matrix_array = ti.field(ti.f32)
matrix_array_block = ti.root.dense(ti.i, 2).pointer(ti.j, block_num).dense(ti.j, block_size)
matrix_array_block.place(matrix_array)

# 5 blocks are active in total (2 for a, 3 for b)
# memory location low ------------------------------------> high
# ideally the layout should be: [a0, a1], [a2, a3], [b0, b1], [b2, b3], [b4, (inactive pad)]
# but with pointers, the blocks "can" end up arranged like:
# [a2, a3], [a0, a1], [b2, b3], [b4, (inactive pad)], [b0, b1]

I don't know how the data behind pointers is actually laid out, but the example above should make my point clear. I used pointers to store large irregular matrices of different shapes as small regular patches, but I then lost control over how those patches are laid out in memory, which hurts cache hits. So we need an API that tells the Taichi compiler it may break these irregular tensors into small patches, but that it should place those patches close together.
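
As a point of comparison, one workaround that keeps full control over locality, at the cost of manual index arithmetic, is to pack all irregular data contiguously into one flat dense field with a per-item offset table, similar to a CSR layout. A minimal sketch under that assumption (the field and kernel names are mine, not from the thread):

# hypothetical CSR-style packing: all sequences live contiguously in one flat field,
# and offsets[i] tells where item i starts, so locality is fully explicit
import taichi as ti

ti.init(arch=ti.cpu)

lengths = [4, 5]                        # per-item lengths, known in advance
total = sum(lengths)

data = ti.field(ti.f32, shape=total)    # packed storage: [a0..a3, b0..b4]
offsets = ti.field(ti.i32, shape=len(lengths) + 1)

acc = 0
for i, n in enumerate(lengths):
    offsets[i] = acc
    acc += n
offsets[len(lengths)] = acc

@ti.kernel
def scale_item(i: ti.i32, s: ti.f32):
    # iterate only over the elements that belong to item i
    for j in range(offsets[i], offsets[i + 1]):
        data[j] *= s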

The only related work I know of is NestedTensor, a PyTorch library that deals with this problem to some degree, but it does not support autograd. PackedSequence in PyTorch also addresses it to some degree in the context of RNNs.

I think one advantage of Taichi is that autodiff should work regardless of field shapes; then the only thing left to take care of is how to lay out these irregular tensors in a performant way.

I think a solution to this problem could be an academic contribution. One strong use case, mentioned in the introduction of NestedTensor, is batched convolution on images of very different resolutions. Although we can do this easily in raw CUDA, by just allocating memory and using pointers, that introduces many inefficient memory accesses, as you GPU experts can see, and with raw CUDA we get no autodiff.