taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0
25.51k stars 2.28k forks source link

Any parallel sorting on the GPU with Taichi or thirdparty ? #3764

Open GrapixLeGrand opened 2 years ago

GrapixLeGrand commented 2 years ago

Hello,

I could not find a thread (nor in the docs) that would cover this subject, therefore I created one. Is there any parallel sorting algorithm in taichi? I'm looking for a radix/counting sort. Is there a known taichi implementation of a radix/counting sort out there? I was considering writing my own following this paper. However, according to the doc (under the atomics section), the atomic assign operation is missing and it would be needed for step 4 of the algorithm b[c[r[a[i]]]++] = a[i];.

Best, Quentin

qiao-bo commented 2 years ago

Hi Quentin, I am not aware of an existing GPU-accelerated sorting algo in Taichi, perhaps other people can comment on this.

Regarding to the atomic assign operation, would this work (doc)?

b[ti.atomic_add(c[b[a[i]]], 1)] = a[i]
GrapixLeGrand commented 2 years ago

Thanks for the reply. I'll give it a try but in fact the paper that I cited proposed a solution without the atomic, do you know how to get the maximum amount of thread? It would replace b[ti.atomic_add(c[b[a[i]]], 1)] = a[i] by: Capture_ok

If this is not possible then I'll try the atomic. What you proposed seems very reasonable, oddly the paper says that you need two atomics operations to perform this. This is wrong isn't it? Am I mistaken? Best, Quentin

GrapixLeGrand commented 2 years ago

Hello back,

It worked for the atomics, thanks! Now, do you know how to get the max number of global threads? Best, Quentin

qiao-bo commented 2 years ago

Hi,

I need to take a deeper look into the paper, but I think Taichi hasn't exposed the max num of global threads. A Taichi kernel may be compiled to e.g., multiple CUDA kernels and they could have different number of threads. But this is roughly how we calculated (code link):

grid_dim = min(#SMs * 32, num_threads_needed) // see above code link
block_dim = 128 by default or specified by ti.block_dim() otherwise

This may not help much, we do have a global_thread_idx() available.

qiao-bo commented 2 years ago

We do think a sort intrinsic can be useful for many Taichi users. We are considering add some parallel intrinsics to the language. Let us know what else you find useful ;)

turbo0628 commented 2 years ago

We do think a sort intrinsic can be useful for many Taichi users. We are considering add some parallel intrinsics to the language. Let us know what else you find useful ;)

Is that possible to add some hooks to for external libraries? Many such math functions / permutation utilities are already implemented in vendor libraries, CUB for example.

qiao-bo commented 2 years ago

We do think a sort intrinsic can be useful for many Taichi users. We are considering add some parallel intrinsics to the language. Let us know what else you find useful ;)

Is that possible to add some hooks to for external libraries? Many such math functions / permutation utilities are already implemented in vendor libraries, CUB for example.

Very good point! CUB exposes very efficient permutation utilities, and it's hard to be better ;). Although in Taichi we want to avoid Vendor lock-in, having some external libs could still be a good start...

turbo0628 commented 2 years ago

Very good point! CUB exposes very efficient permutation utilities, and it's hard to be better ;). Although in Taichi we want to avoid Vendor lock-in, having some external libs could still be a good start...

Integration of any platform-specific library would definitely make Taichi less consistent, especially when switching among hardware backends. I totally agree that vendor lock-in should be avoided for Taichi, especially in the current stage. The alternative approach is to mimic vendor library with some consistent primitives. However, it has already been proven to be a hard task in TVM. The TVM community recently adopts the BYOC mechanism for different vendor libraries. They are now actively working to bring in CUTLASS, oneDNN and ArmComputeLibrary.

If we widen the concept of vendor libraries, they are actually equivalent with user-written codes. The second problem raises here: how to reasonably hybridize compiler generated code with user-written / vendor code? Those codes are probably to be efficient but they are just black boxes for the compiler. I don't think there's already an efficient and elegant solution, hope to see that happen in Taichi :)

bcolloran commented 2 years ago

@GrapixLeGrand following your work here with interest! If you get a working implementation, please post details back here if you are able to share :-)

k-ye commented 2 years ago

the BYOC mechanism for different vendor libraries. They are now actively working to bring in CUTLASS, oneDNN and ArmComputeLibrary.

This sounds like an interesting direction!

FYI we have some basic experiments regarding how to integrate with vendor/external libs. One example is the Sparse Matrix feature on top of Eigen (https://github.com/taichi-dev/taichi/pull/2792). Another one being @squarefk 's external LLVM bitcode call support (https://github.com/taichi-dev/taichi/pull/2873). Clearly, these are prototypes, rather than a systematic solution.

Realistically speaking, I don't think we should strive for a completely vendor-free implementation here, or the dev/maintainence cost is gonna be too high :-( I believe either the BYOC or provide a flexible extension system to wrap these vendor libs are worth a shot..

AmesingFlank commented 2 years ago

Implemented a version of odd-even parallel sort at #3790.

FantasyVR commented 2 years ago

Hi @GrapixLeGrand. We decide to make more effort on GPU sorting recently. Though we know users prefer to use counting sort/radix sort, we want to know if our odd-even sorting meets your need? If you can share your sorting implementation here, we are willing to help and provide some support in Taichi.