triton-lang / triton

Development repository for the Triton language and compiler
https://triton-lang.org/

Using Triton to develop a molecular dynamics simulation package? #1084

Closed zhenyuwei99 closed 1 year ago

zhenyuwei99 commented 1 year ago

Hi, I am a student who uses molecular dynamics (MD) simulation frequently, which is a typical compute-bound task. I have tried to use numba.cuda and cupy to develop a Python package for conducting MD simulations.

I am attracted by the concise language and the use of user-defined data structures shown in the documentation. However, I notice that all the tutorials focus on DL tasks. Is it possible to use Triton to replace numba.cuda?

The main job of an MD simulation is to calculate the distance between neighboring particles. This involves two steps: reading each particle's neighbor indices from one array, then using those indices to load positions from another array and compute the distances.

I need to optimize the CUDA kernel to ensure memory-loading efficiency in this job, e.g., memory coalescing. I am wondering whether Triton will give good performance on this occasion.
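
For concreteness, here is a minimal sketch of such a pair-distance kernel in Triton. The kernel name, the pair-index arrays, and the flat (n_particles, 3) float32 position layout are all assumptions for illustration, not an established interface:

import triton
import triton.language as tl

@triton.jit
def pair_distance_kernel(pos_ptr, pair_i_ptr, pair_j_ptr, dist_ptr,
                         n_pairs, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles BLOCK_SIZE particle pairs.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_pairs
    # Step 1: read the particle indices of each pair.
    i = tl.load(pair_i_ptr + offsets, mask=mask, other=0)
    j = tl.load(pair_j_ptr + offsets, mask=mask, other=0)
    # Step 2: gather the (x, y, z) coordinates through those indices
    # and accumulate the squared differences.
    acc = tl.zeros([BLOCK_SIZE], dtype=tl.float32)
    for d in range(3):  # unrolled when the kernel is traced
        pi = tl.load(pos_ptr + i * 3 + d, mask=mask, other=0.0)
        pj = tl.load(pos_ptr + j * 3 + d, mask=mask, other=0.0)
        acc += (pi - pj) * (pi - pj)
    tl.store(dist_ptr + offsets, tl.sqrt(acc), mask=mask)

The two loads through i and j are gathers, so how well they coalesce depends on how the neighbor list is ordered, not on Triton.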

Jokeren commented 1 year ago

I am a student who uses molecular dynamics (MD) simulation frequently, which is a typical compute-bound task

Have you tried using NAMD and LAMMPS? I have found that LAMMPS's stencil-like kernels are generally well-optimized, and NAMD has incorporated many cool optimizations to keep data on the GPU. Before moving forward, it may be helpful for you to confirm a few things:

  1. Whether or not the existing packages reach the performance upper bound.
  2. Whether you require fp64 dots in the kernels you're interested in.
  3. Whether you need indexing operations, since these are not yet supported by Triton.
zhenyuwei99 commented 1 year ago

Thanks for your reply!

For the questions:

  1. I think both NAMD and LAMMPS perform well, and NAMD is well optimized on the GPU. However, it is hard to use these two packages as a toolbox for a new kind of simulation, as both of them work like a compiler: they read a rigorously defined input script and produce output. Researchers have to touch complicated C++ code whenever they try something new. So I hope there can be a Python package (not just a Python API) for MD.
  2. In most cases, fp32 is enough.
  3. I am not sure which kind of indexing operation you mean. In many cases, the kernel reads indices from one array and uses them to load data from another array (a gather). But there are also optimized kernels that read the data contiguously.
Jokeren commented 1 year ago
  1. Please refer to https://github.com/openai/triton/issues/974 and https://github.com/openai/triton/issues/1023

We currently don't support the following case:

x = tl.load(x_ptr + offsets, mask=mask)
idx = tl.load(idx_ptr + offsets, mask=mask)
output = tl.zeros([BLOCK_SIZE], dtype=tl.float32)
for i in range(0, BLOCK_SIZE):
    output[i] = x[idx[i]]  # per-element indexing of a Triton tensor: unsupported
zhenyuwei99 commented 1 year ago

So it seems we must use a concrete integer as the index of a tensor; an expression like idx[i] is not clear to the compiler.

Here I have two questions about Triton:

  1. I am wondering if the main idea of Triton is to optimize memory loading and writing efficiency, via tl.load and tl.store, to increase code performance. By doing this, we don't need to pay too much attention to memory optimization and can keep the task in the compute-bound region.
  2. How does Triton treat a for loop in the kernel function? I think this has a great effect on the code performance.

Jokeren commented 1 year ago

So it seems we must use a concrete integer as the index of a tensor.

This is not accurate. Maybe I didn't provide clear examples.

Note that there's a workaround for the above example:

idx = idx.to(tl.int32)
output = tl.load(x_ptr + idx)

So indexing into a Triton tensor is not implemented yet, but there are sometimes workarounds.
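
Put into a complete kernel, the workaround might look like the following sketch (the kernel and pointer names are illustrative):

import triton
import triton.language as tl

@triton.jit
def gather_kernel(x_ptr, idx_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    # Load the indices, then use them in pointer arithmetic:
    # x_ptr + idx expresses the gather that x[idx[i]] cannot.
    idx = tl.load(idx_ptr + offsets, mask=mask, other=0).to(tl.int32)
    output = tl.load(x_ptr + idx, mask=mask, other=0.0)
    tl.store(out_ptr + offsets, output, mask=mask)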

Jokeren commented 1 year ago

I am wondering if the main idea of Triton is to optimize memory loading and writing efficiency, via tl.load and tl.store, to increase code performance. By doing this, we don't need to pay too much attention to memory optimization and can keep the task in the compute-bound region.

You can refer to the Triton paper linked in the README. There are many additional benefits, like a simple interface to tensor cores.

How does Triton treat a for loop in the kernel function? I think this has a great effect on the code performance.

A for loop is just a for loop. Though we may transform it a bit if there are dot operations.
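
For example, the K-loop of a matmul kernel is an ordinary for loop around tl.dot. This is a simplified sketch loosely following the matmul tutorial in the Triton docs; masks are omitted and M, N, K are assumed divisible by the block sizes:

import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                  BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
    # An ordinary for loop; the compiler may pipeline it, and tl.dot
    # is lowered onto tensor cores where the hardware supports it.
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + rm[:, None] * K + (k + rk)[None, :])
        b = tl.load(b_ptr + (k + rk)[:, None] * N + rn[None, :])
        acc += tl.dot(a, b)
    tl.store(c_ptr + rm[:, None] * N + rn[None, :], acc)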

zhenyuwei99 commented 1 year ago

A for loop is just a for loop. Though we may transform it a bit if there are dot operations.

But how is that done in parallel across all threads of a warp? Or is it better to write tensor operations so that the kernel is just serial code?

Jokeren commented 1 year ago

But how is that done in parallel across all threads of a warp? Or is it better to write tensor operations so that the kernel is just serial code?

Each kernel is composed of multiple blocks; you can view the operations within each block as serial, which is different from programming in CUDA.
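
This is essentially the vector-addition example from the Triton tutorials: there is no threadIdx anywhere, and the body reads as straight-line code over one block:

import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance ("block") owns BLOCK_SIZE elements; how
    # that work is mapped onto individual threads is left to the
    # compiler, unlike in CUDA.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)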

zhenyuwei99 commented 1 year ago

So Triton hides the detailed per-thread operations and provides higher-level APIs like dot to avoid unoptimized operations and increase performance?

Jokeren commented 1 year ago

That's one valid point; basically Triton is a DSL that is tuned for a specific domain.

If you want a more formal answer, maybe you could try ChatGPT :)

zhenyuwei99 commented 1 year ago

Thanks a lot for your detailed explanation! I think all my questions have been answered.