jerryzh168 closed this issue 2 years ago.
Oh, maybe it's because there is still no support for Tensor Cores? https://github.com/openai/triton/issues/275
Maybe the matrices are too small? There are microbenchmarks here https://github.com/openai/triton/blob/master/python/test/regression/test_performance.py and the perf should be pretty good.
Thanks for the quick response. Just updated the sizes; it's relatively small in the K dimension, do I need some special handling for that? I'll try benchmarking the implementation in https://github.com/openai/triton/blob/master/python/triton/ops/matmul.py to see if I can get better perf.
It improved slightly when I used the official matmul.py from triton/ops:
----------------------------------------------------------------
M N K Time(s) Rate(TF/s)
----------------------------------------------------------------
38400, 4096, 1024, 0.001486 216.826
38400, 4096, 1024, 0.002692 119.674
but it does not match typical speedup numbers in https://github.com/openai/triton/blob/master/python/test/regression/test_performance.py#L54-L71
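For reference, a minimal timing sketch of the comparison (assuming triton.ops.matmul accepts int8 inputs, as used elsewhere in this thread); the 2*M*N*K / time formula reproduces the Rate(TF/s) column above, e.g. 2*38400*4096*1024 / 0.001486 / 1e12 ≈ 216.8:

```python
import torch
import triton.ops

M, N, K = 38400, 4096, 1024

def bench(a, b, iters=100):
    # Warm up so compilation/autotuning is not part of the measurement.
    for _ in range(10):
        triton.ops.matmul(a, b)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        triton.ops.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    t = start.elapsed_time(end) / 1e3 / iters  # seconds per matmul
    print(f"{t:.6f} s, {2 * M * N * K / t / 1e12:.3f} TF/s")

a16 = torch.randn((M, K), device="cuda", dtype=torch.float16)
b16 = torch.randn((K, N), device="cuda", dtype=torch.float16)
a8 = torch.randint(-128, 127, (M, K), device="cuda", dtype=torch.int8)
b8 = torch.randint(-128, 127, (K, N), device="cuda", dtype=torch.int8)

bench(a16, b16)
bench(a8, b8)
```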
@jerryzh168 Can you share the kernel here? I think int8 matmul performs best when a is row-major and b is col-major (this also applies to cublas & cutlass).
Hi, I'm copy-pasting this kernel: https://github.com/openai/triton/blob/master/python/triton/ops/matmul.py and could not reproduce the perf results in https://github.com/openai/triton/blob/master/python/test/regression/test_performance.py#L54-L71; int8 is always at least 2x slower than fp16.
FYI, maybe it's because I'm using dev mode to build the kernel; we are still trying to resolve a build issue in opt mode. The error right now is: "NotADirectoryError: [Errno 20] Not a directory: '/proc/self/fd/3/triton/code_gen.pyc'". Please let me know if you can help with this as well, thanks.
Hi Jerry!
Can you copy and paste the call stack for this error?
Using dev or opt mode doesn't necessarily affect the kernel performance, as long as you are not measuring the warmup runs.
Hi @Jokeren, here is the stack trace: https://gist.github.com/jerryzh168/f44c46dc0884ba8939a40e06b76f9d94, it works on dev mode btw.
I think the problem is tricky but obvious. code/triton/python/triton/code_gen.py is what I got for triton.code_gen.__file__, but your __file__ refers to a .pyc file. According to https://peps.python.org/pep-3147/#file, __file__ will always point to its source .py file in Python 3.
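A quick way to check which file a given environment resolves (a minimal sketch, assuming a Triton version from this era where triton/code_gen.py still exists):

```python
import triton.code_gen

# Under Python 3 (PEP 3147) this should print the .py source path,
# e.g. .../triton/code_gen.py, not a cached .pyc under __pycache__.
print(triton.code_gen.__file__)
```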
Ah, thanks. The PEP says "(in Python 2, it points to the pyc file)", so maybe we are using Python 2?
The stack trace at the top shows you were using Python 3, so it is confusing to me...
/usr/local/fbcode/platform010/lib/python3.8/runpy.py
Right, I changed the command a bit and it works now in opt mode as well; the perf is indeed the same as dev mode... any ideas on how to proceed?
I think it might be because of GPU frequency locking; I'm not able to lock the GPU clocks or memory frequency, so most of the regression tests failed: https://gist.github.com/jerryzh168/d831ce3fe0262628965eebfe6f77f07f
I found the problem; it looks like it's because of the layout (strides of the weight): https://github.com/openai/triton/blob/master/python/test/regression/test_performance.py#L97. Is this just because of the way that specific matmul kernel was written, or is it something related to Triton codegen?
Can you share the kernel here? I think int8 matmul performs best when a is row-major and b is col-major (this also applies to cublas & cutlass).
As mentioned by @daadaada
Yeah, I didn't understand what that means concretely before. And it looks like this restriction comes from the Tensor Cores themselves?
I suppose it's indeed implementation-related. Da can provide you with more information.
Yep, Ampere Tensor Cores require that layout, and transposing int8 matrices is slow. So make sure a is row-major and b is col-major in matmul(a, b).
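Concretely, a minimal sketch of setting up that layout from PyTorch (assuming triton.ops.matmul with int8 inputs as used earlier in this thread; the .t() view gives b column-major strides without launching an explicit int8 transpose):

```python
import torch
import triton.ops

M, N, K = 4096, 4096, 4096

# a: row-major int8
a = torch.randint(-128, 127, (M, K), device="cuda", dtype=torch.int8)

# b: allocate the transposed (N, K) buffer contiguously, then view it as
# (K, N); the view is column-major, which is what the int8 Tensor Core path wants.
b = torch.randint(-128, 127, (N, K), device="cuda", dtype=torch.int8).t()

assert a.stride() == (K, 1)  # row-major
assert b.stride() == (1, K)  # col-major

c = triton.ops.matmul(a, b)
```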
Resolved. Thanks everyone for the help!
@jerryzh168 Did changing the layout of the matrices make int8 matmul faster? I am still getting worse latency using int8. Can you share your experience here?
@jerryzh168 I am keen to know if you achieved better perf with INT8.
@linxihui @david-macleod Yes. I changed the layout and the latency went from 1.44ms to 0.286ms for (M, N, K) = (4096, 4096, 4096). fp16 is 0.51ms.
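As a rough sanity check, plugging those reported numbers into the usual 2*M*N*K rate formula:

```python
flop = 2 * 4096 ** 3              # (M, N, K) = (4096, 4096, 4096)
print(flop / 1.44e-3 / 1e12)      # int8 before the layout change: ~95 TOP/s
print(flop / 0.286e-3 / 1e12)     # int8 after the layout change:  ~481 TOP/s
print(flop / 0.51e-3 / 1e12)      # fp16:                          ~270 TF/s
```

So after the layout fix, the int8 rate ends up well above the fp16 rate, as expected.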
Hi,
I compared the matmul perf in fp16 and int8, using the tutorial code in https://triton-lang.org/master/getting-started/tutorials/03-matrix-multiplication.html#sphx-glr-getting-started-tutorials-03-matrix-multiplication-py, and got the following result:
on an A100 GPU. So for fp16 the TF/s is reasonable since the peak is 314 TF/s with Tensor Cores, but for int8 it seems to be off by a lot. Is this expected?