jerryzh168 closed this issue 2 years ago.
Oh, maybe it's because there is still no support for Tensor Cores? https://github.com/openai/triton/issues/275
Maybe the matrices are too small? There are microbenchmarks here https://github.com/openai/triton/blob/master/python/test/regression/test_performance.py and the perf should be pretty good.
Thanks for the quick response. Just updated the sizes; it's relatively small in the K dimension, do I need some special handling for that? I'll try benchmarking the implementation in https://github.com/openai/triton/blob/master/python/triton/ops/matmul.py to see if I can get better perf.
It improved slightly when I used the official matmul.py from triton/ops:
----------------------------------------------------------------
M N K Time(s) Rate(TF/s)
----------------------------------------------------------------
38400, 4096, 1024, 0.001486 216.826
38400, 4096, 1024, 0.002692 119.674
but it does not match typical speedup numbers in https://github.com/openai/triton/blob/master/python/test/regression/test_performance.py#L54-L71
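For reference, a minimal timing sketch of the comparison (assuming triton.ops.matmul accepts int8 inputs, as used elsewhere in this thread); the 2*M*N*K / time formula reproduces the Rate(TF/s) column above, e.g. 2*38400*4096*1024 / 0.001486 / 1e12 ≈ 216.8:

```python
import torch
import triton.ops

M, N, K = 38400, 4096, 1024

def bench(a, b, iters=100):
    # Warm up so compilation/autotuning is not part of the measurement.
    for _ in range(10):
        triton.ops.matmul(a, b)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        triton.ops.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    t = start.elapsed_time(end) / 1e3 / iters  # seconds per matmul
    print(f"{t:.6f} s, {2 * M * N * K / t / 1e12:.3f} TF/s")

a16 = torch.randn((M, K), device="cuda", dtype=torch.float16)
b16 = torch.randn((K, N), device="cuda", dtype=torch.float16)
a8 = torch.randint(-128, 127, (M, K), device="cuda", dtype=torch.int8)
b8 = torch.randint(-128, 127, (K, N), device="cuda", dtype=torch.int8)

bench(a16, b16)
bench(a8, b8)
```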
@jerryzh168 Can you share the kernel here? I think int8 matmul performs best when a is row-major and b is col-major (this also applies to cublas & cutlass).
Hi, I'm copy-pasting this kernel: https://github.com/openai/triton/blob/master/python/triton/ops/matmul.py and could not reproduce the perf results in https://github.com/openai/triton/blob/master/python/test/regression/test_performance.py#L54-L71; int8 is always at least 2x slower than fp16.
FYI, maybe it's because I'm using dev mode to build the kernel; we are still trying to resolve a build issue in opt mode. The error right now is: "NotADirectoryError: [Errno 20] Not a directory: '/proc/self/fd/3/triton/code_gen.pyc'". Please let me know if you can help with this as well, thanks.
Hi Jerry!
Can you copy and paste the call stack for this error?
Using dev or opt mode doesn't necessarily affect the kernel performance, as long as you are not measuring the warmup runs.
Hi @Jokeren, here is the stack trace: https://gist.github.com/jerryzh168/f44c46dc0884ba8939a40e06b76f9d94, it works on dev mode btw.
I think the problem is tricky but obvious. code/triton/python/triton/code_gen.py is what I got for triton.code_gen.__file__, but your __file__ refers to a .pyc file. According to https://peps.python.org/pep-3147/#file, __file__ will always point to its source .py file in Python 3.
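A quick way to check which file a given environment resolves (a minimal sketch, assuming a Triton version from this era where triton/code_gen.py still exists):

```python
import triton.code_gen

# Under Python 3 (PEP 3147) this should print the .py source path,
# e.g. .../triton/code_gen.py, not a cached .pyc under __pycache__.
print(triton.code_gen.__file__)
```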
Ah, thanks. The PEP says "(in Python 2, it points to the pyc file)", so maybe we are using Python 2?
The stack trace at the top shows you were using Python 3, so it is confusing to me...
/usr/local/fbcode/platform010/lib/python3.8/runpy.py
Right, I changed the command a bit and it works now in opt mode as well; the perf is indeed the same as dev mode... any ideas on how to proceed?
I think it might be because of GPU frequency locking; I'm not able to lock the GPU clocks or memory frequency, so most of the regression tests failed: https://gist.github.com/jerryzh168/d831ce3fe0262628965eebfe6f77f07f
I found the problem; it looks like it's because of the layout (strides of the weight): https://github.com/openai/triton/blob/master/python/test/regression/test_performance.py#L97. Is this just because of the way that specific matmul kernel was written, or is it something related to Triton codegen?
Can you share the kernel here? I think int8 matmul performs best when a is row-major and b is col-major (this also applies to cublas & cutlass).
As mentioned by @daadaada
Yeah, I didn't understand what that means concretely before. And it looks like this restriction comes from the Tensor Cores themselves?
I suppose it's indeed implementation-related. Da can provide you with more information.
Yep, Ampere Tensor Cores require that layout, and transposing int8 matrices is slow. So make sure a is row-major and b is col-major in matmul(a, b).
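Concretely, a minimal sketch of setting up that layout from PyTorch (assuming triton.ops.matmul with int8 inputs as used earlier in this thread; the .t() view gives b column-major strides without launching an explicit int8 transpose):

```python
import torch
import triton.ops

M, N, K = 4096, 4096, 4096

# a: row-major int8
a = torch.randint(-128, 127, (M, K), device="cuda", dtype=torch.int8)

# b: allocate the transposed (N, K) buffer contiguously, then view it as
# (K, N); the view is column-major, which is what the int8 Tensor Core path wants.
b = torch.randint(-128, 127, (N, K), device="cuda", dtype=torch.int8).t()

assert a.stride() == (K, 1)  # row-major
assert b.stride() == (1, K)  # col-major

c = triton.ops.matmul(a, b)
```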
Resolved. Thanks everyone for the help!
@jerryzh168 Did changing the layout of the matrices make int8 matmul faster? I am still getting worse latency using int8. Can you share your experience here?
@jerryzh168 I am keen to know if you achieved better perf with INT8.
@linxihui @david-macleod Yes. I changed the layout and the latency went from 1.44ms to 0.286ms for (M, N, K) = (4096, 4096, 4096). fp16 is 0.51ms.
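As a rough sanity check, plugging those reported numbers into the usual 2*M*N*K rate formula:

```python
flop = 2 * 4096 ** 3              # (M, N, K) = (4096, 4096, 4096)
print(flop / 1.44e-3 / 1e12)      # int8 before the layout change: ~95 TOP/s
print(flop / 0.286e-3 / 1e12)     # int8 after the layout change:  ~481 TOP/s
print(flop / 0.51e-3 / 1e12)      # fp16:                          ~270 TF/s
```

So after the layout fix, the int8 rate ends up well above the fp16 rate, as expected.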
Hi,
I compared the matmul perf in fp16 and int8, using the tutorial code in https://triton-lang.org/master/getting-started/tutorials/03-matrix-multiplication.html#sphx-glr-getting-started-tutorials-03-matrix-multiplication-py, and got the following result:
on an A100 GPU. So for fp16 the TF/s is reasonable since the peak is 314 TF/s with Tensor Cores, but for int8 it seems to be off by a lot. Is this expected?