stephen-youn opened 2 years ago
Not that I know of
There is no branch at the moment. If INT4 proves widely useful, we'll add it. Just a bit tricky because AFAIK LLVM doesn't have a type for it.
@ptillet maybe you have seen this recent paper -> https://openreview.net/forum?id=tcbBPnfwxS (link to CUDA code included for reproduction): INT4 quantization of very large models with little hit on performance, and... no need to retrain (🎉)! With bitsandbytes and similar work, I guess these approaches will replace the old QAT pipelines for LLMs at some point.
One limitation, however:

> In terms of limitations, our method currently does not provide speedups for the actual multiplications, due to the lack of hardware support for mixed-precision operands (e.g. FP16 x INT4) on mainstream architectures. Moreover, our current results do not include activation quantization, as they are not a significant bottleneck in our target scenarios; however, this can be supported using orthogonal techniques (Yao et al., 2022).
Is your last comment because you are reusing the types available out of the box from MLIR/LLVM?
@ptillet any updates on if INT4 support is on the roadmap?
My understanding is that H100 doesn't support INT4, so it's not on the roadmap at all. It could be useful as a storage type, though, but I'm hoping we can achieve that via better CUDA interop that lets users directly process 8x int4 values packed into 1x int32.
Thanks for the update @ptillet
Can you elaborate on what you mean by CUDA interop, and how that would relate to storing weights in INT4 and then dequantizing back to INT8/FP16 etc. at runtime?
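For readers unfamiliar with the "8x int4 packed into 1x int32" storage idea mentioned above, here is a minimal plain-Python sketch. The function names, the signed nibble convention, and the per-tensor `scale` dequantization step are illustrative assumptions, not Triton or CUDA APIs:

```python
def pack_int4(values):
    """Pack eight signed 4-bit values (each in [-8, 7]) into one 32-bit word."""
    assert len(values) == 8
    word = 0
    for i, v in enumerate(values):
        assert -8 <= v <= 7
        word |= (v & 0xF) << (4 * i)  # keep only the low 4 bits of each value
    return word

def unpack_int4(word):
    """Recover the eight signed 4-bit values from one 32-bit word."""
    out = []
    for i in range(8):
        nibble = (word >> (4 * i)) & 0xF
        # sign-extend: nibbles 8..15 represent -8..-1 in two's complement
        out.append(nibble - 16 if nibble >= 8 else nibble)
    return out

def dequantize(word, scale):
    """At runtime, expand packed INT4 weights to floats (e.g. for an FP16 matmul)."""
    return [v * scale for v in unpack_int4(word)]

vals = [3, -8, 7, 0, -1, 5, -4, 2]
w = pack_int4(vals)
assert unpack_int4(w) == vals  # lossless round-trip
```

In other words, INT4 acts only as a compact storage format: a kernel would load `int32` words, unpack and dequantize them in registers, and do the actual math in a hardware-supported type such as FP16 or INT8.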
Hi, any updates on the issue? There are some projects showing the viability of using INT4 for compute (e.g. https://github.com/efeslab/Atom), and it would be really awesome if it were supported in Triton.
Are there any recent plans to support INT4? Thank you.
https://gist.github.com/jlebar/3435b2c00deea53258887ce37231e5e2
Hi, is there a plan or ongoing work to enable the int4 data type in Triton on A100? If there is already a branch, can I give it a try? Thanks