Closed qw1319 closed 1 month ago
I haven't tested the kfactor=32 kernels in llama.cpp end-to-end inference. However, the results appear to be correct, as indicated by python tools/profile.py -k qgemm_lut and python tools/profile.py -k preprocessor, given kfactor=32, MKNs = [[4096, 4096, 1]], act_group_size=64, group_size=128.
I mean it cannot get the right answer (the generated tokens are wrong); it is not a performance issue.
Setting verify=True in tools/profile.py will verify the results against _reference (implemented using numpy). If there is no assertion error, it indicates the computation obtains correct results.
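To illustrate what such a verify pass amounts to, here is a minimal, hedged sketch: weights are quantized group-wise along K (group_size groups), a simulated low-bit kernel dequantizes and multiplies, and the output is checked against a NumPy reference on the same quantized weights with np.testing.assert_allclose. All function names and shapes below are illustrative stand-ins, not the repo's actual API.

```python
import numpy as np

def quantize_groups(W, group_size=128, bits=2):
    """Group-wise symmetric quantization along the K axis (illustrative)."""
    K, N = W.shape
    qmax = 2 ** (bits - 1) - 1 or 1
    Wg = W.reshape(K // group_size, group_size, N)
    scales = np.abs(Wg).max(axis=1, keepdims=True) / qmax
    Q = np.clip(np.round(Wg / scales), -qmax - 1, qmax)
    return Q, scales

def simulated_kernel(A, Q, scales):
    """Stand-in for the tuned qgemm_lut kernel: dequantize then matmul."""
    Wd = (Q * scales).reshape(-1, Q.shape[-1])
    return A @ Wd

def verify(M=1, K=4096, N=4096, group_size=128, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((M, K)).astype(np.float32)
    W = rng.standard_normal((K, N)).astype(np.float32)
    Q, s = quantize_groups(W, group_size)
    out = simulated_kernel(A, Q, s)
    # Reference computed in plain NumPy on the same quantized weights;
    # assert_allclose raises if the kernel output diverges beyond tolerance.
    ref = A @ (Q * s).reshape(K, N)
    np.testing.assert_allclose(out, ref, rtol=1e-3, atol=1e-3)
    return True
```

In this sketch a wrong kernel (e.g. one mis-indexing groups for some kfactor) would trip the assertion, which is the signal the profile.py check relies on.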
Hi, when I extend the tune search space by adding configs with kfactor in (32, 64, 128), tuning selects the kfactor=32 result. But when I run this kernel, I cannot get a right answer. Is this right (the red block)?