Closed alex4321 closed 1 year ago
P.S. I am using this branch: https://github.com/johnsmith0031/alpaca_lora_4bit/tree/winglian-setup_pip
Okay, I see the following Triton call:

```python
# In testings using only 40 reps seems to be close enough and it appears to be what PyTorch uses
# PyTorch also sets fast_flush to True, but I didn't see any speedup so I'll leave the default
return triton.testing.do_bench(kernel_call, rep=40)
```
As well as the following in the Triton code:

```python
def do_bench(fn, warmup=25, rep=100, grad_to_none=None,
             percentiles=(0.5, 0.2, 0.8),
             record_clocks=False, fast_flush=False):
    ...
    times = torch.tensor([s.elapsed_time(e) for s, e in zip(start_event, end_event)])
    if percentiles:
        percentiles = torch.quantile(times, torch.tensor(percentiles)).tolist()
        return tuple(percentiles)
    else:
        return torch.mean(times).item()
```
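In other words, with the default `percentiles=(0.5, 0.2, 0.8)` a plain `do_bench(kernel_call, rep=40)` call returns a 3-tuple of percentiles, not a scalar. A pure-Python sketch of just the return logic (the names `do_bench_sketch` and `_quantile` are mine; the real function times CUDA events rather than taking a list of timings):

```python
def _quantile(sorted_xs, q):
    # Linear-interpolation quantile, matching torch.quantile's default behavior.
    pos = q * (len(sorted_xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (pos - lo) * (sorted_xs[hi] - sorted_xs[lo])

def do_bench_sketch(times, percentiles=(0.5, 0.2, 0.8)):
    # Mimics Triton 2.0.0's do_bench return shape: a tuple of percentiles
    # by default, a scalar mean only when percentiles is falsy.
    xs = sorted(times)
    if percentiles:
        return tuple(_quantile(xs, q) for q in percentiles)
    return sum(xs) / len(xs)

print(do_bench_sketch([1.0, 2.0, 3.0, 4.0, 5.0]))                   # (3.0, 1.8, 4.2)
print(do_bench_sketch([1.0, 2.0, 3.0, 4.0, 5.0], percentiles=None))  # 3.0
```

So any caller that treats the result as a single number breaks under this version.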
I don't know, maybe earlier Triton versions returned just a single item. I'll check that later and, if so, submit a fix pull request.
P.S. My Triton version:

```
(longdocchat) alex4321@alex4321-B450-AORUS-ELITE:~/Documents/longdocchat/llama-wrapper$ pip show triton
Name: triton
Version: 2.0.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/openai/triton/
Author: Philippe Tillet
Author-email: phil@openai.com
License:
Location: /home/alex4321/anaconda3/envs/longdocchat/lib/python3.11/site-packages
Requires: cmake, filelock, lit, torch
Required-by:
```
P.S. It's probably quite the reverse: https://github.com/openai/triton/blame/main/python/triton/testing.py#L19
```python
def do_bench(fn, warmup=25, rep=100, grad_to_none=None,
             quantiles=None,
             fast_flush=True,
             return_mode="mean"):
```
So the current Triton code defaults to `quantiles=None`.
Although pip downloaded an earlier version for me while installing dependencies. Now I'll try pinning 2.0.0.post1 specifically.
Okay, 2.0.0.post1 still has this:

```python
def do_bench(fn, warmup=25, rep=100, grad_to_none=None,
             percentiles=(0.5, 0.2, 0.8),
             record_clocks=False, fast_flush=False):
```
So, unless I've missed something, a fix is still needed to make it work with the currently pip-published versions.
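One way such a fix could look: normalize whatever `do_bench` returns to a single scalar, so the caller works under both the pip-published releases and current main. A sketch (the helper name is mine, not the actual fix in the PR below):

```python
def normalize_bench_result(result):
    # Triton 2.0.0 / 2.0.0.post1's do_bench returns (p50, p20, p80) by
    # default, while current main returns a scalar mean. Reduce either
    # shape to one scalar by taking the median (first tuple element).
    if isinstance(result, tuple):
        return result[0]
    return result

print(normalize_bench_result((3.0, 1.8, 4.2)))  # 3.0
print(normalize_bench_result(2.5))              # 2.5
```

Usage would be along the lines of `ms = normalize_bench_result(triton.testing.do_bench(kernel_call, rep=40))`.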
Since the corresponding PR was accepted (https://github.com/johnsmith0031/alpaca_lora_4bit/pull/102), I guess I'll close the issue.
I am using the weights I downloaded here: https://huggingface.co/sardukar/llama13b-4bit-v2
I initialized the monkeypatches with code which does the following sequence of actions:
The output was:
Then:
Now I am trying to generate some completion:
But during the call of the `generate` method I am getting the following stacktrace. Besides, if I try to use the `v1` llama model, no such error is thrown.