johnsmith0031 / alpaca_lora_4bit

MIT License

TypeError: '<' not supported between instances of 'tuple' and 'float' while trying to generate a completion with the v2 LLaMA 13B 4-bit model #101

Closed alex4321 closed 1 year ago

alex4321 commented 1 year ago

I am using the weights I downloaded here: https://huggingface.co/sardukar/llama13b-4bit-v2

I initialized the monkey patches with code that performs the following sequence of actions:

from alpaca_lora_4bit.monkeypatch.peft_tuners_lora_monkey_patch import replace_peft_model_with_int4_lora_model

replace_peft_model_with_int4_lora_model()

from alpaca_lora_4bit import autograd_4bit
autograd_4bit.switch_backend_to("triton")

The output was:

Using Triton implementation.

Then:

# import path assumed from the alpaca_lora_4bit package layout
from alpaca_lora_4bit.autograd_4bit import load_llama_model_4bit_low_ram, model_to_half

model, tokenizer = load_llama_model_4bit_low_ram(
    config_path="../llama13b-4bit-v2/",
    model_path="../llama13b-4bit-v2/llama13b-4bit-v2.safetensors",
    groupsize=-1,
    is_v1_model=False,
)
model_to_half(model)

Loading Model ...
The safetensors archive passed at ../llama13b-4bit-v2/llama13b-4bit-v2.safetensors does not contain metadata. Make sure to save your model with the save_pretrained method. Defaulting to 'pt' metadata.
Loaded the model in 3.21 seconds.
Converted as Half.

Now I am trying to generate a completion:

import time

import torch
from alpaca_lora_4bit.amp_wrapper import AMPWrapper  # import path as it appears in the traceback below

wrapper = AMPWrapper(model)
wrapper.apply_generate()

prompt = '''I think the meaning of life is'''
batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
batch = {k: v.cuda() for k, v in batch.items()}

start = time.time()
with torch.no_grad():
    generated = model.generate(inputs=batch["input_ids"],
                               do_sample=True,
                               use_cache=True,
                               repetition_penalty=1.1,
                               max_new_tokens=20,
                               temperature=0.9,
                               top_p=0.95,
                               top_k=40,
                               return_dict_in_generate=True,
                               output_attentions=False,
                               output_hidden_states=False,
                               output_scores=False)
result_text = tokenizer.decode(generated['sequences'].cpu().tolist()[0])
end = time.time()

But during the call to the generate method I get the following stack trace:

TypeError                                 Traceback (most recent call last)
Cell In[12], line 3
      1 start = time.time()
      2 with torch.no_grad():
----> 3     generated = model.generate(inputs=batch["input_ids"],
      4                                do_sample=True,
      5                                use_cache=True,
      6                                repetition_penalty=1.1,
      7                                max_new_tokens=20,
      8                                temperature=0.9,
      9                                top_p=0.95,
     10                                top_k=40,
     11                                return_dict_in_generate=True,
     12                                output_attentions=False,
     13                                output_hidden_states=False,
     14                                output_scores=False)
     15 result_text = tokenizer.decode(generated['sequences'].cpu().tolist()[0])
     16 end = time.time()

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/alpaca_lora_4bit/amp_wrapper.py:18, in AMPWrapper.autocast_generate(self, *args, **kwargs)
     16 def autocast_generate(self, *args, **kwargs):
     17     with torch.amp.autocast(**self.options):
---> 18         return self.model.non_autocast_generate(*args, **kwargs)

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/transformers/generation/utils.py:1565, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, **kwargs)
   1557     input_ids, model_kwargs = self._expand_inputs_for_generation(
   1558         input_ids=input_ids,
   1559         expand_size=generation_config.num_return_sequences,
   1560         is_encoder_decoder=self.config.is_encoder_decoder,
   1561         **model_kwargs,
   1562     )
   1564     # 13. run sample
-> 1565     return self.sample(
   1566         input_ids,
   1567         logits_processor=logits_processor,
   1568         logits_warper=logits_warper,
   1569         stopping_criteria=stopping_criteria,
   1570         pad_token_id=generation_config.pad_token_id,
   1571         eos_token_id=generation_config.eos_token_id,
   1572         output_scores=generation_config.output_scores,
   1573         return_dict_in_generate=generation_config.return_dict_in_generate,
   1574         synced_gpus=synced_gpus,
   1575         streamer=streamer,
   1576         **model_kwargs,
   1577     )
   1579 elif is_beam_gen_mode:
   1580     if generation_config.num_return_sequences > generation_config.num_beams:

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/transformers/generation/utils.py:2612, in GenerationMixin.sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
   2609 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
   2611 # forward pass to get next token
-> 2612 outputs = self(
   2613     **model_inputs,
   2614     return_dict=True,
   2615     output_attentions=output_attentions,
   2616     output_hidden_states=output_hidden_states,
   2617 )
   2619 if synced_gpus and this_peer_finished:
   2620     continue  # don't waste resources running the code we don't need

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py:688, in LlamaForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
    685 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    687 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
--> 688 outputs = self.model(
    689     input_ids=input_ids,
    690     attention_mask=attention_mask,
    691     position_ids=position_ids,
    692     past_key_values=past_key_values,
    693     inputs_embeds=inputs_embeds,
    694     use_cache=use_cache,
    695     output_attentions=output_attentions,
    696     output_hidden_states=output_hidden_states,
    697     return_dict=return_dict,
    698 )
    700 hidden_states = outputs[0]
    701 logits = self.lm_head(hidden_states)

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py:578, in LlamaModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
    570     layer_outputs = torch.utils.checkpoint.checkpoint(
    571         create_custom_forward(decoder_layer),
    572         hidden_states,
   (...)
    575         None,
    576     )
    577 else:
--> 578     layer_outputs = decoder_layer(
    579         hidden_states,
    580         attention_mask=attention_mask,
    581         position_ids=position_ids,
    582         past_key_value=past_key_value,
    583         output_attentions=output_attentions,
    584         use_cache=use_cache,
    585     )
    587 hidden_states = layer_outputs[0]
    589 if use_cache:

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py:293, in LlamaDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
    290 hidden_states = self.input_layernorm(hidden_states)
    292 # Self Attention
--> 293 hidden_states, self_attn_weights, present_key_value = self.self_attn(
    294     hidden_states=hidden_states,
    295     attention_mask=attention_mask,
    296     position_ids=position_ids,
    297     past_key_value=past_key_value,
    298     output_attentions=output_attentions,
    299     use_cache=use_cache,
    300 )
    301 hidden_states = residual + hidden_states
    303 # Fully Connected

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py:197, in LlamaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
    186 def forward(
    187     self,
    188     hidden_states: torch.Tensor,
   (...)
    193     use_cache: bool = False,
    194 ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
    195     bsz, q_len, _ = hidden_states.size()
--> 197     query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
    198     key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
    199     value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163         output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/alpaca_lora_4bit/autograd_4bit.py:180, in Autograd4bitQuantLinear.forward(self, x)
    176     out = AutogradMatmul4bit.apply(x, self.qweight, self.scales,
    177                                    self.qzeros if not self.is_v1_model else self.zeros,
    178                                    self.g_idx, self.bits, self.maxq)
    179 else:
--> 180     out = matmul4bit_with_backend(x, self.qweight, self.scales,
    181                                   self.qzeros if not self.is_v1_model else self.zeros,
    182                                   self.g_idx, self.bits, self.maxq)
    183 if not self.disable_bias:
    184     out += self.bias

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/alpaca_lora_4bit/autograd_4bit.py:139, in matmul4bit_with_backend(x, qweight, scales, qzeros, g_idx, bits, maxq)
    137 elif backend == 'triton':
    138     assert qzeros.dtype == torch.int32
--> 139     return tu.triton_matmul(x, qweight, scales, qzeros, g_idx, bits, maxq)
    140 else:
    141     raise ValueError('Backend not supported.')

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/alpaca_lora_4bit/triton_utils.py:219, in triton_matmul(input, qweight, scales, qzeros, g_idx, bits, maxq)
    217 output = torch.empty((input.shape[0], qweight.shape[1]), device=scales.device, dtype=torch.float16)
    218 grid = lambda META: (triton.cdiv(input.shape[0], META['BLOCK_SIZE_M']) * triton.cdiv(qweight.shape[1], META['BLOCK_SIZE_N']),)
--> 219 matmul_248_kernel[grid](input, qweight, output,
    220                         scales, qzeros, g_idx,
    221                         input.shape[0], qweight.shape[1], input.shape[1], bits, maxq,
    222                         input.stride(0), input.stride(1),
    223                         qweight.stride(0), qweight.stride(1),
    224                         output.stride(0), output.stride(1),
    225                         scales.stride(0), qzeros.stride(0))
    226 output = output.reshape(outshape)
    227 return output

File ~/anaconda3/envs/longdocchat/lib/python3.11/site-packages/alpaca_lora_4bit/custom_autotune.py:93, in Autotuner.run(self, *args, **kwargs)
     91 bench_end = time.time()
     92 self.bench_time = bench_end - bench_start
---> 93 self.cache[key] = builtins.min(timings, key=timings.get)
     94 self.hook(args)
     95 self.configs_timings = timings

TypeError: '<' not supported between instances of 'tuple' and 'float'
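For context, the failing comparison at the bottom of the trace is easy to reproduce in plain Python. My guess (an assumption on my side, I have not verified it against custom_autotune.py) is that some configs end up with a tuple of percentiles from do_bench while others get a plain float, for example:

# Hypothetical illustration of the failing min() call in Autotuner.run:
# timings maps autotuner configs to their benchmark results.
timings = {
    "config_a": (0.51, 0.20, 0.80),  # tuple of percentiles returned by do_bench
    "config_b": float("inf"),        # plain float, e.g. a fallback for a failed config
}
best = min(timings, key=timings.get)
# TypeError: '<' not supported between instances of 'tuple' and 'float'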

Besides, if I use a v1 LLaMA model, this error is not thrown.

alex4321 commented 1 year ago

P.S. I am using this branch: https://github.com/johnsmith0031/alpaca_lora_4bit/tree/winglian-setup_pip

alex4321 commented 1 year ago

Okay, I see the following Triton call:

            # In testings using only 40 reps seems to be close enough and it appears to be what PyTorch uses
            # PyTorch also sets fast_flush to True, but I didn't see any speedup so I'll leave the default
            return triton.testing.do_bench(kernel_call, rep=40)

As well as the following in the Triton code:

def do_bench(fn, warmup=25, rep=100, grad_to_none=None,
             percentiles=(0.5, 0.2, 0.8),
             record_clocks=False, fast_flush=False):
    ...
    times = torch.tensor([s.elapsed_time(e) for s, e in zip(start_event, end_event)])
    if percentiles:
        percentiles = torch.quantile(times, torch.tensor(percentiles)).tolist()
        return tuple(percentiles)
    else:
        return torch.mean(times).item()

I don't know, maybe earlier Triton versions returned just a single value. I'll look into it later and, if so, make a fix pull request.
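Roughly the kind of fix I have in mind is to collapse whatever do_bench returns into a single number before the autotuner compares timings. A minimal sketch, assuming a tuple ordered as (median, p20, p80); the helper name is made up here, and the actual patch may look different:

def bench_result_to_scalar(result):
    # Collapse a do_bench result to one float so that
    # min(timings, key=timings.get) can compare configs again.
    if isinstance(result, (tuple, list)):
        # With percentiles=(0.5, 0.2, 0.8) the first element is the median time.
        return float(result[0])
    return float(result)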

alex4321 commented 1 year ago

P.S. My Triton version:

(longdocchat) alex4321@alex4321-B450-AORUS-ELITE:~/Documents/longdocchat/llama-wrapper$ pip show triton
Name: triton
Version: 2.0.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/openai/triton/
Author: Philippe Tillet
Author-email: phil@openai.com
License: 
Location: /home/alex4321/anaconda3/envs/longdocchat/lib/python3.11/site-packages
Requires: cmake, filelock, lit, torch
Required-by: 
alex4321 commented 1 year ago

I don't know, maybe earlier Triton versions returned just a single value. I'll look into it later and, if so, make a fix pull request.

P.S. Probably quite the reverse: https://github.com/openai/triton/blame/main/python/triton/testing.py#L19

def do_bench(fn, warmup=25, rep=100, grad_to_none=None,
             quantiles=None,
             fast_flush=True,
             return_mode="mean"):

So the current Triton code has quantiles=None by default.

Although pip installed an earlier version for me when resolving the dependencies.

Now I am trying to pin 2.0.0.post1 specifically.
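That is, requesting that exact version from pip (assuming the pin alone is enough to pull that build):

pip install triton==2.0.0.post1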

alex4321 commented 1 year ago

Okay, 2.0.0.post1 still has this:

def do_bench(fn, warmup=25, rep=100, grad_to_none=None,
             percentiles=(0.5, 0.2, 0.8),
             record_clocks=False, fast_flush=False):

So to make it work with the currently pip-published versions I will still need a fix, unless I missed something.

alex4321 commented 1 year ago

Since the corresponding PR was accepted (https://github.com/johnsmith0031/alpaca_lora_4bit/pull/102), I guess I'll close the issue.