turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

does the benchmark support batch size>1? #304

Closed deltaguo closed 8 months ago

deltaguo commented 9 months ago

In test_benchmark_inference.py I tried changing

ids = torch.randint(0, 31999, (1, max_seq_len - gen_tokens)).cuda()

to

ids = torch.randint(0, 31999, (2, max_seq_len - gen_tokens)).cuda()

and got the following error:

Traceback (most recent call last):
  File "/root/exllama/exllama_231009/test_benchmark_inference.py", line 168, in <module>
    logits = timer("Warmup", lambda: next_logits(ids, lora))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/exllama/exllama_231009/test_benchmark_inference.py", line 56, in timer
    ret = func()
          ^^^^^^
  File "/root/exllama/exllama_231009/test_benchmark_inference.py", line 168, in <lambda>
    logits = timer("Warmup", lambda: next_logits(ids, lora))
                                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/exllama/exllama_231009/test_benchmark_inference.py", line 44, in next_logits
    n_logits = model.forward(input_ids, cache, last_id_only, lora=apply_lora, input_mask=input_mask)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/exllama/exllama_231009/model.py", line 972, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/exllama/exllama_231009/model.py", line 1058, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/exllama/exllama_231009/model.py", line 536, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/exllama/exllama_231009/model.py", line 440, in forward
    new_keys = cache.key_states[self.index].narrow(2, past_len, q_len).narrow(0, 0, bsz)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (0) + length (2) exceeds dimension size (1).

I want to test GPTQ inference with batch size > 1. Is there a way to do this?

turboderp commented 9 months ago

Yes, you'd want to specify the batch size when creating the cache. Change line 137 like so:

cache = ExLlamaCache(model, batch_size = 2)
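
For reference, a minimal sketch of how the two changes fit together, using the names that already appear in the benchmark script and the traceback above (exact line numbers may differ between versions):

batch_size = 2

# The input ids and the cache must be allocated for the same batch size;
# otherwise the .narrow(0, 0, bsz) on the cached key states fails as shown above.
ids = torch.randint(0, 31999, (batch_size, max_seq_len - gen_tokens)).cuda()
cache = ExLlamaCache(model, batch_size = batch_size)

logits = timer("Warmup", lambda: next_logits(ids, lora))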

Note that depending on the model this may use a lot more VRAM, so you might need to reduce the sequence length accordingly.
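
For a rough sense of the scaling: the key/value cache grows linearly with both batch size and sequence length. A back-of-the-envelope estimate, assuming an fp16 cache and standard multi-head attention (the exact layout depends on the model config):

def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size, bytes_per_elem = 2):
    # keys + values: roughly batch_size * seq_len * hidden_size elements each, per layer
    return 2 * num_layers * batch_size * seq_len * hidden_size * bytes_per_elem

# e.g. a 7B Llama model: 32 layers, hidden size 4096, 2048-token cache, batch size 2
print(kv_cache_bytes(32, 4096, 2048, 2) / 1024**3)  # ~2.0 GiB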