Closed: deltaguo closed 8 months ago
Yes, you'd want to specify the batch size when creating the cache. Change line 137 like so:
cache = ExLlamaCache(model, batch_size = 2)
Note that depending on the model this may use a lot more VRAM, so you might need to reduce the sequence length accordingly.
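For reference, here is a minimal sketch of a batched forward pass put together from the pieces above. The import path, the config/forward calls, and the file paths are assumptions based on exllama's model.py, not something confirmed in this thread:

import torch
from model import ExLlama, ExLlamaCache, ExLlamaConfig

# Placeholder paths; point these at your own model files.
config = ExLlamaConfig("/path/to/config.json")
config.model_path = "/path/to/model.safetensors"
model = ExLlama(config)

# Cache sized for two sequences; VRAM use grows with batch_size, so reduce the
# sequence length if you run out of memory.
cache = ExLlamaCache(model, batch_size = 2)

# Two prompts of equal length, batched along dim 0 to match the cache's batch_size.
ids = torch.randint(0, 31999, (2, 128)).cuda()
logits = model.forward(ids, cache)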
test_benchmark_inference.py: I tried to change
ids = torch.randint(0, 31999, (1, max_seq_len - gen_tokens)).cuda()
to
ids = torch.randint(0, 31999, (2, max_seq_len - gen_tokens)).cuda()
but an error was reported. I want to test the effect of GPTQ with batch size > 1. Is there a way to do this?
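A minimal sketch of what the answer above implies for test_benchmark_inference.py: the batch dimension of ids has to match the batch_size passed to ExLlamaCache, otherwise the forward pass fails with a shape mismatch. Variable names are reused from the snippet above; the exact line placement in the script is an assumption:

batch_size = 2

# The cache and the input IDs must agree on the batch dimension.
cache = ExLlamaCache(model, batch_size = batch_size)
ids = torch.randint(0, 31999, (batch_size, max_seq_len - gen_tokens)).cuda()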