huggingface / optimum-nvidia


batched generation #93

Closed Jack000 closed 5 months ago

Jack000 commented 5 months ago

I got the sample code working, but when I try:

model_inputs = tokenizer(2*["How is autonomous vehicle technology transforming the future of transportation and urban planning?"], return_tensors="pt").to("cuda")

model.generate( ...

I get

Traceback (most recent call last):
  File "/root/generate.py", line 15, in <module>
    generated_ids = model.generate(
  File "/usr/local/lib/python3.10/dist-packages/optimum/nvidia/runtime.py", line 175, in generate
    self._session.generate(trt_outputs, trt_inputs, generation_config)
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: numMicroBatches <= mMicroBatchConfig.numGenBatches (/src/tensorrt_llm/cpp/tensorrt_llm/runtime/gptSession.cpp:684)
...

Is batched generation not supported, or am I doing something wrong?

Edit: I figured it out. Apparently you need to set max_batch_size when loading the model; otherwise it defaults to 1:

from optimum.nvidia import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
  "meta-llama/Llama-2-7b-chat-hf",
  use_fp8=True,
  max_batch_size=2,  # must cover the largest batch you intend to pass to generate()
)
joaosobreira123 commented 4 months ago

Hey Jack, sorry for bothering you here, but I couldn't find another contact. I would like to hire huemint at an enterprise level; do you have an email address so we can talk? My email is joao@advolve.ai