model_inputs = tokenizer(
    2 * ["How is autonomous vehicle technology transforming the future of transportation and urban planning?"],
    return_tensors="pt",
).to("cuda")
generated_ids = model.generate( ...
I get
Traceback (most recent call last):
File "/root/generate.py", line 15, in <module>
generated_ids = model.generate(
File "/usr/local/lib/python3.10/dist-packages/optimum/nvidia/runtime.py", line 175, in generate
self._session.generate(trt_outputs, trt_inputs, generation_config)
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: numMicroBatches <= mMicroBatchConfig.numGenBatches (/src/tensorrt_llm/cpp/tensorrt_llm/runtime/gptSession.cpp:684)
...
Is batched generation not supported, or am I doing something wrong?
edit: I figured it out. Apparently you need to set max_batch_size when loading the model; otherwise it defaults to 1:
from optimum.nvidia import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    use_fp8=True,
    max_batch_size=2,
)
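To make the failure mode concrete, here is a minimal Python sketch of the capacity check that trips the assertion. The names and error message are illustrative, not TensorRT-LLM's actual internals: the point is that the engine is compiled for a fixed maximum batch size, and a runtime batch larger than that fails an assertion rather than being split automatically.

```python
# Illustrative sketch (hypothetical names): the engine carries a fixed
# max_batch_size baked in at load/build time; oversized requests fail.

def check_batch(runtime_batch_size: int, max_batch_size: int) -> None:
    # Mirrors the spirit of the gptSession assertion: the incoming batch
    # must fit within the capacity the engine was configured for.
    if runtime_batch_size > max_batch_size:
        raise RuntimeError(
            f"batch of {runtime_batch_size} exceeds engine max_batch_size "
            f"{max_batch_size}; load the model with a larger max_batch_size"
        )

check_batch(2, 2)      # fine: model loaded with max_batch_size=2
try:
    check_batch(2, 1)  # reproduces the failure: default capacity is 1
except RuntimeError as e:
    print("error:", e)
```

So passing two prompts to a model loaded with the default capacity of 1 hits exactly this check, which is why setting max_batch_size=2 at load time fixes it.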