huggingface / optimum-nvidia


batched generation #93

Closed Jack000 closed 5 months ago

Jack000 commented 5 months ago

I got the sample code working, but when I try:

model_inputs = tokenizer(2*["How is autonomous vehicle technology transforming the future of transportation and urban planning?"], return_tensors="pt").to("cuda")

model.generate( ...

I get

Traceback (most recent call last):
  File "/root/generate.py", line 15, in <module>
    generated_ids = model.generate(
  File "/usr/local/lib/python3.10/dist-packages/optimum/nvidia/runtime.py", line 175, in generate
    self._session.generate(trt_outputs, trt_inputs, generation_config)
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: numMicroBatches <= mMicroBatchConfig.numGenBatches (/src/tensorrt_llm/cpp/tensorrt_llm/runtime/gptSession.cpp:684)
...

Is batched generation not supported, or am I doing something wrong?

Edit: I figured it out. Apparently you need to set max_batch_size when loading the model; otherwise it defaults to 1:

from optimum.nvidia import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
  "meta-llama/Llama-2-7b-chat-hf",
  use_fp8=True,
  max_batch_size=2,  # must cover the largest batch you intend to pass to generate()
)
joaosobreira123 commented 4 months ago

Hey Jack, sorry for bothering you here, but I couldn't find another contact. I would like to hire huemint at an enterprise level; do you have an email address so we can talk? My email is joao@advolve.ai