huggingface / optimum-nvidia


Error on Quickstart example #143

Open laikhtewari opened 3 months ago

laikhtewari commented 3 months ago

Running on 1x H100 with the latest Docker container from Docker Hub:

>>> fast_pipe = optimum_pipeline('text-generation', 'meta-llama/Meta-Llama-3-8B-Instruct', use_fp8=True)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Fetching 1 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 23563.51it/s]
[07/03/2024-22:25:54] Found pre-built engines at: [PosixPath('/root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/engines')]
[TensorRT-LLM][INFO] Engine version 0.9.0.dev2024031900 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] The logger passed into createInferRuntime differs from one already provided for an existing builder, runtime, or refitter. Uses of the global logger, returned by nvinfer1::getLogger(), will return the existing value.
[TensorRT-LLM][INFO] Loaded engine size: 8665 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +64, now: CPU 40765, GPU 139849 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +70, now: CPU 40766, GPU 139919 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +8661, now: CPU 0, GPU 17321 (MiB)
[TensorRT-LLM][INFO] [MS] Running engine with multi stream info
[TensorRT-LLM][INFO] [MS] Number of aux streams is 1
[TensorRT-LLM][INFO] [MS] Number of total worker streams is 2
[TensorRT-LLM][INFO] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +64, now: CPU 40766, GPU 139887 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +64, now: CPU 40766, GPU 139951 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 17321 (MiB)
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 44416. Allocating 2910846976 bytes.
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 64
>>> fast_pipe("hello, world")
[07/03/2024-22:29:08] Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/optimum/nvidia/pipelines/text_generation.py", line 62, in __call__
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/usr/local/lib/python3.10/dist-packages/optimum/nvidia/pipelines/text_generation.py", line 175, in _forward
    generated_sequence, lengths = self._runtime.generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/nvidia/runtime.py", line 256, in generate
    trt_inputs = ctrrt.GenerationInput(
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. tensorrt_llm.bindings.GenerationInput(end_id: int, pad_id: int, ids: Optional[torch.Tensor], lengths: Optional[torch.Tensor], packed: bool = False)

Invoked with: kwargs: end_id=[128001, 128009], pad_id=128001, ids=tensor([[128000,  15339,     11,   1917]], device='cuda:0', dtype=torch.int32), lengths=tensor([4], device='cuda:0', dtype=torch.int32), packed=True
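The constructor rejects the call because end_id arrives as a list ([128001, 128009]) while the tensorrt_llm.bindings.GenerationInput signature above expects a single int. Llama 3 Instruct declares two end-of-sequence ids in its generation config, which is the most likely source of that list. A minimal way to check this (requires access to the gated repos; the Llama 2 comparison anticipates the comment below):

from transformers import GenerationConfig

# Llama 3 Instruct lists two eos ids, matching the end_id list in the error above.
llama3 = GenerationConfig.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(llama3.eos_token_id)  # expected: [128001, 128009]

# Llama 2 declares a single int, which fits the binding's end_id: int signature.
llama2 = GenerationConfig.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
print(llama2.eos_token_id)  # expected: 2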
laikhtewari commented 3 months ago

This seems to be an issue with this model, as the same workflow succeeded with Llama 2 7B.
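If the list of eos ids is indeed the trigger, a possible workaround is to collapse the end id to a single int before generating. This is an untested sketch: the tokenizer attribute on the pipeline object is an assumption, not a confirmed optimum-nvidia API.

from optimum.nvidia.pipelines import pipeline as optimum_pipeline

fast_pipe = optimum_pipeline('text-generation', 'meta-llama/Meta-Llama-3-8B-Instruct', use_fp8=True)

# Hypothetical: force a single end-of-sequence id so runtime.py hands an int to
# GenerationInput instead of the two-element list Llama 3 declares.
fast_pipe.tokenizer.eos_token_id = 128009  # assumes the pipeline exposes its tokenizer
fast_pipe("hello, world")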