winstxnhdw opened this issue 1 month ago
Since you didn't share the full reproduction steps (including how you converted the checkpoint, the exact request you used, and the commit/version/docker image), I tried the long-context evaluation task of TensorRT-LLM on the latest main branch (535c9cc) and I cannot reproduce the accuracy issue. The following are my steps (using an 8k input):
python ./examples/quantization/quantize.py --model_dir Meta-Llama-3.1-8B/ \
--dtype bfloat16 \
--qformat int4_awq \
--awq_block_size 128 \
--output_dir /tmp/llama-3.1/trt_ckpts/int4_awq/ \
--calib_size 32
python -m tensorrt_llm.commands.build --checkpoint_dir /tmp/llama-3.1/trt_ckpts/int4_awq/ \
--output_dir /tmp/llama-3.1/trt_engines/int4_awq/ \
--gpt_attention_plugin bfloat16 \
--gemm_plugin bfloat16 \
--max_num_tokens 131072 \
--max_input_len 131072 \
--max_seq_len 131072 \
--use_paged_context_fmha enable \
--workers 1
python3 examples/infinitebench/construct_synthetic_dataset.py --test_case build_passkey --test_level 0
python examples/eval_long_context.py --task passkey \
--engine_dir /tmp/llama-3.1/trt_engines/int4_awq/ \
--tokenizer_dir Meta-Llama-3.1-8B/ \
--stop_idx 10 \
--max_input_length 8192 \
--enable_chunked_context \
--max_tokens_in_paged_kv_cache 131136
and the results look like this:
[11/21/2024-09:35:49] [TRT-LLM] [I] Load engine takes: 4.858942270278931 sec
[11/21/2024-09:35:49] [TRT-LLM] [I] ==== Evaluation ====
[11/21/2024-09:35:49] [TRT-LLM] [I] # examples: 275
[11/21/2024-09:35:49] [TRT-LLM] [I] Start index: 0
[11/21/2024-09:35:49] [TRT-LLM] [I] Stop index: 10
[11/21/2024-09:35:49] [TRT-LLM] [I] Max tokens: 6
[11/21/2024-09:35:58] [TRT-LLM] [I] Compute the score
10it [00:00, 26329.59it/s]
[11/21/2024-09:35:58] [TRT-LLM] [I] Evaluation takes: 8.512326717376709 sec.
[11/21/2024-09:35:58] [TRT-LLM] [I] accuracy of 10 examples: 1.0
[TensorRT-LLM][INFO] Refreshed the MPI local session
Can you try the evaluation task first?
Hey @byshiue,
These are my quantisation arguments.
python quantize.py --model_dir /Meta-Llama-3.1-8B-Instruct \
--output_dir /Meta-Llama-3.1-8B-Instruct-AWQ \
--dtype bfloat16 \
--qformat int4_awq \
--awq_block_size 64
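If it helps to isolate the difference, the equivalent run with the settings byshiue used above (block size 128 and an explicit calib_size) against the Instruct checkpoint would look something like this; the output path is just a placeholder, not something I have actually run:
python quantize.py --model_dir /Meta-Llama-3.1-8B-Instruct \
    --output_dir /Meta-Llama-3.1-8B-Instruct-AWQ-128 \
    --dtype bfloat16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --calib_size 32  # block size and calib_size mirror byshiue's command; output dir is hypothetical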
The container tag I am using is 24.10-trtllm-python-py3. I am not able to run the evaluation task because my company's proxy blocks MPI from being installed. You should be able to replicate the issue with any long-context input. I am also using the Instruct model, not the base model.
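For anyone stuck behind the same proxy, the workaround I would try is an offline install of mpi4py, assuming that is the piece being blocked and that you can copy files over from a machine with internet access (both of those are assumptions on my part):
# on a machine with internet access, using the same Python version as the container
pip download mpi4py -d ./wheels
# copy ./wheels into the container, then install without going through the proxy
pip install --no-index --find-links ./wheels mpi4py
This also relies on the container already shipping an MPI toolchain (mpicc) for mpi4py to build against, which I have not verified.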
System Info
NVIDIA A100 40 GB
Who can help?
@byshiue @ka
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
Llama 3.1 should be able to handle up to 131072 tokens, and according to the example here, NVIDIA has demonstrated this to be possible, at least on the 405B-parameter variant.
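As a quick sanity check on the model side, the advertised context window can be read straight from the checkpoint config (this assumes transformers is installed and the path points at the local Instruct checkpoint):
python3 -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('/Meta-Llama-3.1-8B-Instruct').max_position_embeddings)"  # Llama 3.1 reports 131072 here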
Actual behavior
Additional notes
I am using the inflight_batcher_llm repository, and I have tried toggling enable_chunked_context on and off.
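For reference, when toggling chunked context in the inflight_batcher_llm setup, my understanding is that the switch ends up as a parameter in the tensorrt_llm model's config.pbtxt, usually filled in via the backend's fill_template.py helper. A rough sketch only; the model-repo path and the exact template variable names may differ across tensorrtllm_backend versions, and the 131136 value simply echoes the eval command above:
# fill only the chunked-context and KV-cache-size parameters in the template, leaving the rest untouched
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "enable_chunked_context:true,max_tokens_in_paged_kv_cache:131136"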