triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

[Bug] Zero temperature curl request affects non-zero temperature requests #464

Closed: Hao-YunDeng closed this issue 4 days ago

Hao-YunDeng commented 1 month ago

System Info

GPU: NVIDIA A100
Driver Version: 545.23.08
CUDA: 12.3

Versions:
TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM.git (commit bf0a5afc92f4b2b3191e9e55073953c1f600cf2d)
tensorrtllm_backend: https://github.com/triton-inference-server/tensorrtllm_backend.git (commit ae52bce3ed8ecea468a16483e0dacd3d156ae4fe)

Model: zephyr-7b-beta

Who can help?

@kaiyux

Reproduction

step 1:

python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir zephyr-7b-beta --output_dir zephyr-7b-beta-converted --dtype float16

step 2:

trtllm-build --checkpoint_dir zephyr-7b-beta-converted --output_dir zephyr-7b-beta-trt-engine --remove_input_padding enable --context_fmha enable --gpt_attention_plugin float16 --gemm_plugin float16 --paged_kv_cache enable --max_num_tokens 65536 --max_batch_size 32 --max_input_len 16384 --strongly_typed

step 3: tensorrtllm_backend parameters:

MODEL_PATH=zephyr-7b-beta
MODEL_PIPELINE_NAME=triton_model_repo
MAX_BATCH_SIZE=32
ENGINE_PATH=zephyr-7b-beta-trt-engine
MAX_ATTENTION_WINDOW_SIZE=4096
KV_CACHE_FREE_GPU_MEM_FRACTION=0.5
batch_scheduler_policy=guaranteed_no_evict

python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:zephyr-7b-beta/,triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:1

python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/postprocessing/config.pbtxt tokenizer_dir:${MODEL_PATH},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:1

python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False

python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}

python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/tensorrt_llm/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,batch_scheduler_policy:${batch_scheduler_policy}

python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/code/tensorrtllm_backend/${MODEL_PIPELINE_NAME} --http_port=8081 --log --log-file ${MODEL_PIPELINE_NAME}_triton_log.txt
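
Before running the curl tests, it can help to confirm that Triton is actually serving. The readiness probe below is a minimal sketch, assuming the requests package and the standard Triton /v2/health/ready endpoint; the port should match the --http_port passed to launch_triton_server.py, or whatever port it is mapped to (the curl examples below use 8888).

# Minimal readiness check (sketch; assumes the requests package and the
# standard Triton /v2/health/ready endpoint).
import requests

TRITON_HTTP = "http://127.0.0.1:8081"  # match --http_port, or the mapped port

resp = requests.get(f"{TRITON_HTTP}/v2/health/ready")
print("server ready" if resp.status_code == 200 else f"not ready: HTTP {resp.status_code}")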

step 4:

A correct curl test (run this in a loop so that you can send a bad request while good ones are in flight; a minimal Python sketch of such a loop is shown after the bad request below):

curl -X POST http://127.0.0.1:8888/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "top_p":1, "top_k":0, "temperature":0.7}'

At the same time, send a bad curl request with zero temperature:

curl -X POST http://127.0.0.1:8888/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "top_p":1, "top_k":0, "temperature":0.0}'
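
For reference, a minimal Python sketch of this reproduction (an illustration, not part of the original commands; it assumes the requests package and the same endpoint and payloads as the curl requests above) that fires the zero-temperature request while the good requests are still in flight:

# Reproduction sketch: a stream of valid requests plus one zero-temperature
# request sent concurrently against the same ensemble endpoint.
import threading
import requests

URL = "http://127.0.0.1:8888/v2/models/ensemble/generate"

def payload(temperature):
    return {
        "text_input": "What is machine learning?",
        "max_tokens": 20,
        "bad_words": "",
        "stop_words": "",
        "pad_id": 2,
        "end_id": 2,
        "top_p": 1,
        "top_k": 0,
        "temperature": temperature,
    }

def send_good_requests(n=50):
    # Valid requests with non-zero temperature.
    for _ in range(n):
        r = requests.post(URL, json=payload(0.7))
        print("good:", r.status_code, r.text[:120])

def send_bad_request():
    # Single invalid request with zero temperature.
    r = requests.post(URL, json=payload(0.0))
    print("bad:", r.status_code, r.text[:120])

if __name__ == "__main__":
    t = threading.Thread(target=send_good_requests)
    t.start()
    send_bad_request()  # fire the bad request while good ones are in flight
    t.join()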

Expected behavior

The good curl request should get a response:

{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\n machinery learning is a subset of artificial intelligence that focuses on enabling computer systems to automatically learn and improve"}

The bad one should return a 400 error.

Actual behavior

Both the good and the bad requests get a 400 error:

400 Client Error: Bad Request for url: http://127.0.0.1:8888/v2/models/ensemble/generate resp: {"error":"in ensemble 'ensemble', Encountered error for requestId 1627051478: Encountered an error in forward function: [TensorRT-LLM][ERROR] Assertion failed: temperature penalty param (0.000000) is out of limits (0.000000, 340282346638528859811704183484516925440.000000] (/app/tensorrt_llm/cpp/tensorrt_llm/layers/fillBuffers.h:64)\n1 0x7f3978267f71 tensorrt_llm::common::throwRuntimeError(char const, int, std::string const&) + 102\n2 0x7f38af8f0624 void tensorrt_llm::layers::FillBuffers::operator()(std::optional<std::vector<float, std::allocator > > const&, float, std::vector<float, std::allocator >&, float, int const, std::pair<float, float> const&, std::string const&) const + 324\n3 0x7f38af8f0c47 tensorrt_llm::layers::DynamicDecodeLayer::setupPenalties(int, int const, tensorrt_llm::layers::DynamicDecodeLayer::SetupParams const&) + 1223\n4 0x7f38af901b17 tensorrt_llm::layers::DynamicDecodeLayer::setup(int, int, int const*, tensorrt_llm::layers::DynamicDecodeLayer::SetupParams const&) + 167\n5 0x7f38af96f04c tensorrt_llm::runtime::GptDecoder::setup(tensorrt_llm::runtime::SamplingConfig const&, unsigned long, int, std::optional<std::shared_ptr > const&) + 572\n6 0x7f38af97dba3 tensorrt_llm::runtime::GptDecoderBatch::newRequests(std::vector<int, std::allocator > const&, std::vector<tensorrt_llm::runtime::decoder_batch::Request, std::allocator > const&, std::vector<tensorrt_llm::runtime::SamplingConfig, std::allocator > const&) + 483\n7 0x7f38afc5fadf tensorrt_llm::batch_manager::TrtGptModelInflightBatching::setupDecoderStep(std::vector<std::shared_ptr, std::allocator<std::shared_ptr > > const&) + 719\n8 0x7f38afc61d0a tensorrt_llm::batch_manager::TrtGptModelInflightBatching::forward(std::list<std::shared_ptr, std::allocator<std::shared_ptr > >&) + 5434\n9 0x7f38afc12854 tensorrt_llm::batch_manager::GptManager::step(std::list<std::shared_ptr, std::allocator<std::shared_ptr > >&, std::set<unsigned long, std::less, std::allocator >&) + 36\n10 0x7f38afc1a984 tensorrt_llm::batch_manager::GptManager::decoupled_execution_loop() + 404\n11 0x7f3989df2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f3989df2253]\n12 0x7f3989b81ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f3989b81ac3]\n13 0x7f3989c13850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f3989c13850]"}

Additional notes

None

byshiue commented 3 weeks ago

Thank you for the report. Zero temperature is currently forbidden in TRT-LLM, so please prevent it on the front end for now. We are working to improve the checking mechanism.
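
Until the improved check lands, a front-end guard along these lines can keep zero-temperature requests from ever reaching Triton. This is a hypothetical sketch, not an official API: the validate_generation_params helper, the MIN_TEMPERATURE floor, and the choice to reject rather than clamp are all illustrative assumptions.

# Hypothetical front-end guard: validate sampling parameters before the
# request body is POSTed to /v2/models/ensemble/generate.
MIN_TEMPERATURE = 1e-5  # arbitrary illustrative floor, not a TRT-LLM constant

def validate_generation_params(params: dict) -> dict:
    temperature = params.get("temperature", 1.0)
    if temperature <= 0.0:
        # Reject outright ...
        raise ValueError(f"temperature must be > 0, got {temperature}")
        # ... or, alternatively, clamp: params["temperature"] = MIN_TEMPERATURE
    return params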

Hao-YunDeng commented 3 weeks ago

Thank you for the report. Zero temperature is currently forbidden in TRT-LLM, so please prevent it on the front end for now. We are working to improve the checking mechanism.

The zero-temperature defense is only one part of the issue. I'm also wondering why a single bad request (zero temperature) breaks good requests (non-zero temperature) that are in flight at the same time. This is very dangerous: in our deployment, those good requests also came back with the "zero temperature" error even though we made sure their temperature was not zero, which misled us for quite a while. Please fix this part as well. Thanks

byshiue commented 3 weeks ago

Thank you for the report. We are aware of this issue and are working on a fix.

MartinMarciniszyn commented 4 days ago

The issue is fixed in the version of main that will be published tomorrow.

byshiue commented 4 days ago

Closing this issue because it is fixed in the latest main branch. Please let us know if you still encounter issues.