intel / llm-on-ray

Pretrain, finetune and serve LLMs on Intel platforms with Ray
Apache License 2.0

OpenAI API does not allow temperature=0.0 for llama-2-7b-chat-hf #139

Open yutianchen666 opened 3 months ago

yutianchen666 commented 3 months ago

When running the llama-2-7b-chat-hf model through the OpenAI API on gsm8k (a mathematical reasoning benchmark), temperature must be set to 0.0.

But I get an unexpected error like this:

```
lm_eval --model local-chat-completions --model_args model=llama-2-7b-chat-hf,base_url=http://localhost:8000/v1 --task gsm8k

2024-03-12:16:09:56,344 INFO [__main__.py:225] Verbosity set to INFO
2024-03-12:16:09:56,344 INFO [__init__.py:373] lm_eval.tasks.initialize_tasks() is deprecated and no longer necessary. It will be removed in v0.4.2 release. TaskManager will instead be used.
2024-03-12:16:10:01,070 INFO [__main__.py:311] Selected Tasks: ['gsm8k']
2024-03-12:16:10:01,070 INFO [__main__.py:312] Loading selected tasks...
2024-03-12:16:10:01,075 INFO [evaluator.py:129] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-03-12:16:10:01,419 INFO [evaluator.py:190] get_task_dict has been updated to accept an optional argument, task_manager. Read more here: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md#external-library-usage
2024-03-12:16:10:17,655 INFO [task.py:395] Building contexts for gsm8k on rank 0...
100%|██████████| 1319/1319 [00:06<00:00, 192.73it/s]
2024-03-12:16:10:24,524 INFO [evaluator.py:357] Running generate_until requests
  0%|
2024-03-12:16:11:08,170 INFO [_client.py:1026] HTTP Request: POST http://localhost:8000/v1/chat/completions "HTTP/1.1 500 Internal Server Error"
2024-03-12:16:11:08,171 INFO [_base_client.py:952] Retrying request to /chat/completions in 0.788895 seconds
2024-03-12:16:11:09,010 INFO [_client.py:1026] HTTP Request: POST http://localhost:8000/v1/chat/completions "HTTP/1.1 500 Internal Server Error"
2024-03-12:16:11:09,011 INFO [_base_client.py:952] Retrying request to /chat/completions in 1.621023 seconds
2024-03-12:16:11:10,683 INFO [_client.py:1026] HTTP Request: POST http://localhost:8000/v1/chat/completions "HTTP/1.1 500 Internal Server Error"
Traceback (most recent call last):
  File "/home/yutianchen/Project/lm-evaluation-harness/lm_eval/models/utils.py", line 333, in wrapper
    return func(*args, **kwargs)
  File "/home/yutianchen/Project/lm-evaluation-harness/lm_eval/models/openai_completions.py", line 75, in completion
    return client.chat.completions.create(**kwargs)
  File "/home/yutianchen/anaconda3/envs/llm-eval/lib/python3.9/site-packages/openai/_utils/_utils.py", line 303, in wrapper
    return func(*args, **kwargs)
  File "/home/yutianchen/anaconda3/envs/llm-eval/lib/python3.9/site-packages/openai/resources/chat/completions.py", line 598, in create
    return self._post(
  File "/home/yutianchen/anaconda3/envs/llm-eval/lib/python3.9/site-packages/openai/_base_client.py", line 1088, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
  File "/home/yutianchen/anaconda3/envs/llm-eval/lib/python3.9/site-packages/openai/_base_client.py", line 853, in request
    return self._request(
  File "/home/yutianchen/anaconda3/envs/llm-eval/lib/python3.9/site-packages/openai/_base_client.py", line 916, in _request
    return self._retry_request(
  File "/home/yutianchen/anaconda3/envs/llm-eval/lib/python3.9/site-packages/openai/_base_client.py", line 958, in _retry_request
    return self._request(
  File "/home/yutianchen/anaconda3/envs/llm-eval/lib/python3.9/site-packages/openai/_base_client.py", line 916, in _request
    return self._retry_request(
  File "/home/yutianchen/anaconda3/envs/llm-eval/lib/python3.9/site-packages/openai/_base_client.py", line 958, in _retry_request
    return self._request(
  File "/home/yutianchen/anaconda3/envs/llm-eval/lib/python3.9/site-packages/openai/_base_client.py", line 930, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.InternalServerError: Error code: 500 - {'generated_text': None, 'num_input_tokens': None, 'num_input_tokens_batch': None, 'num_generated_tokens': None, 'num_generated_tokens_batch': None, 'preprocessing_time': None, 'generation_time': None, 'timestamp': 1710259870.67887, 'finish_reason': None, 'error': {'object': 'error', 'message': 'Internal Server Error', 'internal_message': 'Internal Server Error', 'type': 'InternalServerError', 'param': {}, 'code': 500}}
```
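For reference, the same 500 error can be reproduced without lm-eval by calling the endpoint directly. Below is a minimal reproduction sketch (not from the original report) using the openai Python SDK, assuming the server from the log above is still serving llama-2-7b-chat-hf at http://localhost:8000/v1:

```python
# Minimal reproduction sketch. Assumptions: the llm-on-ray OpenAI-compatible
# server runs at http://localhost:8000/v1; the api_key value is a placeholder
# since the local server does not validate it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not_needed")

resp = client.chat.completions.create(
    model="llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "What is 12 * 7?"}],
    temperature=0.0,  # this value triggers the 500 Internal Server Error above
)
print(resp.choices[0].message.content)
```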

The error is similar when testing temperature=0.0 with llm-on-ray's own query_openai_sdk.py example:

```
python examples/inference/api_server_openai/query_openai_sdk.py --model_name llama-2-7b-chat-hf --temperature 0.0

Traceback (most recent call last):
  File "/home/yutianchen/Project/latest_lib/llm-on-ray/examples/inference/api_server_openai/query_openai_sdk.py", line 98, in <module>
    for i in chunk_chat():
  File "/home/yutianchen/Project/latest_lib/llm-on-ray/examples/inference/api_server_openai/query_openai_sdk.py", line 75, in chunk_chat
    output = client.chat.completions.create(
  File "/home/yutianchen/anaconda3/envs/llm-on-ray/lib/python3.9/site-packages/openai/_utils/_utils.py", line 275, in wrapper
    return func(*args, **kwargs)
  File "/home/yutianchen/anaconda3/envs/llm-on-ray/lib/python3.9/site-packages/openai/resources/chat/completions.py", line 663, in create
    return self._post(
  File "/home/yutianchen/anaconda3/envs/llm-on-ray/lib/python3.9/site-packages/openai/_base_client.py", line 1200, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
  File "/home/yutianchen/anaconda3/envs/llm-on-ray/lib/python3.9/site-packages/openai/_base_client.py", line 889, in request
    return self._request(
  File "/home/yutianchen/anaconda3/envs/llm-on-ray/lib/python3.9/site-packages/openai/_base_client.py", line 965, in _request
    return self._retry_request(
  File "/home/yutianchen/anaconda3/envs/llm-on-ray/lib/python3.9/site-packages/openai/_base_client.py", line 1013, in _retry_request
    return self._request(
  File "/home/yutianchen/anaconda3/envs/llm-on-ray/lib/python3.9/site-packages/openai/_base_client.py", line 965, in _request
    return self._retry_request(
  File "/home/yutianchen/anaconda3/envs/llm-on-ray/lib/python3.9/site-packages/openai/_base_client.py", line 1013, in _retry_request
    return self._request(
  File "/home/yutianchen/anaconda3/envs/llm-on-ray/lib/python3.9/site-packages/openai/_base_client.py", line 980, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.InternalServerError: Error code: 500 - {'generated_text': None, 'num_input_tokens': None, 'num_input_tokens_batch': None, 'num_generated_tokens': None, 'num_generated_tokens_batch': None, 'preprocessing_time': None, 'generation_time': None, 'timestamp': 1710260245.4014304, 'finish_reason': None, 'error': {'object': 'error', 'message': 'Internal Server Error', 'internal_message': 'Internal Server Error', 'type': 'InternalServerError', 'param': {}, 'code': 500}}
```

Both the LLaMA (https://github.com/facebookresearch/llama/issues/687) and Transformers (https://github.com/huggingface/transformers/pull/25722) maintainers suggest setting do_sample=False when temperature=0.

But the OpenAI API's client.chat.completions.create(**kwargs) does not support a do_sample parameter, so there is no suitable argument on the client side to work around temperature=0.0.
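So the accommodation probably has to happen on the server side. As a rough illustration (a hypothetical helper, not llm-on-ray's actual code), the serving layer could map an OpenAI-style temperature of 0 to greedy decoding before building the Transformers generate() arguments:

```python
from typing import Optional

# Hypothetical helper (illustration only): translate OpenAI-style sampling
# parameters into Hugging Face generate() kwargs, treating temperature == 0
# as a request for greedy decoding instead of forwarding 0.0 to the sampler.
def to_generate_kwargs(temperature: Optional[float], top_p: Optional[float] = None) -> dict:
    if temperature is None:
        return {}  # leave the model's generation defaults untouched
    if temperature <= 1e-5:
        # Greedy decoding: drop the sampling knobs entirely so the backend
        # never divides logits by a (near-)zero temperature.
        return {"do_sample": False}
    kwargs = {"do_sample": True, "temperature": temperature}
    if top_p is not None:
        kwargs["top_p"] = top_p
    return kwargs
```

With a mapping like this, temperature=0.0 from an OpenAI client would become do_sample=False on the Transformers side, which matches what the LLaMA and Transformers threads above recommend.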

KepingYan commented 3 months ago

Please also paste the server log.