Closed by markmc 1 month ago
From @DarkLight1337:
The quantization test is failing on main, starting somewhere between #17826 and #18158.
#17945 is the first PR where it failed. cc @Chen Zhang and @WoosukKwon
From @heheda12345:
https://github.com/vllm-project/vllm/pull/18298 is a minimal reproduction of the CI failure, built on top of the commit before #17945. It seems to also trigger a previous quantization bug like #18147. @mgoin, do you have any idea this time?
I thought I had an idea on this one. I didn't expect a request with `temperature=1.0` to be deterministic even with `seed` set, so I tried removing that part of the test, but it didn't fix it; I'm still seeing some differences in output.
diff --git a/tests/utils.py b/tests/utils.py
index bf38d7843..09285fdb3 100644
--- a/tests/utils.py
+++ b/tests/utils.py
@@ -196,6 +196,7 @@ def _test_completion(
model: str,
prompt: str,
token_ids: list[int],
+ deterministic: bool = False,
):
results = []
@@ -227,36 +228,37 @@ def _test_completion(
"usage": completion.usage,
})
- # test seeded random sampling
- completion = client.completions.create(model=model,
- prompt=prompt,
- max_tokens=5,
- seed=33,
- temperature=1.0)
-
- results.append({
- "test": "seeded_sampling",
- "text": completion.choices[0].text,
- "finish_reason": completion.choices[0].finish_reason,
- "usage": completion.usage,
- })
-
- # test seeded random sampling with multiple prompts
- completion = client.completions.create(model=model,
- prompt=[prompt, prompt],
- max_tokens=5,
- seed=33,
- temperature=1.0)
+ if not deterministic:
+ # test seeded random sampling
+ completion = client.completions.create(model=model,
+ prompt=prompt,
+ max_tokens=5,
+ seed=33,
+ temperature=1.0)
+
+ results.append({
+ "test": "seeded_sampling",
+ "text": completion.choices[0].text,
+ "finish_reason": completion.choices[0].finish_reason,
+ "usage": completion.usage,
+ })
- results.append({
- "test":
- "seeded_sampling",
- "text": [choice.text for choice in completion.choices],
- "finish_reason":
- [choice.finish_reason for choice in completion.choices],
- "usage":
- completion.usage,
- })
+ # test seeded random sampling with multiple prompts
+ completion = client.completions.create(model=model,
+ prompt=[prompt, prompt],
+ max_tokens=5,
+ seed=33,
+ temperature=1.0)
+
+ results.append({
+ "test":
+ "seeded_sampling",
+ "text": [choice.text for choice in completion.choices],
+ "finish_reason":
+ [choice.finish_reason for choice in completion.choices],
+ "usage":
+ completion.usage,
+ })
# test simple list
batch = client.completions.create(
@@ -543,7 +545,11 @@ def compare_all_settings(model: str,
})
if method == "generate":
- results += _test_completion(client, model, prompt, token_ids)
+ results += _test_completion(client,
+ model,
+ prompt,
+ token_ids,
+ deterministic=True)
elif method == "generate_close":
results += _test_completion_close(client, model, prompt)
elif method == "generate_chat":
Resolved by #18459
Your current environment
Failing on main as of commit 9609327fa4
🐛 Describe the bug
Failing tests:
Logs
```
quantization/test_torchao.py::test_opt_125m_int4wo_model_loading_with_params[cuda:0] SKIPPED
quantization/test_torchao.py::test_opt_125m_int4wo_model_per_module_quant SKIPPED
=================================== FAILURES ===================================
_ test_load_8bit_bnb_model[meta-llama/Llama-Guard-3-8B-INT8-read pre-quantized llama 8-bit model] _
args = ()
kwargs = {'description': 'read pre-quantized llama 8-bit model', 'example_prompts': ['vLLM is a high-throughput and memory-effi...odels.\n', ...], 'hf_runner':