vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Discrepancy in vLLM and LoRA Adapter Scores with Different Package Versions #6800

Open pratcooper opened 3 months ago

pratcooper commented 3 months ago

Your current environment

Packages used for both finetuning and inference (vllm==0.3.2):

torch==2.1.2
accelerate==0.27.2
transformers==4.40.1
sentence_transformers==2.7.0

Description: With the above package versions, the vLLM scores do not match those of the LoRA adapter.

LoRA Scoring Code:

import torch

# Generate with the HF + LoRA model and keep per-step scores.
with torch.no_grad():
    generation_output = self.model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=max_new_tokens,
    )
s = generation_output.sequences[0]
output = self.tokenizer.decode(s, skip_special_tokens=True)
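For a like-for-like comparison it can help to look at per-token log-probabilities rather than only the decoded text. A minimal sketch of how they could be pulled from the generate output above, assuming a single input sequence and non-beam decoding (prompt_len and gen_token_ids are illustrative names, not part of the original code):

import torch

# generation_output.scores is a tuple with one [batch, vocab] logits tensor
# per generated step (available because output_scores=True above).
prompt_len = input_ids.shape[1]
gen_token_ids = generation_output.sequences[0][prompt_len:]

token_logprobs = []
for step, step_scores in enumerate(generation_output.scores):
    log_probs = torch.log_softmax(step_scores[0].float(), dim=-1)
    token_logprobs.append(log_probs[gen_token_ids[step]].item())

# Sequence log-probability under the HF + LoRA model.
print(sum(token_logprobs))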

vLLM Scoring Code:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load the base model with LoRA support enabled.
self._model = LLM(
    self._base_model_path,
    tensor_parallel_size=self.number_of_gpu,
    gpu_memory_utilization=self.gpu_memory_utilization,
    enable_lora=True,
)
prompts = self.prompter.generate_prompts(instructions, inputs)
sampling_params = SamplingParams(
    temperature=temperature,
    top_p=top_p,
    top_k=top_k,
    max_tokens=max_new_tokens,
    use_beam_search=use_beam_search,
    best_of=best_of,
)
adaptor_id = self.lora_adapters.get_adapter_id(adaptor_name)
adaptor_path = self.lora_adapters.get_adapter_path(adaptor_name)
# Generate with the LoRA adapter applied per request.
outputs = self._model.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest(adaptor_name, adaptor_id, adaptor_path),
)
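On the vLLM side, per-token information can be requested through SamplingParams so the two stacks can be compared on log-probabilities as well. A minimal sketch reusing the variables from the snippet above (the non-beam case; exact logprob structure varies slightly across vLLM versions):

sampling_params = SamplingParams(
    temperature=temperature,
    top_p=top_p,
    top_k=top_k,
    max_tokens=max_new_tokens,
    logprobs=1,  # return the log-prob of each sampled token
)
outputs = self._model.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest(adaptor_name, adaptor_id, adaptor_path),
)
for request_output in outputs:
    completion = request_output.outputs[0]
    print(completion.text)
    # Sequence log-probability under vLLM + LoRA.
    print(completion.cumulative_logprob)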

Observed Behavior: With the initial set of packages, the scoring results from vLLM and the LoRA adapter differ significantly. However, when the packages used for finetuning/scoring on the LoRA side are changed to the following:

torch: 2.0.0+cu117
transformers: 4.31.0
sentence-transformers: 2.2.2
accelerate: 0.20.3

the match rate between vLLM (0.3.2) and LoRA increases to over 99%.

Question: Is there any caching mechanism in the vLLM code that might be causing this discrepancy when different versions of torch, transformers, sentence-transformers, and accelerate are used? If so, how can we ensure consistent scoring results across different package versions?

An A100 GPU and CUDA version 10.0.1 are used for vLLM inference.
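To check whether the discrepancy comes from decoding settings rather than any caching behaviour, one option is to force greedy decoding on both stacks before comparing scores. A minimal sketch (GenerationConfig comes from transformers; the variable names follow the snippets above):

from transformers import GenerationConfig
from vllm import SamplingParams

# HF / LoRA side: disable sampling so generation is deterministic.
generation_config = GenerationConfig(do_sample=False, num_beams=1)

# vLLM side: temperature=0.0 is treated as greedy decoding.
sampling_params = SamplingParams(temperature=0.0, max_tokens=max_new_tokens)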

🐛 Describe the bug

Discrepancy in vLLM and LoRA Adapter Scores with Different Package Versions

github-actions[bot] commented 1 day ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!