Closed: orellavie1212 closed this issue 1 month ago
Update: it seems the problem does not lie in TP itself, but in the GPTQ Marlin kernels. The plain GPTQ kernels work great (passing quantization='gptq' explicitly, instead of letting vLLM pick gptq_marlin by default when nothing is specified).
Thanks for reporting the issue, we will investigate
I hope you saw my update just now: after a deep dive into the TP code without finding anything meaningful, switching from the Marlin kernels to the regular GPTQ kernels (quantization='gptq') works great with TP of 4.
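For anyone hitting the same behavior, here is a minimal sketch of that workaround (the model name and TP size are taken from the repro below; everything else is illustrative):

```python
from vllm import LLM

# Passing quantization="gptq" explicitly forces the plain GPTQ kernels,
# bypassing the gptq_marlin kernels that vLLM otherwise auto-selects
# for GPTQ checkpoints when no quantization method is given.
llm = LLM(
    model="TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ",
    quantization="gptq",
    tensor_parallel_size=4,
)
```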
Yeah, something is off with marlin here
@alexm-neuralmagic
@orellavie1212 could you try this PR https://github.com/vllm-project/vllm/pull/6267 and see if it fixes your issue.
@orellavie1212 would be helpful if you could verify the above PR fixes your issue before we actually land it.
Sorry, I didn't get any update via email as I usually do; I hope to check soon.
@orellavie1212 thanks, keep us updated
Hey, do you want me to test it from your repo (install vLLM from the branch you forked), or apply the changes to vLLM main (merge them locally into my fork and install that, to be sure vLLM in general works with the code)? I noticed your branch is based on v0.5.0, not v0.5.1. Which do you prefer?
I think you can apply the patch to your latest local vllm (since it is basically 3-4 lines of code)
Checked with this fork of today's vLLM main (f7160d946a0a07703e72d81ba9ecf3913f192605), https://github.com/orellavie1212/vllm.git, combined with https://github.com/neuralmagic/nm-vllm/blob/marlin_bug_2/csrc/quantization/gptq_marlin/gptq_marlin.cu.
Some of the answers are fixed (although it now gives different answers compared to GPTQ without Marlin, which used to match GPTQ Marlin with TP of 1 rather than TP of 4). Some answers are still broken, like the one in the example: it replaces a whole sentence with a single word or letter. I don't know if you can call it fixed :( Update: checked the same fork with GPTQ (but without Marlin), and it works great.
Could you please try gpu_memory_utilization=0.9 and see if it helps?
0.9 runs out of memory; 0.7 is the max I could get.
@orellavie1212 try this PR https://github.com/vllm-project/vllm/pull/6795
Your current environment
SageMaker ml.g5.12xlarge instance (4x A10G, 24 GB each). Container: 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121, from https://github.com/aws/deep-learning-containers/blob/master/available_images.md
🐛 Describe the bug
```python
from vllm import LLM, SamplingParams

question = "what is the id of the team and what is the subtitute lineup of the home team for the match?"
history = str(["how many games the home team Sevilla won?"])
full_example = f""" \n \n\nYou are a transformation helper specialist, based on the history helping in transforming user input\nto a more structured and simpler text to a smaller model, which is less smart as you. \n\nMost of the times, the history could help you about entities which are now missing \nfrom the question\nTo illustrate the mission, if the user asked in the history about an entity (like \'Barcelona\'), and \nnow he asked about \'team\' (could be team, player, or other entity) or it seems to you that the an entity \nis missing in the context, perhaps the entity (\'Barcelona\') from the history could be the option to fill the gap. \n\nIf there is no entity in the history, please do not hallucinate and offer weird entity, for example if in the history\nyou saw \'home team\' and now he just mentioned \'team\', replace \'team\' with \'home team\' (applicable for away team too).\n\nWhen a replacement is occurred, please do not add \'the\' as part of the entity, just entity itself.\n\n \nReturn it in valid JSON format according to the schema\n \n\n User question: \n\n {question} \n\n History: \n\n {history}"""

prompts = [full_example]
# SamplingParams takes max_tokens (max_new_tokens is not a valid argument).
sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=256)

# quantization is deliberately not set, so vLLM picks the Marlin kernels
# instead of standard GPTQ (if available).
llm = LLM(
    model="TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ",
    tokenizer_mode="auto",
    gpu_memory_utilization=0.7,
    guided_decoding_backend="lm-format-enforcer",
    tensor_parallel_size=4,
)

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
The output for version 0.4.1 is correct: "what is the id of Sevilla and what is the substitute lineup of Sevilla for the match?"
For all versions after that (0.4.2 and above) it returns weird answers that make no sense, like "1900 1900 1900 1900" and strange tokens.
Notes: 1. As a side check I ran the exact same thing with TechxGenus/Meta-Llama-3-8B-Instruct-AWQ, and it worked. My suspicion is that the problem shows up when the model is larger than a single GPU's memory (24 GB is not enough for the 70B GPTQ model, unlike the 8B AWQ one), so TP is actually needed; maybe in v0.4.2 and onward something changed in the tensor parallel code.
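As a rough back-of-the-envelope check of that suspicion (approximate numbers only, ignoring activations, KV cache, and runtime overhead):

```python
# 4-bit quantized weights (GPTQ/AWQ) take roughly 0.5 bytes per parameter.
GB = 1e9
print(70e9 * 0.5 / GB)      # ~35 GB for the 70B model -> cannot fit on one 24 GB A10G
print(70e9 * 0.5 / 4 / GB)  # ~8.75 GB per GPU with TP=4 -> tensor parallelism must work
print(8e9 * 0.5 / GB)       # ~4 GB for the 8B model -> fits on a single GPU
```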
If any other data is needed, just comment and mention me. Thanks.