Closed: orellavie1212 closed this issue 1 month ago
Update: it seems the problem does not lie in TP itself, but in the GPTQ Marlin kernels. The plain GPTQ kernels work great (passing quantization='gptq' explicitly, instead of letting vLLM pick gptq_marlin by default when nothing is specified).
Thanks for reporting the issue, we will investigate
I hope you saw my update just now: after a deep dive into the TP code without finding anything meaningful, switching from the Marlin kernels to the regular GPTQ kernels (quantization='gptq') works great with TP of 4.
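For anyone hitting the same behavior, here is a minimal sketch of that workaround (the model name and TP size are taken from the repro below; everything else is illustrative):

```python
from vllm import LLM

# Passing quantization="gptq" explicitly forces the plain GPTQ kernels,
# bypassing the gptq_marlin kernels that vLLM otherwise auto-selects
# for GPTQ checkpoints when no quantization method is given.
llm = LLM(
    model="TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ",
    quantization="gptq",
    tensor_parallel_size=4,
)
```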
Yeah, something is off with marlin here
@alexm-neuralmagic
@orellavie1212 could you try this PR https://github.com/vllm-project/vllm/pull/6267 and see if it fixes your issue.
@orellavie1212 would be helpful if you could verify the above PR fixes your issue before we actually land it.
Sorry, I didn't get any update via email as I usually do; I hope to check soon.
@orellavie1212 thanks, keep us updated
Hey, do you want me to test it from your repo (install vLLM from the branch you forked), or apply the changes to vLLM main (merge them locally into my fork and install that, to be sure vLLM in general works with the code)? I noticed your branch is based on v0.5.0, not v0.5.1. Which do you prefer?
I think you can apply the patch to your latest local vllm (since it is basically 3-4 lines of code)
Checked with this fork of today's vLLM main (f7160d946a0a07703e72d81ba9ecf3913f192605), https://github.com/orellavie1212/vllm.git, combined with https://github.com/neuralmagic/nm-vllm/blob/marlin_bug_2/csrc/quantization/gptq_marlin/gptq_marlin.cu.
Some of the answers are fixed (although it now gives different answers compared to GPTQ without Marlin, which used to match GPTQ Marlin with TP of 1 rather than TP of 4). Some answers are still broken, like the one in the example: it replaces a whole sentence with a single word or letter. I don't know if you can call it fixed :( Update: checked the same fork with GPTQ (but without Marlin), and it works great.
Could you please try gpu_memory_utilization=0.9 and see if it helps?
0.9 runs out of memory; 0.7 is the max I could get.
@orellavie1212 try this PR https://github.com/vllm-project/vllm/pull/6795
Your current environment
SageMaker ml.g5.12xlarge instance (4x A10G, 24 GB each). Container: 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121, from https://github.com/aws/deep-learning-containers/blob/master/available_images.md
🐛 Describe the bug
```python
from vllm import LLM, SamplingParams

question = "what is the id of the team and what is the subtitute lineup of the home team for the match?"
history = str(["how many games the home team Sevilla won?"])
full_example = f""" \n \n\nYou are a transformation helper specialist, based on the history helping in transforming user input\nto a more structured and simpler text to a smaller model, which is less smart as you. \n\nMost of the times, the history could help you about entities which are now missing \nfrom the question\nTo illustrate the mission, if the user asked in the history about an entity (like \'Barcelona\'), and \nnow he asked about \'team\' (could be team, player, or other entity) or it seems to you that the an entity \nis missing in the context, perhaps the entity (\'Barcelona\') from the history could be the option to fill the gap. \n\nIf there is no entity in the history, please do not hallucinate and offer weird entity, for example if in the history\nyou saw \'home team\' and now he just mentioned \'team\', replace \'team\' with \'home team\' (applicable for away team too).\n\nWhen a replacement is occurred, please do not add \'the\' as part of the entity, just entity itself.\n\n \nReturn it in valid JSON format according to the schema\n \n\n User question: \n\n {question} \n\n History: \n\n {history}"""

prompts = [full_example]
# SamplingParams takes max_tokens (max_new_tokens is not a valid argument).
sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=256)

# quantization is deliberately not set, so vLLM picks the Marlin kernels
# instead of standard GPTQ (if available).
llm = LLM(
    model="TechxGenus/Meta-Llama-3-70B-Instruct-GPTQ",
    tokenizer_mode="auto",
    gpu_memory_utilization=0.7,
    guided_decoding_backend="lm-format-enforcer",
    tensor_parallel_size=4,
)

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
The output for version 0.4.1 is correct: "what is the id of Sevilla and what is the substitute lineup of Sevilla for the match?"
For all versions after that (0.4.2 and above) it returns weird answers that make no sense, like "1900 1900 1900 1900" and strange tokens.
Notes: 1. As a side check I ran the exact same thing with TechxGenus/Meta-Llama-3-8B-Instruct-AWQ, and it worked. My suspicion is that the problem shows up when the model is larger than a single GPU's memory (24 GB is not enough for the 70B GPTQ model, unlike the 8B AWQ one), so TP is actually needed; maybe in v0.4.2 and onward something changed in the tensor parallel code.
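As a rough back-of-the-envelope check of that suspicion (approximate numbers only, ignoring activations, KV cache, and runtime overhead):

```python
# 4-bit quantized weights (GPTQ/AWQ) take roughly 0.5 bytes per parameter.
GB = 1e9
print(70e9 * 0.5 / GB)      # ~35 GB for the 70B model -> cannot fit on one 24 GB A10G
print(70e9 * 0.5 / 4 / GB)  # ~8.75 GB per GPU with TP=4 -> tensor parallelism must work
print(8e9 * 0.5 / GB)       # ~4 GB for the 8B model -> fits on a single GPU
```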
If any other data is needed, just comment and mention me. Thanks.