Closed: takitsuba closed this issue 1 year ago
Sorry, it seems that the slow communication speed between GPUs (without NVLink) was the cause of this issue. I will try again after improving the speed. I apologize for the inconvenience.
@takitsuba I made some updates in llama2-wrapper==0.1.9.
```
File ~/Projects/test_llama2wrapper/.venv/lib/python3.11/site-packages/llama2_wrapper/model.py:363, in LLAMA2_WRAPPER.__call__(self, prompt, stream, max_new_tokens, temperature, top_p, top_k, repetition_penalty, **kwargs)
    361         return streamer
    362     else:
    363         output_ids = self.model.generate(
    364             **generate_kwargs,
    365         )
    366         output = self.tokenizer.decode(output_ids[0])
--> 367         return output.split("[/INST]")[1].split("</s>")[0]
```
It no longer splits the output text; instead it uses `output = self.tokenizer.decode(output_ids[0][prompt_tokens_len:], skip_special_tokens=True)` to skip the input prompt in the generation. It should be less flaky.
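For illustration, here is a minimal sketch of the difference; the checkpoint name, prompt, and `prompt_tokens_len` variable are assumptions for the example, not the exact wrapper internals:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any Llama-2 chat model behaves the same way here.
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "[INST] Hello, how are you? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
prompt_tokens_len = inputs["input_ids"].shape[1]

output_ids = model.generate(**inputs, max_new_tokens=64)

# Old behaviour: decode the full sequence, then split on the prompt delimiter.
old_output = tokenizer.decode(output_ids[0]).split("[/INST]")[1].split("</s>")[0]

# New behaviour (0.1.9): slice off the prompt tokens before decoding,
# so no string splitting is needed.
new_output = tokenizer.decode(
    output_ids[0][prompt_tokens_len:], skip_special_tokens=True
)
```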
The multi-GPU issue is still hard to investigate. Have you tried Hugging Face Text Generation Inference on this powerful device?
Thank you for your reply! I will try 0.1.9.
I also tried Hugging Face text generation and it failed. Since creating this issue, I have come to think that the fundamental cause of these problems is slow communication speed between GPUs 🙇 Below is the result of p2pBandwidthLatencyTest (GPU pairs 0-1, 2-3, 4-5, and 6-7 are connected with NVLink).
```
P2P=Enabled Latency (P2P Writes) Matrix (us)
 GPU        0         1         2         3         4         5         6         7
   0     2.51      2.47  49204.84  49204.84  49204.53  49204.35  49204.41  49203.94
   1     2.54      2.68  49204.94  49204.97  49204.91  49204.94  49204.95  49204.91
   2  49204.79  49204.79      2.34      2.45  49204.80  49204.79  49204.79  49204.80
   3  49204.95  49204.93      2.56      2.45  49204.96  49204.99  49204.96  49204.91
   4  49204.80  49204.90  49204.85  49204.82      2.42      2.42  49204.84  49204.80
   5  49204.77  49204.73  49204.70  49204.68      2.49      2.28  49204.75  49204.74
   6  49204.93  49204.83  49204.91  49204.89  49204.87  49204.88      2.27      2.41
   7  49204.91  49204.85  49204.85  49204.91  49204.84  49204.84      2.51      2.26
```
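As a quick cross-check from Python, PyTorch can report whether peer-to-peer access is enabled between device pairs and give a rough device-to-device copy timing; this is only a sketch, not a replacement for the CUDA p2pBandwidthLatencyTest:

```python
import time
import torch

num_gpus = torch.cuda.device_count()

# Report which device pairs have peer-to-peer access enabled.
for src in range(num_gpus):
    for dst in range(num_gpus):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: P2P access = {ok}")

# Rough device-to-device copy timing between GPU 0 and GPU 1.
if num_gpus >= 2:
    x = torch.randn(1024, 1024, device="cuda:0")
    torch.cuda.synchronize("cuda:0")
    start = time.perf_counter()
    for _ in range(100):
        _ = x.to("cuda:1")
    torch.cuda.synchronize("cuda:1")
    elapsed_us = (time.perf_counter() - start) / 100 * 1e6
    print(f"avg cuda:0 -> cuda:1 copy: {elapsed_us:.1f} us")
```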
I think my problem is not due to llama2-wrapper, so I would like to close this issue.
I cannot run Llama-2-70b-hf. The backend type is transformers. I tried to use multiple GPUs. If anyone knows how to solve this problem, please let me know.
Sample code
Error messages
Version
Notes
This huggingface/transformers issue may be related.
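For reference, the usual way to shard a 70B checkpoint across multiple GPUs with the transformers backend is device_map="auto"; this is a minimal sketch, assuming accelerate is installed and access to the gated checkpoint has been granted, with the dtype and max_new_tokens values chosen only for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes `accelerate` is installed and access to the gated checkpoint
# has been granted on the Hugging Face Hub.
model_id = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # illustrative dtype to fit the weights in GPU memory
    device_map="auto",          # shard layers across all visible GPUs
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```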