hao-ai-lab / LookaheadDecoding

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
https://arxiv.org/abs/2402.02057
Apache License 2.0

Multiple GPUs #47

Open qspang opened 8 months ago

qspang commented 8 months ago

When I tested llama2-70b on a single A800 GPU, I ran out of GPU memory. How should I write the command to test on two A800 GPUs instead? I tried this command: [screenshot of command] but an error was reported: [screenshot of error]

Viol2000 commented 8 months ago

Quick fix: at https://github.com/hao-ai-lab/LookaheadDecoding/blob/main/applications/eval_mtbench.py#L511, set DIST_WORKERS=0. I will do a refactor later to fix this thoroughly.
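(For reference, a sketch of the kind of change meant, assuming DIST_WORKERS is assigned near that line; the exact surrounding code may differ:)

```python
# applications/eval_mtbench.py, around the linked line -- surrounding code is an assumption
DIST_WORKERS = 0  # disable the distributed-worker path so the model loads in one process
```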

qspang commented 8 months ago

> Quick fix: at https://github.com/hao-ai-lab/LookaheadDecoding/blob/main/applications/eval_mtbench.py#L511, set DIST_WORKERS=0. I will do a refactor later to fix this thoroughly.

It is now running normally. Thanks to the author for the timely reply!

qspang commented 8 months ago

btw, do you have code for testing accuracy like the Medusa project? https://github.com/FasterDecoding/Medusa/blob/v1.0-prerelease/medusa/eval/heads_accuracy.py

Viol2000 commented 8 months ago

I did not implement such a function, but it should not be too hard: compute accuracy by dividing the number of accepted tokens per step by the number of speculated tokens per step.
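A minimal sketch of that metric, assuming you log per-step counts of speculated and accepted tokens from the decoding loop (the names here are hypothetical, not from this repo):

```python
def acceptance_rate(step_stats):
    """step_stats: list of (num_speculated, num_accepted) pairs, one per decoding step."""
    total_spec = sum(spec for spec, _ in step_stats)
    total_acc = sum(acc for _, acc in step_stats)
    return total_acc / total_spec if total_spec else 0.0

# e.g. three steps, 5 tokens speculated per step, 3/1/4 of them accepted:
print(acceptance_rate([(5, 3), (5, 1), (5, 4)]))  # 0.5333...
```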

qspang commented 8 months ago

It runs normally when USE_LADE is set to 1. But when I tested on multiple GPUs without LADE, it ran out of memory. Is something wrong somewhere? My execution command: [screenshot of command] The error is reported as follows: [screenshot of error]

Viol2000 commented 8 months ago

When USE_LADE is set to 0, please set --use-pp=1 to use Hugging Face's pipeline parallelism, or use DeepSpeed for tensor parallelism as in the script I provided: https://github.com/hao-ai-lab/LookaheadDecoding/blob/main/applications/run_mtbench.sh#L33
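A sketch of the two launch styles (the --use-pp flag and USE_LADE variable come from this thread; any other flags your eval_mtbench.py invocation needs are elided with `...`):

```bash
# Option 1: Hugging Face pipeline parallelism -- layers are sharded across GPUs
USE_LADE=0 python applications/eval_mtbench.py --use-pp=1 ...

# Option 2: DeepSpeed tensor parallelism across 2 GPUs
USE_LADE=0 deepspeed --num_gpus 2 applications/eval_mtbench.py --use-pp=0 ...
```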

qspang commented 8 months ago

Fixed it! Thank you for your patient reply! If I use a smaller model such as llama2-7b-chat and set USE_LADE to 0, what will the impact be if --use-pp is also set to 0?

Viol2000 commented 8 months ago

I guess it will be placed on a single GPU. Check https://github.com/hao-ai-lab/LookaheadDecoding/blob/main/applications/eval_mtbench.py#L214 and https://github.com/hao-ai-lab/LookaheadDecoding/blob/main/applications/eval_mtbench.py#L250 for the model placement configuration.
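For intuition, a sketch of the placement pattern such scripts typically follow with Hugging Face transformers (an illustration of the idea, not the exact code at those lines; the use_pp variable mirrors the --use-pp flag):

```python
import torch
from transformers import AutoModelForCausalLM

use_pp = 0  # mirrors --use-pp

if use_pp:
    # pipeline parallelism: shard the layers across all visible GPUs
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        torch_dtype=torch.float16,
        device_map="auto",
    )
else:
    # no sharding: the whole model goes onto a single GPU
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        torch_dtype=torch.float16,
    ).to("cuda:0")
```

A llama2-7b model in fp16 needs roughly 14 GB for its weights, so it fits comfortably on a single A800 either way.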

qspang commented 8 months ago

OK, got it!