hao-ai-lab / LookaheadDecoding

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
https://arxiv.org/abs/2402.02057
Apache License 2.0

Multiple GPUs #47

Open qspang opened 8 months ago

qspang commented 8 months ago

When I tested llama2-70b on a single A800 GPU, I ran out of GPU memory. How should I write the command to test on two A800 GPUs instead? I tried this command: [screenshot of command] but an error was reported: [screenshot of error]

Viol2000 commented 8 months ago

Quick fix: at https://github.com/hao-ai-lab/LookaheadDecoding/blob/main/applications/eval_mtbench.py#L511, set DIST_WORKERS=0. I will do a refactor later to fix this thoroughly.
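(For reference, a sketch of the kind of change meant, assuming DIST_WORKERS is assigned near that line; the exact surrounding code may differ:)

```python
# applications/eval_mtbench.py, around the linked line -- surrounding code is an assumption
DIST_WORKERS = 0  # disable the distributed-worker path so the model loads in one process
```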

qspang commented 8 months ago

> Quick fix: at https://github.com/hao-ai-lab/LookaheadDecoding/blob/main/applications/eval_mtbench.py#L511, set DIST_WORKERS=0. I will do a refactor later to fix this thoroughly.

It is now running normally. Thanks to the author for the timely reply!

qspang commented 8 months ago

btw, do you have code for testing accuracy like the Medusa project? https://github.com/FasterDecoding/Medusa/blob/v1.0-prerelease/medusa/eval/heads_accuracy.py

Viol2000 commented 8 months ago

I did not implement such a function, but it should not be too hard: compute accuracy by dividing the number of accepted tokens per step by the number of speculated tokens per step.
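A minimal sketch of that metric, assuming you log per-step counts of speculated and accepted tokens from the decoding loop (the names here are hypothetical, not from this repo):

```python
def acceptance_rate(step_stats):
    """step_stats: list of (num_speculated, num_accepted) pairs, one per decoding step."""
    total_spec = sum(spec for spec, _ in step_stats)
    total_acc = sum(acc for _, acc in step_stats)
    return total_acc / total_spec if total_spec else 0.0

# e.g. three steps, 5 tokens speculated per step, 3/1/4 of them accepted:
print(acceptance_rate([(5, 3), (5, 1), (5, 4)]))  # 0.5333...
```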

qspang commented 8 months ago

It runs normally when USE_LADE is set to 1. But when I tested on multiple GPUs without LADE, it ran out of memory. Is something wrong somewhere? My execution command: [screenshot of command] The error is reported as follows: [screenshot of error]

Viol2000 commented 8 months ago

When USE_LADE is set to 0, please set --use-pp=1 to use Hugging Face's pipeline parallelism, or use DeepSpeed for tensor parallelism as in the script I provided: https://github.com/hao-ai-lab/LookaheadDecoding/blob/main/applications/run_mtbench.sh#L33
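A sketch of the two launch styles (the --use-pp flag and USE_LADE variable come from this thread; any other flags your eval_mtbench.py invocation needs are elided with `...`):

```bash
# Option 1: Hugging Face pipeline parallelism -- layers are sharded across GPUs
USE_LADE=0 python applications/eval_mtbench.py --use-pp=1 ...

# Option 2: DeepSpeed tensor parallelism across 2 GPUs
USE_LADE=0 deepspeed --num_gpus 2 applications/eval_mtbench.py --use-pp=0 ...
```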

qspang commented 8 months ago

Fixed it! Thank you for your patient reply! If I use a smaller model such as llama2-7b-chat and set USE_LADE to 0, what will the impact be if --use-pp is also set to 0?

Viol2000 commented 8 months ago

I guess it will be placed on a single GPU. Check https://github.com/hao-ai-lab/LookaheadDecoding/blob/main/applications/eval_mtbench.py#L214 and https://github.com/hao-ai-lab/LookaheadDecoding/blob/main/applications/eval_mtbench.py#L250 for the model placement configuration.
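For intuition, a sketch of the placement pattern such scripts typically follow with Hugging Face transformers (an illustration of the idea, not the exact code at those lines; the use_pp variable mirrors the --use-pp flag):

```python
import torch
from transformers import AutoModelForCausalLM

use_pp = 0  # mirrors --use-pp

if use_pp:
    # pipeline parallelism: shard the layers across all visible GPUs
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        torch_dtype=torch.float16,
        device_map="auto",
    )
else:
    # no sharding: the whole model goes onto a single GPU
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        torch_dtype=torch.float16,
    ).to("cuda:0")
```

A llama2-7b model in fp16 needs roughly 14 GB for its weights, so it fits comfortably on a single A800 either way.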

qspang commented 8 months ago

OK, got it!