hao-ai-lab / LookaheadDecoding

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
https://arxiv.org/abs/2402.02057
Apache License 2.0

How to use LADE in a single-node, multi-process setup? #57

Open sjrrr13 opened 5 months ago

sjrrr13 commented 5 months ago

I tried to launch LADE in distributed mode with:

CUDA_VISIBLE_DEVICES=0,1,2,3 \
USE_LADE=1 LOAD_LADE=1 DIST_WORKERS=4 \
python -m torch.distributed.launch minimal.py

However, when I monitor GPU usage with watch nvidia-smi, only GPU 0 is used. I want to run Llama-2-70b-hf, which cannot be loaded on a single GPU. What can I do to use all the GPUs? Is there a problem with my launch command?

Viol2000 commented 4 months ago

minimal.py does not support single-node multi-process execution. Please check applications/run_mtbench.sh for examples. Thank you!

Viol2000 commented 4 months ago

Maybe minimal.py can support this as well. Please set torch_device="auto" in the code, do not set DIST_WORKERS=4, and just run python minimal.py
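
For reference, here is a rough sketch of what the torch_device="auto" suggestion amounts to. It assumes minimal.py forwards torch_device to the device_map argument of Hugging Face's from_pretrained, which is not confirmed in this thread. The helper names load_kwargs, load_model and layers_per_device, and the model id, are hypothetical and only for illustration. With device_map="auto", a single Python process lets Accelerate spread the layers over every visible GPU, so torch.distributed.launch is not needed just to fit the model.

```python
from collections import Counter


def load_kwargs(torch_device="auto"):
    # Assumption: minimal.py passes torch_device through as device_map.
    # "auto" asks Accelerate to place layers across all visible GPUs.
    return {"device_map": torch_device}


def load_model(model_name, torch_device="auto"):
    # Heavy imports are kept local so the helpers can be used without a GPU.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, **load_kwargs(torch_device)
    )


def layers_per_device(model):
    # hf_device_map is set on models loaded with a device_map; it maps
    # module names to the device (GPU index, "cpu" or "disk") holding them.
    return dict(Counter(model.hf_device_map.values()))


if __name__ == "__main__":
    # Hypothetical model id; run as a single process: python this_script.py
    model = load_model("meta-llama/Llama-2-70b-hf")
    print(layers_per_device(model))
```

Printing layers_per_device(model) is a quicker check than watching nvidia-smi: if only one device shows up in the result, the model was not sharded.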