lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

slurm clustering #166

Open DTDwind opened 1 year ago

DTDwind commented 1 year ago

Hello,

Is it possible to implement model parallelism on a Slurm cluster?

If so, could you provide some guidance on how to configure it?

If not, would it be possible for you to consider adding support for this technology to the clustering system?

I believe clustering is one of the important technologies for addressing GPU memory limitations.

zhisbug commented 1 year ago

I have some internal code for running the training on Slurm. I'll try to commit them in a PR later.

surak commented 1 year ago

@DTDwind I have been using FastChat on Slurm for a few months, but only with one node per model, so I can't run Falcon-180B, for example. Do you still need help with this issue? If so, I can show you how I do it.

If you manage to run multi-node, please let me know; I have a couple thousand GPUs here waiting :-)

DTDwind commented 1 year ago

@surak Hi~ I'm currently not actively working on this project, but I'm still exploring various possibilities with Slurm. I believe your insights and tips on using Slurm will be very helpful for me, so I would welcome and appreciate it if you could share how you do it.

surak commented 1 year ago

@DTDwind I do like this:

I have a general settings file, where I set the controller host and a couple of settings I need for every model:

I call the whole thing "Blablador", so you will see this word a lot. I also skip the pip install -e . step: running the Python scripts directly from the checkout means I can change things and see the results right away, which is why you will see me calling the scripts directly.
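
A rough sketch of what running from the checkout looks like (the PYTHONPATH line is just one way to make the fastchat package importable without installing it, and BLABLADOR_DIR comes from the config below; adjust to however your venv is set up):

# Hypothetical: make the checkout importable without installing the package
export PYTHONPATH="$BLABLADOR_DIR:$PYTHONPATH"
python3 $BLABLADOR_DIR/fastchat/serve/controller.py --help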

config-blablador.sh

# Get the full path of the directory where this script is located
cd "$( dirname -- "$0" )"
export BLABLADOR_DIR="$(pwd)/FastChat"
export LOGDIR=$BLABLADOR_DIR/logs
export NCCL_P2P_DISABLE=1 # 3090s do not support P2P
export BLABLADOR_CONTROLLER=http://compute-node1.local
export BLABLADOR_CONTROLLER_PORT=21001

For the controller, it's just a normal invocation on compute node 1, no Slurm:

controller.sh

#!/bin/bash

source config-blablador.sh # sets BLABLADOR_DIR and LOGDIR
cd $BLABLADOR_DIR
source sc_venv_template/activate.sh

# I can leave it open to 0.0.0.0 as this host is not reachable from the internet
python3 fastchat/serve/controller.py --host 0.0.0.0
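
A quick sanity check before starting any workers (this assumes the controller still exposes the /list_models endpoint, which is a POST route in current FastChat; adjust if your version differs):

# Ask the controller which model workers have registered so far
source config-blablador.sh
curl -s -X POST "$BLABLADOR_CONTROLLER:$BLABLADOR_CONTROLLER_PORT/list_models"
# Expect a small JSON object; the model list stays empty until workers come up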

The web server is also a shell script, running on another host that is likewise only visible on our intranet:

web.sh

#!/bin/bash

source config-blablador.sh
cd $BLABLADOR_DIR
source sc_venv_template/activate.sh

python3 fastchat/serve/gradio_web_server.py \
        --share \
        --model-list-mode=reload \
        --host 0.0.0.0
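
One caveat, depending on your FastChat version (this is an assumption about the defaults, so double-check): gradio_web_server.py talks to http://localhost:21001 unless you pass --controller-url, so if the web host is not the controller host you may need something like:

python3 fastchat/serve/gradio_web_server.py \
        --controller-url $BLABLADOR_CONTROLLER:$BLABLADOR_CONTROLLER_PORT \
        --share \
        --model-list-mode=reload \
        --host 0.0.0.0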

The OpenAI API server can also run in any VM:

api.sh

#!/bin/bash

source config-blablador.sh
cd $BLABLADOR_DIR
source sc_venv_template/activate.sh

python3 fastchat/serve/openai_api_server.py --host 0.0.0.0 --port 8000
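
Once that is up, any OpenAI-compatible client can talk to it. A quick smoke test with curl, run on the same VM (the model name is just an example; use whatever your workers registered):

# List the models the API server can see
curl http://localhost:8000/v1/models

# Minimal chat completion request
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Marcoroni-70B", "messages": [{"role": "user", "content": "Hello!"}]}'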

And this is an example of one of the models I'm running. This one needs 7 GPUs:

marcoroni-70.slurm

#!/bin/bash
#SBATCH --job-name=Marcoroni-70B
#SBATCH --output=/data/blablador/logs/%j.txt
#SBATCH --error=/data/blablador/logs/%j.txt
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=100:00:00 # I like to release the machine every 4 days - gives me time to reevaluate which models I run
#SBATCH --gres=gpu:7

echo "I AM ON "$(hostname) " running Marcoroni-70B on 7 gpus"

export BLABLADOR_DIR="/data/FastChat" # gotta hardcode it here unfortunately
source $BLABLADOR_DIR/config-blablador.sh

cd $BLABLADOR_DIR
source $BLABLADOR_DIR/sc_venv_template/activate.sh

srun python3 $BLABLADOR_DIR/fastchat/serve/model_worker.py \
     --controller $BLABLADOR_CONTROLLER:$BLABLADOR_CONTROLLER_PORT \
     --port 31028 --worker http://$(hostname):31028 \
     --num-gpus 7 \
     --host 0.0.0.0 \
     --model-path /data/FastChat/models/Marcoroni-70B

Every model runs on any compute node, and they can freely talk to the controller. I can fit up to 8 small models on a node with 8 GPUs. I don't share GPUs among jobs on my clusters. What I pay attention to is that each model has a different port: I just add a new port for each new model I add here, so I guess I have 28 of them by now :-)
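
In case it helps, this is roughly the workflow for bringing one more model up (the file name and paths are just the example from above):

sbatch marcoroni-70.slurm            # submit the worker job
squeue -u $USER                      # wait until it is running
# The worker log under /data/blablador/logs/<jobid>.txt should show it
# registering with the controller; after that the model appears in the
# web UI (thanks to --model-list-mode=reload) and in the OpenAI API model list.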

Hope it helps!

DTDwind commented 1 year ago

Thank you for sharing, they are very helpful.

surak commented 1 year ago

I will close this one, then, so we have fewer issues to worry about, ok? Feel free to either contact me directly or reopen this one if you feel it’s not answered well enough.

surak commented 10 months ago

> I have some internal code for running the training on Slurm. I'll try to commit them in a PR later.

Where is it? I can't run it multi-gpu, much less multi-node.