DTDwind opened this issue 1 year ago
I have some internal code for running the training on Slurm. I'll try to commit it in a PR later.
@DTDwind I have been using FastChat on Slurm for a few months, but only with one node per model, so I can't run Falcon-180B, for example. Do you still need help with this issue? If yes, I can show you how I do it.
If you manage to run multi-node, please let me know; I have a couple thousand GPUs here waiting :-)
@surak Hi~ I'm currently not actively working on this project, but I'm still exploring various possibilities with Slurm. I believe your insights and tips on using Slurm will be very helpful for me, so I welcome and appreciate you sharing how you do it.
@DTDwind I do it like this:
I have a general settings file, where I set the controller host and a couple settings I need for every model:
I call the whole thing "Blablador", so you will see that word a lot. I also don't do the pip install -e . step, because this way I can change things and see results right away, so you will see that I call the Python scripts directly.
# Get the full path for the directory where this script is located
cd $( dirname -- "$0" )
export BLABLADOR_DIR="$(pwd)/FastChat"
export LOGDIR=$BLABLADOR_DIR/logs
export NCCL_P2P_DISABLE=1 # 3090s do not support p2p
export BLABLADOR_CONTROLLER=http://compute-node1.local
export BLABLADOR_CONTROLLER_PORT=21001
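For context, this assumes FastChat has been cloned into a FastChat subdirectory next to the config script. Where you put it is up to you, something like:

# one-time setup (the /data prefix is just an example; it has to match BLABLADOR_DIR)
cd /data
git clone https://github.com/lm-sys/FastChat.git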
For the controller, it's just a normal invocation on compute node 1, no Slurm:
#!/bin/bash
source config-blablador.sh # sets BLABLADOR_DIR and the controller variables
cd $BLABLADOR_DIR
source sc_venv_template/activate.sh
# I can leave it open to 0.0.0.0 as this host is not reachable from the internet
python3 fastchat/serve/controller.py --host 0.0.0.0
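By the way, sc_venv_template/activate.sh is just our site-specific environment script, not part of FastChat. A minimal stand-in, assuming a plain Python venv that already has FastChat's dependencies installed, would be something like:

#!/bin/bash
# minimal stand-in for sc_venv_template/activate.sh:
# load site modules here if your cluster needs them, then activate the venv
source /data/venvs/blablador/bin/activate  # example path, adjust to your setup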
The web server is also a shell script on another host, also only visible on our intranet:
#!/bin/bash
source config-blablador.sh
cd $BLABLADOR_DIR
source sc_venv_template/activate.sh
python3 fastchat/serve/gradio_web_server.py \
--share \
--model-list-mode=reload \
--host 0.0.0.0
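One thing to watch out for: gradio_web_server.py looks for the controller on localhost by default, so when it runs on a different host than the controller you will probably also want to point it at the controller explicitly, e.g. (check the flag name against your FastChat version):

python3 fastchat/serve/gradio_web_server.py \
    --controller-url $BLABLADOR_CONTROLLER:$BLABLADOR_CONTROLLER_PORT \
    --share \
    --model-list-mode=reload \
    --host 0.0.0.0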
The OpenAI API server can also run on any VM:
#!/bin/bash
source config-blablador.sh
cd $BLABLADOR_DIR
source sc_venv_template/activate.sh
python3 fastchat/serve/openai_api_server.py --host 0.0.0.0 --port 8000
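Once a worker has registered with the controller, you can sanity-check the whole chain through the OpenAI-compatible endpoints. The model name has to match one of the served models (Marcoroni-70B below is just the example from this thread), and api-server-host is whatever VM runs the script above:

# list the models the controller knows about
curl http://api-server-host:8000/v1/models

# send a chat request to one of them
curl http://api-server-host:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Marcoroni-70B", "messages": [{"role": "user", "content": "Hello!"}]}'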
And this is an example of one of the models I'm running. This one needs 7 GPUs:
#!/bin/bash
#SBATCH --job-name=Marcoroni-70B
#SBATCH --output=/data/blablador/logs/%j.txt
#SBATCH --error=/data/blablador/logs/%j.txt
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=100:00:00 # I like to release the machine every 4 days - gives me time to reevaluate which models I run
#SBATCH --gres=gpu:7
echo "I AM ON "$(hostname) " running Marcoroni-70B on 7 gpus"
export BLABLADOR_DIR="/data/FastChat" # gotta hardcode it here unfortunately
source $BLABLADOR_DIR/config-blablador.sh
cd $BLABLADOR_DIR
source $BLABLADOR_DIR/sc_venv_template/activate.sh
srun python3 $BLABLADOR_DIR/fastchat/serve/model_worker.py \
--controller $BLABLADOR_CONTROLLER:$BLABLADOR_CONTROLLER_PORT \
--port 31028 --worker http://$(hostname):31028 \
--num-gpus 7 \
--host 0.0.0.0 \
--model-path /data/FastChat/models/Marcoroni-70B
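Submitting and watching it is plain Slurm. Assuming the script above is saved as marcoroni-70b.sbatch (the filename is just an example):

sbatch marcoroni-70b.sbatch              # submit the worker job
squeue -u $USER                          # check that it is running
tail -f /data/blablador/logs/<jobid>.txt # follow the worker log (%j in the #SBATCH lines is the job id)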
Every model runs on any compute node, and they can all freely talk to the controller. I can fit up to 8 small models on a node with 8 GPUs. I don't share GPUs among jobs on my clusters. What I pay attention to is that each model has a different port: I just add a new port for each new model I add here, so I guess I have 28 of them now :-)
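For example, a second, smaller model would reuse the same job template with just the port, GPU count and model path changed (the model and port below are made up for illustration):

srun python3 $BLABLADOR_DIR/fastchat/serve/model_worker.py \
    --controller $BLABLADOR_CONTROLLER:$BLABLADOR_CONTROLLER_PORT \
    --port 31029 --worker http://$(hostname):31029 \
    --num-gpus 1 \
    --host 0.0.0.0 \
    --model-path /data/FastChat/models/vicuna-7b-v1.5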
Hope it helps!
Thank you for sharing; this is very helpful.
I will close this one, then, so we have fewer issues to worry about, OK? Feel free to either contact me directly or reopen this one if you feel it's not answered well enough.
> I have some internal code for running the training on Slurm. I'll try to commit it in a PR later.
Where is it? I can't run it multi-gpu, much less multi-node.
Hello,
Is it possible to run model parallelism on a Slurm cluster?
If so, could you provide some guidance on how to configure it?
If not, would you consider adding support for this?
I believe clustering is one of the key technologies for addressing GPU memory limitations.