@staghado, hey, yes, Nanotron is tested on multi-node (indeed, BigCode 7B was trained using Nanotron). What command did you use to run it?
Hey @xrsrke, I'm using the torchrun command with SLURM:
MODEL_ARGS="\
    --config-file examples/config_llama.yaml
"

torchrun \
    --nnodes=$SLURM_NTASKS \
    --node_rank=$SLURM_NODEID \
    --nproc_per_node=$NPROC \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    run_train.py \
    ${MODEL_ARGS}
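For reference, SLURM does not set NPROC, MASTER_ADDR, or MASTER_PORT on its own; they have to be derived in the batch script before torchrun is called, and they must resolve to the same master endpoint on every node. A minimal sketch of one common way to do this (the exact variable sources and the port number are assumptions, not taken from the poster's script):

# NPROC: GPUs per node; SLURM exports SLURM_GPUS_ON_NODE when GPUs are requested.
NPROC=${SLURM_GPUS_ON_NODE:-8}
# MASTER_ADDR: hostname of the first node in the allocation.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
# MASTER_PORT: any free port, identical across all nodes.
MASTER_PORT=6000
export NPROC MASTER_ADDR MASTER_PORT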
Do you have an example of how BigCode 7B was run?
Any updates on this?
Here is a guide I wrote to get it running multi-node on HPC with SLURM. Follow along with the README; the launch script looks like this.
Stas Bekman has good templates for launching multi-node jobs with different launchers: https://github.com/stas00/ml-engineering/tree/master/orchestration/slurm/launchers
How you launch a job depends on your cluster and on which job manager you're running (if any). Since you're using SLURM, you should be able to follow along with my guide.
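To make that concrete, here is a minimal sbatch-style sketch in the spirit of those launchers; the job name, node and GPU counts, and port are placeholders, not taken from the guide itself:

#!/bin/bash
#SBATCH --job-name=nanotron-multinode   # placeholder name
#SBATCH --nodes=2                       # placeholder node count
#SBATCH --ntasks-per-node=1             # one torchrun launcher per node
#SBATCH --gres=gpu:8                    # placeholder GPUs per node

# Rendezvous endpoint: first node of the allocation, arbitrary free port.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=6000

# srun starts one torchrun per node; torchrun spawns one worker per GPU.
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    run_train.py --config-file examples/config_llama.yaml

With the c10d rendezvous backend, torchrun assigns node ranks itself, so no explicit --node_rank is needed.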
Also, remember to adjust your data parallelism (dp) setting in the config if you increase the number of nodes.
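For illustration, in the example Nanotron configs the parallel sizes live under a parallelism section; the values below are placeholders for a 2-node, 8-GPU-per-node run and assume, as in the example configs, that dp × tp × pp equals the total number of GPUs:

# Illustrative values only: 8 * 2 * 1 = 16 = 2 nodes x 8 GPUs.
# Double-check the key names against your examples/config_llama.yaml.
parallelism:
  dp: 8    # data-parallel replicas (placeholder)
  tp: 2    # tensor-parallel size (placeholder)
  pp: 1    # pipeline-parallel size (placeholder)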
Thanks a lot! This is exactly what I was looking for. 🔥
@staghado, hey, yes, Nanotron is tested on multi-node (indeed, BigCode 7B was trained using Nanotron).
Great to know! I think you could update https://huggingface.co/bigcode/starcoderbase-7b, which currently lists Megatron-LM as the training code. Nanotron deserves the credit.
Btw, is Nanotron also used for production, e.g., for https://huggingface.co/training-cluster?
Was the code tested on multi-node? If yes, could you provide a minimal example?
For now, training works on a single node, but on multiple nodes it just hangs at:
[Start training] datetime: 2024-03-22 19:46:52.930585 | mbs: 1 | grad_accum: 8 | global_batch_size: 8 | sequence_length: 4096 | train_steps: 5 | start_iteration_step: 0 | consumed_train_samples: 0
I am using the LLaMA example.
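Not from this thread, but a common first diagnostic when a multi-node run hangs right after the [Start training] banner is to enable distributed/NCCL logging before re-launching, e.g.:

export NCCL_DEBUG=INFO                  # log NCCL initialization and transport selection
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # extra checks and logging in torch.distributed
# On multi-homed nodes, pinning the network interface often resolves hangs
# (the interface name is cluster-specific):
# export NCCL_SOCKET_IFNAME=eth0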