huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

Multinode minimal example #115

Closed · staghado closed this issue 7 months ago

staghado commented 7 months ago

Was the code tested on multinode? If yes, could you provide a minimal example?

For now, training works on a single node, but on multiple nodes it just hangs at:

[Start training] datetime: 2024-03-22 19:46:52.930585 | mbs: 1 | grad_accum: 8 | global_batch_size: 8 | sequence_length: 4096 | train_steps: 5 | start_iteration_step: 0 | consumed_train_samples: 0

I am using the LLaMA example.

xrsrke commented 7 months ago

@staghado, hey, yes, Nanotron has been tested on multi-node setups (indeed, BigCode 7B was trained with Nanotron). What command did you use to run it?

staghado commented 7 months ago

Hey @xrsrke, I'm launching with torchrun under SLURM:

MODEL_ARGS="\
--config-file examples/config_llama.yaml
"

torchrun \
    --nnodes=$SLURM_NTASKS \
    --node_rank=$SLURM_NODEID \
    --nproc_per_node=$NPROC \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    run_train.py \
    ${MODEL_ARGS}
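
MASTER_ADDR, MASTER_PORT and NPROC are exported earlier in the sbatch script, roughly along these lines (the port and GPU count are just illustrative values):

# the first node in the allocation acts as the rendezvous host
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=6000
# one process per GPU on each node
export NPROC=8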

Do you have an example of how BigCode 7B was run?

staghado commented 7 months ago

Any updates on this?

Lauler commented 7 months ago

Here is a guide I wrote on getting it running multi-node on HPC with SLURM. Follow along with the README; the launch script looks like this.

Stas Bekman has good templates for launching multi-node jobs with different launchers: https://github.com/stas00/ml-engineering/tree/master/orchestration/slurm/launchers

How you launch a job depends on your cluster and which job manager you're running (if any). Since you're using SLURM, you should be able to follow along with my guide.
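
Roughly, the skeleton of such a batch script looks like this (job name, node/GPU counts and the port are placeholders; the linked guide and templates have complete versions):

#!/bin/bash
# 2 nodes, one launcher task per node, 8 GPUs per node (adjust to your cluster)
#SBATCH --job-name=nanotron-llama
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8

# the first node in the allocation acts as the rendezvous host
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=6000

# srun launches one torchrun per node; without it, only the batch node runs torchrun
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --node_rank=$SLURM_NODEID \
    --nproc_per_node=8 \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    run_train.py --config-file examples/config_llama.yaml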

Also, remember to adjust your data parallelism (dp) setting in the config if you increase the number of nodes.
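
As far as I understand, the total number of processes torchrun starts (nodes × GPUs per node) has to equal dp × tp × pp from the parallelism section of the config, so going from one node to two nodes of 8 GPUs means e.g. raising dp from 8 to 16 when tp = pp = 1. A quick sanity check you could drop into the launch script (numbers are illustrative):

# must mirror parallelism.dp / .tp / .pp in the YAML config
DP=16; TP=1; PP=1
NNODES=2; GPUS_PER_NODE=8
WORLD_SIZE=$((NNODES * GPUS_PER_NODE))
if [ "$WORLD_SIZE" -ne $((DP * TP * PP)) ]; then
    echo "dp*tp*pp=$((DP * TP * PP)) does not match world size $WORLD_SIZE" >&2
    exit 1
fi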

staghado commented 7 months ago

Thanks a lot! This is exactly what I was looking for. 🔥

ai-jz commented 7 months ago

@staghado, hey, yes, Nanotron has been tested on multi-node setups (indeed, BigCode 7B was trained with Nanotron).

Great to know! I think you could update https://huggingface.co/bigcode/starcoderbase-7b, which lists Megatron-LM as the training repo; Nanotron deserves the credit.

Btw, is Nanotron also used in production, e.g. for https://huggingface.co/training-cluster?