mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

GNMT training: how do I use multiple GPUs? #438

Closed alphaRGB closed 1 year ago

alphaRGB commented 3 years ago

I am not familiar with Docker, so I installed pytorch==1.7.0 with Anaconda:

# workspace
cd training/rnn_translator/pytorch
# run training
bash run_and_time.sh

output

2021-01-18 20:23:41 - INFO - 0 - Saving results to: results/gnmt
2021-01-18 20:23:41 - INFO - 0 - Run arguments: Namespace(batching='bucketing', beam_size=5, cov_penalty_factor=0.1, cuda=True, cudnn=True, dataset_dir='/home/srcxfim/SRCXFIM/penghui_wei/FIM/MLPerf_training/training/rnn_translator/data', decay_factor=0.5, decay_interval=None, decay_steps=4, dropout=0.2, env=False, epochs=8, eval=True, grad_clip=5.0, hidden_size=1024, intra_epoch_eval=0, keep_checkpoints=0, len_norm_const=5.0, len_norm_factor=0.6, local_rank=0, lr=0.001, math='fp32', max_length_test=150, max_length_train=50, max_length_val=125, max_size=None, min_length_test=0, min_length_train=0, min_length_val=0, num_buckets=5, num_layers=4, optimizer='Adam', optimizer_extra='{}', print_freq=10, rank=0, remain_steps=0.666, results_dir='results', resume=None, save='gnmt', save_all=False, save_freq=5000, save_path='results/gnmt', seed=1, shard_size=80, share_embedding=True, smoothing=0.1, start_epoch=0, target_bleu=24.0, test_batch_size=128, test_loader_workers=0, train_batch_size=128, train_global_batch_size=None, train_iter_size=1, train_loader_workers=2, val_batch_size=64, val_loader_workers=0, warmup_steps=200)
2021-01-18 20:23:41 - INFO - 0 - Using master seed from command line: 1
2021-01-18 20:23:41 - INFO - 0 - Worker 0 is using worker seed: 3280387012

2021-01-18 20:23:59 - INFO - 0 - Sampler for epoch 0 uses seed 1095513148
2021-01-18 20:24:00 - INFO - 0 - TRAIN [0][0/27326] Time 1.391 (0.000)  Data 6.63e-01 (0.00e+00)    Tok/s 7406 (0)  Loss/tok 10.6162 (10.6162)  LR 1.023e-05
2021-01-18 20:24:05 - INFO - 0 - TRAIN [0][10/27326]    Time 0.285 (0.480)  Data 2.00e-04 (2.35e-04)    Tok/s 12341 (13679) Loss/tok 9.9203 (10.2954)   LR 1.288e-05
2021-01-18 20:24:10 - INFO - 0 - TRAIN [0][20/27326]    Time 0.286 (0.458)  Data 3.26e-04 (2.34e-04)    Tok/s 12549 (13591) Loss/tok 9.4140 (10.0565)   LR 1.622e-05
2021-01-18 20:24:15 - INFO - 0 - TRAIN [0][30/27326]    Time 0.554 (0.469)  Data 1.83e-04 (2.32e-04)    Tok/s 14472 (13633) Loss/tok 9.3890 (9.8618)    LR 2.042e-05
2021-01-18 20:24:18 - INFO - 0 - TRAIN [0][40/27326]    Time 0.289 (0.444)  Data 1.77e-04 (2.26e-04)    Tok/s 12274 (13453) Loss/tok 8.9960 (9.7353)    LR 2.570e-05

There are 8 NVIDIA GPUs in my machine, but only 1 GPU is used; the other GPUs stay idle. It looks like the GNMT code in this repo supports multi-GPU training:

distributed = parser.add_argument_group('distributed setup')

So how do I enable multi-GPU training?
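
Since you are running outside Docker, the usual single-node multi-GPU pattern for a PyTorch 1.7 DistributedDataParallel script is to start one process per GPU with torch.distributed.launch, which passes --local_rank to each process (the local_rank/rank entries in the Namespace dump above suggest train.py expects this). A minimal sketch, run from training/rnn_translator/pytorch; the script name train.py and the --dataset-dir flag are assumptions read off the argument dump, not verified against run_and_time.sh:

# Hedged sketch: one worker process per GPU on a single node.
# --nproc_per_node is a standard torch.distributed.launch option (PyTorch 1.7);
# "train.py" and "--dataset-dir" are assumptions based on the Namespace dump above.
DATASET_DIR=../data   # adjust to wherever the WMT16 data was downloaded
python -m torch.distributed.launch --nproc_per_node=8 \
  train.py --dataset-dir $DATASET_DIR --seed 1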

alphaRGB commented 3 years ago

Running in Docker:

README:

one can control which GPUs are used with the NV_GPU variable:

sudo NV_GPU=0 nvidia-docker run -it --rm --ipc=host \
  -v $(pwd)/../data:/data \
  gnmt:latest "./run_and_time.sh" $SEED |tee benchmark-$NOW.log

I tried to use 3 GPUs, but only GPU_id=3 is used:

sudo NV_GPU=3,4,5 nvidia-docker run -it --rm --ipc=host \
  -v $(pwd)/../data:/data \
  gnmt:latest "./run_and_time.sh" $SEED |tee benchmark-$NOW.log
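
Most likely NV_GPU only controls which physical GPUs nvidia-docker exposes to the container; inside the container the three selected devices are renumbered 0-2, and a single-process run still uses only device 0, which is host GPU 3. That would match what you see. To use all three, the script inside the container still has to start one process per visible GPU, for example with torch.distributed.launch. The command below is only a sketch; whether run_and_time.sh already wraps an equivalent launcher, and where train.py lives inside the gnmt:latest image, are assumptions:

# Hedged sketch: three host GPUs exposed to the container, three worker
# processes started inside it. "train.py" and its path inside gnmt:latest
# are assumptions; run_and_time.sh may already do something equivalent.
sudo NV_GPU=3,4,5 nvidia-docker run -it --rm --ipc=host \
  -v $(pwd)/../data:/data \
  gnmt:latest \
  python -m torch.distributed.launch --nproc_per_node=3 train.py --dataset-dir /data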
adeemjassani commented 3 years ago

@alphaRGB Did you find a solution?

johntran-nv commented 1 year ago

Closing because GNMT is deprecated and no longer part of the benchmark suite.