tensorflow / lingvo

How to run GPipe in a distributed manner? #245

Open Allen-Czyysx opened 3 years ago

Allen-Czyysx commented 3 years ago

I was able to run GPipe with lm.one_billion_wds.OneBWdsGPipeTransformerWPM on a single node with multiple GPUs.

However, I am a little confused about how to run GPipe across multiple nodes (or "workers", as they are called in Lingvo's configuration). I am not even sure whether the current version supports distributed pipeline parallelism.

If Lingvo does support distributed pipeline parallelism, I would appreciate any tutorials or examples to help me configure distributed GPipe :)

siddharth-krishna commented 3 years ago

Could you point me to (or share) instructions on how to run the one_billion_wds using GPipe on multiple local GPUs?

xsppp commented 3 years ago

> I was able to run GPipe with lm.one_billion_wds.OneBWdsGPipeTransformerWPM on a single node with multiple GPUs.
>
> However, I am a little confused about how to run GPipe across multiple nodes (or "workers", as they are called in Lingvo's configuration). I am not even sure whether the current version supports distributed pipeline parallelism.
>
> If Lingvo does support distributed pipeline parallelism, I would appreciate any tutorials or examples to help me configure distributed GPipe :)

Hi, when I try to run one_billion_wds using GPipe, I run into a problem with input generation (#250). Did you have the same problem, and if so, how did you fix it?

adis98 commented 3 years ago

Any update on how to run the model on local GPUs?

adis98 commented 3 years ago

> Could you point me to (or share) instructions on how to run the one_billion_wds using GPipe on multiple local GPUs?

You can run it with the following command: `trainer --run_locally=gpu --mode=sync --model=one_billion_wds.OneBWdsGPipeTransformerWPM --worker_split_size=<number of GPUs per split> --worker_gpus=<total number of GPUs>`
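
For instance, on a machine with 4 GPUs, using 2 GPUs per split, the invocation might look like the sketch below. The flag names and the bazel-built trainer path are taken from the commands in this thread; the specific values and paths are only illustrative:

```sh
# Sketch only: flags are those shared in this thread.
# Adjust the binary path, log directory, and GPU counts for your own machine.
bazel-bin/lingvo/trainer \
  --run_locally=gpu \
  --mode=sync \
  --model=one_billion_wds.OneBWdsGPipeTransformerWPM \
  --logdir=/tmp/lm/log \
  --logtostderr \
  --worker_split_size=2 \
  --worker_gpus=4
```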

xsppp commented 3 years ago

> You can run it with the following command: `trainer --run_locally=gpu --mode=sync --model=one_billion_wds.OneBWdsGPipeTransformerWPM --worker_split_size=<number of GPUs per split> --worker_gpus=<total number of GPUs>`

I used the same command but got an error (#250). Did you have the same problem? If not, did you change any code, or did you just build the testbed and run GPipe with this command?

adis98 commented 3 years ago

> I used the same command but got an error (#250). Did you have the same problem? If not, did you change any code, or did you just build the testbed and run GPipe with this command?

Hi. I did get an error, IIRC. It failed when I gave different values for the worker split and worker GPUs, so try giving them the same value. Also, pass --enable_asserts=false as an additional parameter. Let me know if it still fails.
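
Concretely, that suggestion amounts to giving --worker_split_size and --worker_gpus the same value and disabling asserts, e.g. something like the following sketch on a 2-GPU machine (flags taken from this thread; values are illustrative):

```sh
# Sketch: same value for --worker_split_size and --worker_gpus, asserts disabled,
# as suggested above. Adjust the binary path and GPU count for your machine.
bazel-bin/lingvo/trainer \
  --run_locally=gpu \
  --mode=sync \
  --model=one_billion_wds.OneBWdsGPipeTransformerWPM \
  --worker_split_size=2 \
  --worker_gpus=2 \
  --enable_asserts=false
```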

xsppp commented 3 years ago

> Hi. I did get an error, IIRC. It failed when I gave different values for the worker split and worker GPUs, so try giving them the same value. Also, pass --enable_asserts=false as an additional parameter. Let me know if it still fails.

Thanks a lot, I will try it.

Allen-Czyysx commented 3 years ago

Sorry for my late reply. FYI, here is the command I use to run on local GPUs (shown below), and I downloaded the dataset following https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark.
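
```sh
~/lingvo/bazel-bin/lingvo/trainer \
  --run_locally=gpu \
  --mode=sync \
  --model=lm.one_billion_wds.OneBWdsGPipeTransformerWPM \
  --logdir=/tmp/lm/log \
  --logtostderr \
  --worker_split_size=2 \
  --worker_gpus=2 \
  --batch_size=8 \
  --micro_num=4
```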

I remember there were some errors I had to overcome, but nothing like #250, so I'm afraid I can't help you there...

BTW, I am pretty sure that GPipe does not support distributed execution yet :/