Open · Allen-Czyysx opened this issue 3 years ago
Could you point me to (or share) instructions on how to run the one_billion_wds using GPipe on multiple local GPUs?
I was able to run GPipe with lm.one_billion_wds.OneBWdsGPipeTransformerWPM in a single node with multiple GPUs.
However, I am a little confused about how to run GPipe across multiple nodes (or "workers", as they are termed in Lingvo's configuration). I am not even sure whether the current version supports distributed pipeline parallelism.
If Lingvo does support distributed pipeline parallelism, I would appreciate any tutorials or examples that would help me configure distributed GPipe :)
Hi, when I try to run one_billion_wds using GPipe, I run into a problem with input generation (#250). Did you hit the same problem, and if so, how did you fix it?
Any update on how to run the model on local GPUs?
You can run it using the following command: trainer --run_locally=gpu --mode=sync --model=one_billion_wds.OneBWdsGPipeTransformerWPM --worker_split_size=<number of GPUs per split> --worker_gpus=<total number of GPUs>
I used the same command but got error #250. Do you have the same problem? If not, have you changed any code, or did you just build the testbed and run GPipe with this command?
Hi. I did get an error, IIRC. It failed when I gave different values for the worker split size and worker GPUs, so try giving them the same value. Also pass --enable_asserts=false as an additional parameter. Let me know if it still fails.
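Putting that advice together, a full invocation might look like the sketch below. The GPU counts are illustrative, and the binary path assumes a standard Bazel build of Lingvo; adjust both for your setup:

```shell
# Hypothetical example: 4 local GPUs, with --worker_split_size set equal
# to --worker_gpus as suggested above, and asserts disabled.
~/lingvo/bazel-bin/lingvo/trainer \
  --run_locally=gpu \
  --mode=sync \
  --model=lm.one_billion_wds.OneBWdsGPipeTransformerWPM \
  --logdir=/tmp/lm/log \
  --logtostderr \
  --worker_split_size=4 \
  --worker_gpus=4 \
  --enable_asserts=false
```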
Thanks a lot, I will try it.
Sorry for my late reply.
FYI, here is my command to run on local GPUs:
~/lingvo/bazel-bin/lingvo/trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformerWPM --logdir=/tmp/lm/log --logtostderr --worker_split_size=2 --worker_gpus=2 --batch_size=8 --micro_num=4
And I downloaded the dataset according to https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark.
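For reference, fetching and unpacking the benchmark corpus looks roughly like this. The URL is the one distributed alongside the linked repository (hosted on statmt.org); verify it against the repository's README before use:

```shell
# Download and unpack the 1B Word Benchmark corpus (large download).
wget http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
tar -xzf 1-billion-word-language-modeling-benchmark-r13output.tar.gz
```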
I remember there were some errors I overcame, but nothing like #250, so I'm sorry I can't help you there...
BTW, I am pretty sure that GPipe does not support distributed execution yet :/