nebuly-ai / nebuly

The user analytics platform for LLMs
https://www.nebuly.com/
Apache License 2.0

For rlhf_accelerate branch, can't run with multi-GPU #288

Open balcklive opened 1 year ago

balcklive commented 1 year ago

my `~/.cache/huggingface/accelerate/default_config.yaml` is:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_config: {}
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

my start training command is:

```shell
accelerate launch --multi_gpu /home/ubuntu/ubuntu/artifacts/main.py /home/ubuntu/ubuntu/artifacts/config/config.yaml --type REWARD
```

and then I got this:

[screenshot of the training output]

It seems that all the processes are using the same GPU. How did that happen? With the old version of the main branch, I successfully ran multi-GPU training using this command.
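A common cause of every process landing on the same GPU is a training script that pins the model to a fixed device (e.g. `cuda:0`) instead of the process-local rank. As a stdlib-only sketch (a hypothetical helper, not nebuly's actual code): `accelerate launch` exports a distinct `LOCAL_RANK` per process on each machine, and a script can derive its device from it:

```python
import os

def local_device(default_rank: int = 0) -> str:
    """Return the CUDA device string for this process.

    accelerate launch (via torch.distributed) sets LOCAL_RANK to a
    distinct value for each process on a machine; using it instead of
    a hard-coded "cuda:0" spreads num_processes workers across GPUs.
    """
    rank = int(os.environ.get("LOCAL_RANK", default_rank))
    return f"cuda:{rank}"

if __name__ == "__main__":
    # With num_processes: 4, the four workers would report
    # cuda:0, cuda:1, cuda:2 and cuda:3 respectively.
    print(local_device())
```

If all four processes report `cuda:0` here, the launcher is not propagating ranks; if the ranks differ but GPU memory still piles up on one card, the device placement inside the training code is the suspect.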

PierpaoloSorbellini commented 1 year ago

Hi @balcklive, thanks for reaching out! We are currently working on the matter and will get back to you as soon as we have a fix!

PierpaoloSorbellini commented 1 year ago

Hi @balcklive, the new PR #306 should have fixed this problem! Remember to start the training using `deepspeed` or `accelerate launch` instead of `python` (more on this in the README of the linked PR), and to enable one of them in the config.yaml. Sorry for the delay!
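For reference, switching the Accelerate config from plain multi-GPU to DeepSpeed looks roughly like the sketch below. The field values are illustrative defaults, not nebuly-specific settings; running `accelerate config` will generate a file tailored to your machine:

```yaml
# ~/.cache/huggingface/accelerate/default_config.yaml (illustrative sketch)
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED      # instead of MULTI_GPU
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero_stage: 2                  # ZeRO stage; 2 is a common starting point
mixed_precision: 'no'
num_machines: 1
num_processes: 4
```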