nebuly-ai / nebuly

The user analytics platform for LLMs
https://www.nebuly.com/
Apache License 2.0

For rlhf_accelerate branch, can't run with multi-GPU #288

Open balcklive opened 1 year ago

balcklive commented 1 year ago

my `~/.cache/huggingface/accelerate/default_config.yaml` is:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
downcast_bf16: 'no'
dynamo_config: {}
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

my start training command is:

```shell
accelerate launch --multi_gpu /home/ubuntu/ubuntu/artifacts/main.py /home/ubuntu/ubuntu/artifacts/config/config.yaml --type REWARD
```

and then I got this:

[screenshot of the training output]

It seems that all the processes are using the same GPU. How did that happen? With the old version of the main branch, I successfully ran multi-GPU training using this command.
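A common cause of every process landing on the same GPU is a training script that pins the model to a fixed device (e.g. `cuda:0`) instead of the process-local rank. As a stdlib-only sketch (a hypothetical helper, not nebuly's actual code): `accelerate launch` exports a distinct `LOCAL_RANK` per process on each machine, and a script can derive its device from it:

```python
import os

def local_device(default_rank: int = 0) -> str:
    """Return the CUDA device string for this process.

    accelerate launch (via torch.distributed) sets LOCAL_RANK to a
    distinct value for each process on a machine; using it instead of
    a hard-coded "cuda:0" spreads num_processes workers across GPUs.
    """
    rank = int(os.environ.get("LOCAL_RANK", default_rank))
    return f"cuda:{rank}"

if __name__ == "__main__":
    # With num_processes: 4, the four workers would report
    # cuda:0, cuda:1, cuda:2 and cuda:3 respectively.
    print(local_device())
```

If all four processes report `cuda:0` here, the launcher is not propagating ranks; if the ranks differ but GPU memory still piles up on one card, the device placement inside the training code is the suspect.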

PierpaoloSorbellini commented 1 year ago

Hi @balcklive, thanks for reaching out! We are currently working on the matter and will get back to you as soon as we have a fix!

PierpaoloSorbellini commented 1 year ago

Hi @balcklive, the new PR #306 should have fixed this problem! Remember to start the training using `deepspeed` or `accelerate launch` instead of `python` (more on this in the README of the linked PR), and to enable one of them in the config.yaml. Sorry for the delay!
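For reference, switching the Accelerate config from plain multi-GPU to DeepSpeed looks roughly like the sketch below. The field values are illustrative defaults, not nebuly-specific settings; running `accelerate config` will generate a file tailored to your machine:

```yaml
# ~/.cache/huggingface/accelerate/default_config.yaml (illustrative sketch)
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED      # instead of MULTI_GPU
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero_stage: 2                  # ZeRO stage; 2 is a common starting point
mixed_precision: 'no'
num_machines: 1
num_processes: 4
```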