locuslab / tofu


Issues with deepspeed #8

Closed: molereddy closed this issue 7 months ago

molereddy commented 7 months ago

While running finetune.py, I'm encountering MPI-related errors because of the deepspeed='config/ds_config.json' argument.

python finetune.py --config-name=finetune.yaml split=${split} batch_size=4 gradient_accumulation_steps=4 model_family=${model} lr=${lr}
[2024-02-23 05:39:55,649] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
num_devices: 1
max_steps: 1250
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
[2024-02-23 05:39:58,704] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-02-23 05:39:58,704] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[gpu009:2974788] pmix_mca_base_component_repository_open: unable to open mca_gds_ds21: /work/path/.conda/envs/tofu/bin/../lib/libmca_common_dstore.so.1: undefined symbol: pmix_gds_base_modex_unpack_kval (ignored)
[gpu009:2974788] pmix_mca_base_component_repository_open: unable to open mca_gds_ds12: /work/path/.conda/envs/tofu/bin/../lib/libmca_common_dstore.so.1: undefined symbol: pmix_gds_base_modex_unpack_kval (ignored)
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[gpu009:2974788] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

How essential is DeepSpeed for using the repository? I see that it is explicitly used only in the KL-divergence-based forget loss. I've already had to troubleshoot several DeepSpeed-related issues, such as mpi4py installation problems. In general, it seems I'm not alone in running into trouble with DeepSpeed; see https://www.reddit.com/r/Oobabooga/comments/13etobg/using_deepspeed_requires_lots_of_manual_tweaking/
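
For reference, the "Not using the DeepSpeed or dist launchers, attempting to detect MPI environment..." line above suggests DeepSpeed only falls back to MPI auto-discovery when the usual launcher environment variables are missing. Below is a minimal single-GPU sketch of sidestepping that path; the variable names are the standard torch.distributed ones, and exact handling may vary across DeepSpeed versions.

# Hedged sketch: pre-set the rendezvous variables a distributed launcher would
# normally export, so DeepSpeed's init_distributed() can initialize
# torch.distributed directly instead of probing for an MPI environment.
import os

os.environ.setdefault("RANK", "0")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
# ...then continue with the normal finetune.py entry point.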

pratyushmaini commented 7 months ago

I am sorry that you are having to face these issues, and thanks for patiently drilling through them. DeepSpeed, while not strictly critical, is how all finetuning in this repo is done: it lets us parallelize large models across multiple GPUs. Even though you don't see explicit references to DeepSpeed in the code, it is used extensively under the hood by the trainer. Let me share my exact conda environment so you can try cloning it; that may resolve any version dependencies that are causing this behaviour on your end.
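
To illustrate what "under the hood" means here: with the Hugging Face Trainer, enabling DeepSpeed is typically just a matter of passing the JSON config through TrainingArguments; the distributed setup and model wrapping then happen inside the trainer. A minimal sketch follows (the other argument values are placeholders, not TOFU's exact settings).

from transformers import TrainingArguments

# The deepspeed field is the only explicit hook; the Trainer reads the JSON
# config, initializes distributed communication (the init_distributed lines in
# the log above), and wraps the model in a DeepSpeed engine internally.
training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    deepspeed="config/ds_config.json",
)
# Trainer(model=..., args=training_args, ...).train() then runs with DeepSpeed.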

pratyushmaini commented 7 months ago

Please find the yaml file here: environment.yml.zip

Can you run the following command to create your environment and let me know if this solves the problem?

conda env create -f environment.yml
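
Once the environment is created, a quick sanity check that the relevant packages import cleanly might look like this (a minimal sketch, assuming the environment is activated):

# Verify the recreated environment before re-running finetune.py.
import torch
import deepspeed
import mpi4py

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("deepspeed:", deepspeed.__version__)
print("mpi4py:", mpi4py.__version__)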

molereddy commented 7 months ago

This helped, thanks so much!