huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

Issue Running `run_sft.py` After Configuration Changes in GMAL Folder : (ChildFailedError) #162

Closed alielfilali01 closed 5 months ago

alielfilali01 commented 5 months ago

Hello Team,

Firstly, I'd like to express my appreciation for your excellent work on this. It's been an invaluable resource.

I am encountering an issue with the run_sft.py file after some modifications. In my personal fork, I created a directory named GMAL and copied into it the run_sft.py file from @BramVanroy's gpt2-nl folder (find the run_sft.py file here). I made minimal changes to the configuration file: I updated the model_path from gpt2 to mistral-7B and switched the dataset to my personal dataset, 2A2I-R/AOT-v1 (find the config file I made here).

Here is the command I executed on an A100 Colab instance (after installing all dependencies) :

# LoRA training on a single GPU (for more GPUs set --num_processes=n, set --load_in_4bit=true for QLoRA)
!ACCELERATE_LOG_LEVEL=info accelerate launch \
    --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 \
    recipes/GMAL/scripts/run_sft.py \
    recipes/GMAL/stage1/mistral7b_sft_config_full.yaml

Unfortunately, I am faced with the following error:

INFO:root:Using nproc_per_node=1.
/usr/bin/python3: can't open file '/content/alignment-handbook-personal-version/recipes/GMAL/scripts/run_sft.py': [Errno 2] No such file or directory
[2024-04-27 12:58:28,016] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 10592) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1014, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 672, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
recipes/GMAL/scripts/run_sft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
  time      : 2024-04-27_12:58:28
  host      : 8d4ff524a006
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 10592)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Could you please assist me in resolving this issue? Thank you for your time and support.

cc : @lewtun , @edbeeching , @natolambert

BramVanroy commented 5 months ago

The issue is how you have structured your repo and paths. Always look for the FIRST error that occurs. In your case:

INFO:root:Using nproc_per_node=1.
/usr/bin/python3: can't open file '/content/alignment-handbook-personal-version/recipes/GMAL/scripts/run_sft.py': [Errno 2] No such file or directory
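
To illustrate the "look for the FIRST error" advice, here is a minimal sketch that scans a captured log and returns the first line that looks like an error, so the ChildFailedError summary at the bottom does not hide the real cause. The function name first_error_line and the list of patterns are assumptions made for this example; they are heuristics, not part of accelerate or torch.

```python
import re
from typing import Optional

# Heuristic markers of a "real" error line. This list is an assumption for
# illustration only; extend it for your own logs.
ERROR_PATTERNS = re.compile(r"\[Errno \d+\]|can't open file|Error\b|Traceback")


def first_error_line(log_text: str) -> Optional[str]:
    """Return the first line of the log that matches an error pattern, or None."""
    for line in log_text.splitlines():
        if ERROR_PATTERNS.search(line):
            return line.strip()
    return None
```

Applied to the log in this issue, the first match would be the "can't open file ... No such file or directory" line, which appears before the Python traceback and the ChildFailedError report.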

alielfilali01 commented 5 months ago

Thank you, dear @BramVanroy, for responding. Yes, I did see that, but the problem is that I am cloning my personal repo, alignment-handbook-personal-version. This is my code cell:

## Change to current repo
!git clone https://github.com/alielfilali01/alignment-handbook-personal-version.git
%cd ./alignment-handbook-personal-version/

!python -m pip install .

And as you can see here, the run_sft.py file exists in the specified path.

So I figured maybe there is an issue with accelerate, or something else that cannot read my path properly?

alielfilali01 commented 5 months ago

EDIT: It turned out to be a hidden space in the name of the scripts folder: the folder name contained an extra space instead of being exactly scripts 😄 Thanks again @BramVanroy for responding 🤗
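
For anyone hitting the same "No such file or directory" error on a path that looks correct, a small check like the one below can reveal file or folder names with leading or trailing whitespace. This is a minimal sketch, not part of the alignment-handbook; the helper name find_whitespace_padded_names is hypothetical, and the thread does not say whether the stray space was at the start or the end of the name, so the check covers both.

```python
import os
from typing import List


def find_whitespace_padded_names(root: str) -> List[str]:
    """Return paths under root whose last component starts or ends with whitespace."""
    suspicious = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            # A name that changes when stripped has hidden leading/trailing whitespace.
            if name != name.strip():
                suspicious.append(os.path.join(dirpath, name))
    return sorted(suspicious)
```

Calling find_whitespace_padded_names("recipes") from the repository root and printing each result with repr() makes the extra space visible inside the quotes.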