huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Stack-LLaMa: OSError: Can't load tokenizer when run reward #507

Closed SeekPoint closed 1 year ago

SeekPoint commented 1 year ago

After running supervised_finetuning.py, I got:

(gh_trl) amd00@MZ32-00:~/llm_dev/trl$ ll llama-se/final_checkpoint/
total 32828
drwxrwxr-x 2 amd00 amd00     4096 7月   9 11:13 ./
drwxrwxr-x 3 amd00 amd00     4096 7月   9 11:13 ../
-rw-rw-r-- 1 amd00 amd00      351 7月   9 11:13 adapter_config.json
-rw-rw-r-- 1 amd00 amd00 33600461 7月   9 11:13 adapter_model.bin
(gh_trl) amd00@MZ32-00:~/llm_dev/trl$

Then I copied the adapter files to config.json and model.bin:

(gh_trl) amd00@MZ32-00:~/llm_dev/trl/llama-se/final_checkpoint$ ll
total 65648
drwxrwxr-x 2 amd00 amd00     4096 7月   9 11:32 ./
drwxrwxr-x 3 amd00 amd00     4096 7月   9 11:13 ../
-rw-rw-r-- 1 amd00 amd00      351 7月   9 11:13 adapter_config.json
-rw-rw-r-- 1 amd00 amd00 33600461 7月   9 11:13 adapter_model.bin
-rw-rw-r-- 1 amd00 amd00      351 7月   9 11:32 config.json
-rw-rw-r-- 1 amd00 amd00 33600461 7月   9 11:32 model.bin
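
(Note that this directory still contains only the LoRA adapter weights from SFT and no tokenizer files. Below is a minimal sketch of merging the adapter into the base model instead of renaming files, similar in spirit to the example's merge_peft_adapter.py script; the model name and paths are placeholders.)

```python
# Rough sketch (not the original workflow): merge the LoRA adapter into the
# base model and write a full checkpoint. The model name and paths below are
# placeholders - use the base checkpoint supervised_finetuning.py was run with.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "huggyllama/llama-7b"      # placeholder base model
adapter_dir = "llama-se/final_checkpoint"    # SFT output (adapter only)
merged_dir = "llama-se/merged"               # where to write the merged model

base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(base_model, adapter_dir)
model = model.merge_and_unload()             # fold the LoRA weights into the base
model.save_pretrained(merged_dir)

# Save the tokenizer alongside it so AutoTokenizer.from_pretrained() finds it.
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.save_pretrained(merged_dir)
```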

Then I ran the reward modeling script:

(gh_trl) amd00@MZ32-00:~/llm_dev/trl$ torchrun --nnodes 1  --nproc_per_node 2 examples/stack_llama/scripts/reward_modeling.py --model_name=/home/amd00/llm_dev/trl/llama-se/final_checkpoint/
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[2023-07-09 11:32:55,378] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-07-09 11:32:55,392] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
bin /home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
/home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /home/amd00/anaconda3/envs/gh_trl did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/tmp/torchelastic_fmnkd2px/none_cih6y_hj/attempt_0/1/error.json')}
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
/home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /home/amd00/anaconda3/envs/gh_trl did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/tmp/torchelastic_fmnkd2px/none_cih6y_hj/attempt_0/0/error.json')}
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Found cached dataset parquet (/home/amd00/.cache/huggingface/datasets/lvwerra___parquet/lvwerra--stack-exchange-paired-ea956f7e49277b88/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
Found cached dataset parquet (/home/amd00/.cache/huggingface/datasets/lvwerra___parquet/lvwerra--stack-exchange-paired-ea956f7e49277b88/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
Found cached dataset parquet (/home/amd00/.cache/huggingface/datasets/lvwerra___parquet/lvwerra--stack-exchange-paired-6fbcbcc16115b7c8/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
output_name : _peft_stack-exchange-paired_rmts__100000_2e-05
Found cached dataset parquet (/home/amd00/.cache/huggingface/datasets/lvwerra___parquet/lvwerra--stack-exchange-paired-6fbcbcc16115b7c8/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
output_name : _peft_stack-exchange-paired_rmts__100000_2e-05
script_args.model_name: /home/amd00/llm_dev/trl/llama-se/final_checkpoint/
Traceback (most recent call last):
  File "/home/amd00/llm_dev/trl/examples/stack_llama/scripts/reward_modeling.py", line 139, in <module>
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, use_auth_token=True)
  File "/home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 711, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1796, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for '/home/amd00/llm_dev/trl/llama-se/final_checkpoint/'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/home/amd00/llm_dev/trl/llama-se/final_checkpoint/' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.
script_args.model_name: /home/amd00/llm_dev/trl/llama-se/final_checkpoint/
Traceback (most recent call last):
  File "/home/amd00/llm_dev/trl/examples/stack_llama/scripts/reward_modeling.py", line 139, in <module>
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, use_auth_token=True)
  File "/home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 711, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1796, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for '/home/amd00/llm_dev/trl/llama-se/final_checkpoint/'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/home/amd00/llm_dev/trl/llama-se/final_checkpoint/' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3184) of binary: /home/amd00/anaconda3/envs/gh_trl/bin/python
Traceback (most recent call last):
  File "/home/amd00/anaconda3/envs/gh_trl/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/amd00/anaconda3/envs/gh_trl/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/stack_llama/scripts/reward_modeling.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-07-09_11:33:08
  host      : MZ32-00
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 3185)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-09_11:33:08
  host      : MZ32-00
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3184)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
younesbelkada commented 1 year ago

Hi @SeekPoint, thanks for the issue! What transformers version are you using? Alternatively, you can manually save the tokenizer into the corresponding folder and then run the script. Also, please format the error trace properly by wrapping it in a code block (``` at the beginning and ``` at the end).
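
A minimal sketch of that workaround, assuming the tokenizer comes from the same base model that supervised_finetuning.py was started from (the model name below is a placeholder):

```python
from transformers import AutoTokenizer

# Load the base model's tokenizer ("huggyllama/llama-7b" is a placeholder)
# and save it into the SFT checkpoint directory so that
# AutoTokenizer.from_pretrained() finds tokenizer files there.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.save_pretrained("/home/amd00/llm_dev/trl/llama-se/final_checkpoint/")
```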

SeekPoint commented 1 year ago

(gh_trl) amd00@MZ32-00:~/llm_dev/trl$ pip list | grep trans
transformers              4.29.2

SeekPoint commented 1 year ago

It still fails with transformers 4.30.2.

SeekPoint commented 1 year ago

Why was the stack_llama example deleted?

lvwerra commented 1 year ago

We didn't delete it, we just moved it: https://github.com/lvwerra/trl/tree/main/examples/research_projects/stack_llama/scripts

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.