togethercomputer / OpenChatKit

Apache License 2.0
9k stars, 1.01k forks

Issue Converting Weights to Huggingface Format #21

Closed davismartens closed 1 year ago

davismartens commented 1 year ago

I'm trying to convert the weights as per the example but running into an issue.

After running

mkdir huggingface_models \ && python tools/convert_to_hf_gptneox.py \ --ckpt-path model_ckpts/GPT-Neo-XT-Chat-Base-20B/checkpoint_5 --save-path /huggingface_models/GPT-NeoXT-Chat-Base-20B --n-stages 8 --n-layer-per-stage 6

I'm getting this error:

Traceback (most recent call last):
  File "/mnt/c/Users/name/OpenChatKit/tools/convert_to_hf_gptneox.py", line 102, in <module>
    assert args.save_path is not None
AssertionError
--save-path: command not found
--n-stages: command not found
--n-layer-per-stage: command not found

I'm using Ubuntu 22.04.2 LTS under WSL on Windows 11.

csris commented 1 year ago

That's a typo in the README. I'll put a fix up in a moment. It should be:

mkdir huggingface_models \
  && python tools/convert_to_hf_gptneox.py \
       --ckpt-path model_ckpts/GPT-Neo-XT-Chat-Base-20B/checkpoint_5 \
       --save-path huggingface_models/GPT-NeoXT-Chat-Base-20B \
       --n-stages 8 \
       --n-layer-per-stage 6

Note all the backslashes.

Let me know if that works for you!

(edit: fixed typo in the command)
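For reference, the "--save-path: command not found" messages are exactly what you get when a line continuation breaks, e.g. from a stray space after a trailing backslash: the backslash then escapes the space instead of the newline, so the shell runs the next line as a separate command. A minimal reproduction (file path here is illustrative, not from the repo):

```shell
# A space after the trailing backslash escapes the space, not the newline,
# so "--flag two" on the next line runs as its own (nonexistent) command.
printf 'echo one \\ \n--flag two\n' > /tmp/broken.sh
bash /tmp/broken.sh 2>&1 | grep 'command not found'
```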

davismartens commented 1 year ago

@csris thanks, I'm no longer getting the error but now I'm getting:

Traceback (most recent call last):
  File "/mnt/c/Users/[user]/OpenChatKit/tools/convert_to_hf_gptneox.py", line 105, in <module>
    os.mkdir(args.save_path)
FileNotFoundError: [Errno 2] No such file or directory: '/huggingface_models/GPT-NeoXT-Chat-Base-20B'

The command expects GPT-NeoXT-Chat-Base-20B to be in /huggingface_models/. Is this supposed to point to pretrained or another directory?

csris commented 1 year ago

The README has another typo. Run this from the root of the repo:

mkdir huggingface_models \
  && python tools/convert_to_hf_gptneox.py \
       --ckpt-path model_ckpts/GPT-Neo-XT-Chat-Base-20B/checkpoint_5 \
       --save-path huggingface_models/GPT-NeoXT-Chat-Base-20B \
       --n-stages 8 \
       --n-layer-per-stage 6
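The underlying failure is that os.mkdir creates only the final path component, so an absolute --save-path like /huggingface_models/... fails unless /huggingface_models already exists, while the relative path is created under the current directory. A minimal sketch of the difference:

```python
import os
import tempfile

os.chdir(tempfile.mkdtemp())  # stand-in for the OpenChatKit repo root

# Relative --save-path: each mkdir succeeds because its parent exists.
os.mkdir("huggingface_models")
os.mkdir("huggingface_models/GPT-NeoXT-Chat-Base-20B")

# Absolute --save-path: os.mkdir only creates the final component, and the
# parent /huggingface_models does not exist, hence FileNotFoundError (errno 2).
try:
    os.mkdir("/huggingface_models/GPT-NeoXT-Chat-Base-20B")
except FileNotFoundError as exc:
    print("errno", exc.errno)
```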

csris commented 1 year ago

Also, make sure to update the path to the checkpoint (the --ckpt-path flag) to point at your desired checkpoint.

davismartens commented 1 year ago

@csris is there documentation on the different checkpoints? How do I decide which --ckpt-path to pick?

csris commented 1 year ago

@LorrinWWW can give better advice than I can. But I'll do my best:

If you just want to make sure the toolchain is working, you can configure the script to produce a checkpoint every 5 steps, so you don't have to wait an hour. Just change the CHECKPOINT_STEPS variable on this line to 5.
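The edit itself is a one-liner; sketched here against a stand-in file, since the exact path of the training script may differ:

```shell
# Stand-in for the training script; the real path in the repo may differ.
echo 'CHECKPOINT_STEPS=100' > /tmp/train_config.sh
# Drop the checkpoint interval to 5 steps so a ckpt appears quickly.
sed -i 's/^CHECKPOINT_STEPS=.*/CHECKPOINT_STEPS=5/' /tmp/train_config.sh
cat /tmp/train_config.sh
```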

LorrinWWW commented 1 year ago

@davismartens The training script saves a ckpt per CHECKPOINT_STEPS, so usually you can just pick the latest one :)
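One way to grab the latest one automatically, assuming checkpoints are written as checkpoint_<step> directories (the layout below is a stand-in, not the real model_ckpts tree):

```shell
# Stand-in checkpoint directories; real ones live under model_ckpts/<model-name>/.
mkdir -p /tmp/ckpts/checkpoint_2 /tmp/ckpts/checkpoint_5 /tmp/ckpts/checkpoint_10
# Numeric sort on the step suffix, so checkpoint_10 ranks above checkpoint_5.
ls -d /tmp/ckpts/checkpoint_* | sort -t_ -k2 -n | tail -n1
```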

davismartens commented 1 year ago

@LorrinWWW great thanks. Can I run the pretrained model without training too?

LorrinWWW commented 1 year ago

> @LorrinWWW great thanks. Can I run the pretrained model without training too?

Sure! You can run our pretrained base model.

csris commented 1 year ago

@davismartens, would you like to join our Discord server? Here's an invite link: https://discord.gg/9Rk6sSeWEG.

davismartens commented 1 year ago

@LorrinWWW thank you. When I run python inference/bot.py --model togethercomputer/GPT-NeoXT-Chat-Base-20B I receive the following error:

Traceback (most recent call last):
  File "/mnt/c/Users/davis/dev-projects/OpenChatKit/inference/bot.py", line 7, in <module>
    import retrieval.wikipedia as wp
ModuleNotFoundError: No module named 'retrieval'

Any idea why it doesn't work?

@csris joined :)

LorrinWWW commented 1 year ago

@davismartens It appears that bot.py is unable to locate the retrieval module, which should be present in the root directory of the OpenChatKit repository.

Could you try running the bot.py script again while ensuring that you cd to the correct directory (in your case, /mnt/c/Users/davis/dev-projects/OpenChatKit/)?

davismartens commented 1 year ago

@LorrinWWW retrieval is present and I'm running from root.

(OpenChatKit) davismartens@LAPTOP-F6477QET:/mnt/c/Users/davis/dev-projects/OpenChatKit$ python inference/bot.py --model togethercomputer/GPT-NeoXT-Chat-Base-20B
Traceback (most recent call last):
  File "/mnt/c/Users/davis/dev-projects/OpenChatKit/inference/bot.py", line 7, in <module>
    import retrieval.wikipedia as wp
ModuleNotFoundError: No module named 'retrieval'

But for some reason bot.py doesn't find the module.

LorrinWWW commented 1 year ago

@davismartens Can you try this? export PYTHONPATH=/mnt/c/Users/davis/dev-projects/OpenChatKit:$PYTHONPATH
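For context: when you run a script, Python puts the script's own directory (here inference/) on sys.path, not the directory you launch from, which is why the sibling retrieval package isn't found even from the repo root. A self-contained reproduction of both behaviors:

```python
import os
import subprocess
import sys
import tempfile

# Minimal stand-in repo: a script inside inference/ imports a sibling
# retrieval/ package, mirroring the layout in the traceback above.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "retrieval"))
open(os.path.join(root, "retrieval", "__init__.py"), "w").close()
os.makedirs(os.path.join(root, "inference"))
script = os.path.join(root, "inference", "bot.py")
with open(script, "w") as f:
    f.write("import retrieval; print('ok')\n")

# Without PYTHONPATH: the import fails even when run from the repo root.
fail = subprocess.run([sys.executable, script], capture_output=True, text=True, cwd=root)
# With PYTHONPATH pointing at the repo root: the import succeeds.
env = dict(os.environ, PYTHONPATH=root)
ok = subprocess.run([sys.executable, script], capture_output=True, text=True, cwd=root, env=env)
print("ModuleNotFoundError" in fail.stderr, ok.stdout.strip())
```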

davismartens commented 1 year ago

@LorrinWWW that resolved one issue but now I'm getting this error:

(OpenChatKit) davismartens@LAPTOP-F6477QET:/mnt/c/Users/davis/dev-projects/OpenChatKit$ export PYTHONPATH=/mnt/c/Users/davis/dev-projects/OpenChatKit:$PYTHONPATH
(OpenChatKit) davismartens@LAPTOP-F6477QET:/mnt/c/Users/davis/dev-projects/OpenChatKit$ python inference/bot.py --model togethercomputer/GPT-NeoXT-Chat-Base-20B
Loading togethercomputer/GPT-NeoXT-Chat-Base-20B to cuda:0...
Traceback (most recent call last):
  File "/home/davismartens/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/transformers/configuration_utils.py", line 616, in _get_config_dict
    resolved_config_file = cached_path(
  File "/home/davismartens/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/transformers/utils/hub.py", line 284, in cached_path
    output_path = get_from_cache(
  File "/home/davismartens/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/transformers/utils/hub.py", line 494, in get_from_cache
    raise EnvironmentError("You specified use_auth_token=True, but a huggingface token was not found.")
OSError: You specified use_auth_token=True, but a huggingface token was not found.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/c/Users/davis/dev-projects/OpenChatKit/inference/bot.py", line 184, in <module>
    main()
  File "/mnt/c/Users/davis/dev-projects/OpenChatKit/inference/bot.py", line 180, in main
    ).cmdloop()
  File "/home/davismartens/miniconda3/envs/OpenChatKit/lib/python3.10/cmd.py", line 105, in cmdloop
    self.preloop()
  File "/mnt/c/Users/davis/dev-projects/OpenChatKit/inference/bot.py", line 63, in preloop
    self._model = ChatModel(self._model_name_or_path, self._gpu_id)
  File "/mnt/c/Users/davis/dev-projects/OpenChatKit/inference/bot.py", line 21, in __init__
    self._model = AutoModelForCausalLM.from_pretrained(
  File "/home/davismartens/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 423, in from_pretrained
    config, kwargs = AutoConfig.from_pretrained(
  File "/home/davismartens/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 725, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/davismartens/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/transformers/configuration_utils.py", line 561, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/davismartens/miniconda3/envs/OpenChatKit/lib/python3.10/site-packages/transformers/configuration_utils.py", line 656, in _get_config_dict
    raise EnvironmentError(
OSError: Can't load config for 'togethercomputer/GPT-NeoXT-Chat-Base-20B'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'togethercomputer/GPT-NeoXT-Chat-Base-20B' is the correct path to a directory containing a config.json file

Seems like I need to pass an HF token somewhere?

LorrinWWW commented 1 year ago

@davismartens That's true, we specified use_auth_token=True.

You can either log in to HF:

pip install --upgrade huggingface_hub
huggingface-cli login

Or, since togethercomputer/GPT-NeoXT-Chat-Base-20B is now publicly available, you can simply remove use_auth_token=True from this line and re-run the inference code.

TX-Yeager commented 1 year ago

@LorrinWWW what is the difference between the default in prepare.py and togethercomputer/GPT-NeoXT-Chat-Base-20B?

LorrinWWW commented 1 year ago

@TX-Yeager It shards the ckpt by layer, which makes pipeline-parallel training more convenient. :)
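As an illustration (this mapping is the natural contiguous split implied by the flags, not code taken from convert_to_hf_gptneox.py): with --n-stages 8 and --n-layer-per-stage 6, stage i owns the six consecutive layers starting at i*6.

```python
# Illustrative only: the contiguous layer split implied by
# --n-stages / --n-layer-per-stage.
def stage_layers(n_stages: int, n_layer_per_stage: int) -> list:
    """For each pipeline stage, return the transformer layer indices it owns."""
    return [
        list(range(i * n_layer_per_stage, (i + 1) * n_layer_per_stage))
        for i in range(n_stages)
    ]

shards = stage_layers(8, 6)  # matches --n-stages 8 --n-layer-per-stage 6
print(len(shards))    # 8
print(shards[0])      # [0, 1, 2, 3, 4, 5]
print(shards[-1])     # [42, 43, 44, 45, 46, 47]
```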