meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default & custom datasets for applications such as summarization and Q&A, and a number of inference solutions such as HF TGI and vLLM for local or cloud deployment. Demo apps showcase Meta Llama for WhatsApp & Messenger.

Finetuning on a custom dataset #366

Closed. pankajtalk closed this issue 1 month ago.

pankajtalk commented 7 months ago

System Info

Versions, from the install log:

2024-01-10 08:35:17 - Successfully installed bitsandbytes-0.39.1 black-23.12.1 brotli-1.1.0 inflate64-1.0.0 llama-recipes-0.0.1 multivolumefile-0.2.3 pathspec-0.12.1 peft-0.6.0.dev0 py7zr-0.20.6 pybcj-1.0.2 pycryptodomex-3.19.1 pyppmd-1.0.0 pyzstd-0.15.9 texttable-1.7.0 tokenize-rt-5.2.0 tomli-2.0.1 torch-2.1.0+cu118 triton-2.1.0

Finetuning command being executed:

torchrun --nnode=4 --nproc_per_node=1 --rdzv_backend=c10d --rdzv_endpoint=10.0.1.14:29400 --rdzv_conf=read_timeout=600 examples/finetuning.py --dataset "custom_dataset" --custom_dataset.file "/mnt/scripts/custom_dataset.py" --enable_fsdp --use_peft --peft_method lora --pure_bf16 --mixed_precision --batch_size_training 1 --model_name $MODEL_NAME --output_dir /home/datascience/outputs --num_epochs 1 --save_model

Information

🐛 Describe the bug

I am using the command below to finetune the "Llama-2-7b-hf" model on a custom dataset. I have passed the --dataset and --custom_dataset.file params to the finetuning.py script.

torchrun examples/finetuning.py --enable_fsdp --dataset custom_dataset --custom_dataset.file /mnt/scripts/custom_dataset.py --use_peft --peft_method lora --pure_bf16 --mixed_precision --batch_size_training 1 --model_name $MODEL_NAME --output_dir /home/datascience/outputs --num_epochs 1 --save_model

However, I am running into the error captured under Error logs below. Am I missing something?


Error logs

2024-01-10 08:38:31 - Traceback (most recent call last):
2024-01-10 08:38:31 -   File "/home/datascience/decompressed_artifact/code/examples/finetuning.py", line 8, in <module>
2024-01-10 08:38:31 -     fire.Fire(main)
2024-01-10 08:38:31 -   File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
2024-01-10 08:38:31 -     component_trace = _Fire(component, args, parsed_flag_args, context, name)
2024-01-10 08:38:32 -   File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
2024-01-10 08:38:32 -     component, remaining_args = _CallAndUpdateTrace(
2024-01-10 08:38:32 -   File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
2024-01-10 08:38:32 -     component = fn(*varargs, **kwargs)
2024-01-10 08:38:32 -   File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/llama_recipes/finetuning.py", line 160, in main
2024-01-10 08:38:32 -     dataset_config = generate_dataset_config(train_config, kwargs)
2024-01-10 08:38:32 -   File "/home/datascience/conda/pytorch20_p39_gpu_v2/lib/python3.9/site-packages/llama_recipes/utils/config_utils.py", line 56, in generate_dataset_config
2024-01-10 08:38:32 -     assert train_config.dataset in names, f"Unknown dataset: {train_config.dataset}"
2024-01-10 08:38:32 - AssertionError: Unknown dataset: custom_dataset
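
For context, the assertion is raised by generate_dataset_config in llama_recipes/utils/config_utils.py, which only accepts dataset names registered in its preprocessing table. A simplified sketch of the failing check (assuming names is derived from the keys of DATASET_PREPROC):

# The --dataset value must match a registered dataset name, or the assert fires.
names = tuple(DATASET_PREPROC.keys())
assert train_config.dataset in names, f"Unknown dataset: {train_config.dataset}"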

Expected behavior

Finetuning should work with a custom dataset.

HamidShojanazeri commented 7 months ago

@pankajtalk this works on my end; just want to make sure you have already installed llama-recipes, right?

torchrun --nnode=1 --nproc_per_node=8 examples/finetuning.py --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py" --enable_fsdp --use_peft --peft_method lora --pure_bf16 --mixed_precision --batch_size_training 1 --model_name $MODEL_PATH --output_dir /home/datascience/outputs --num_epochs 1 --save_model
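
For reference, the file passed via --custom_dataset.file is expected to expose a get_custom_dataset(dataset_config, tokenizer, split) function that returns a tokenized dataset. A minimal sketch of that interface, where my_data.json and its "text" field are hypothetical placeholders:

import datasets

def get_custom_dataset(dataset_config, tokenizer, split):
    # Hypothetical local JSON file; any loader that yields a datasets.Dataset works.
    dataset = datasets.load_dataset("json", data_files="my_data.json", split="train")

    def tokenize(sample):
        # Minimal tokenization only; a real recipe would also construct labels.
        return tokenizer(sample["text"])

    return dataset.map(tokenize, remove_columns=list(dataset.features))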
pankajtalk commented 7 months ago

@HamidShojanazeri The script invocation works fine for me if I do not specify the --dataset and --custom_dataset.file params; the samsum dataset is used in that case. Once I specify --dataset and --custom_dataset.file, I get the error I mentioned, i.e. "Unknown dataset: custom_dataset" from config_utils.py.

Is there any param I can add to triage it further?

pankajtalk commented 7 months ago

I think llama-recipes v0.0.1 (which seems to be the latest release) does not include custom dataset support. I checked dataset_utils.py in the installed package and see

DATASET_PREPROC = {
    "alpaca_dataset": partial(get_alpaca_dataset, max_words=224),
    "grammar_dataset": get_grammar_dataset,
    "samsum_dataset": get_samsum_dataset,
}

whereas the one at https://github.com/facebookresearch/llama-recipes/blob/main/src/llama_recipes/utils/dataset_utils.py has

DATASET_PREPROC = {
    "alpaca_dataset": partial(get_alpaca_dataset),
    "grammar_dataset": get_grammar_dataset,
    "samsum_dataset": get_samsum_dataset,
    "custom_dataset": get_custom_dataset,
}
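
The missing "custom_dataset" entry maps to a loader that resolves the --custom_dataset.file path at runtime. Condensed from the main-branch dataset_utils.py (simplified here; the optional ":func_name" suffix handling is as I understand it), it works roughly like this:

import importlib.util
from pathlib import Path

def get_custom_dataset(dataset_config, tokenizer, split):
    # The file argument may carry an optional ":function" suffix; otherwise
    # a function named get_custom_dataset is expected in the module.
    if ":" in dataset_config.file:
        module_path, func_name = dataset_config.file.split(":")
    else:
        module_path, func_name = dataset_config.file, "get_custom_dataset"

    # Import the user's .py file as a module and delegate to its function.
    spec = importlib.util.spec_from_file_location(Path(module_path).stem, module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, func_name)(dataset_config, tokenizer, split)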
pankajtalk commented 7 months ago

As per https://pypi.org/project/llama-recipes/#history, the only release of llama-recipes was on Sep 7, 2023. Any plans to release a newer version with the latest code?

HamidShojanazeri commented 7 months ago

@pankajtalk we are working on finalizing the release. In the meantime, can you please install from source with pip install -e .?
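
Assuming a standard checkout of the repo, the install-from-source steps would look like:

git clone https://github.com/facebookresearch/llama-recipes.git
cd llama-recipes
pip install -e .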

HamidShojanazeri commented 7 months ago

> @HamidShojanazeri The script invocation works fine for me if I do not specify the --dataset and --custom_dataset.file params; the samsum dataset is used in that case. Once I specify --dataset and --custom_dataset.file, I get the error I mentioned, i.e. "Unknown dataset: custom_dataset" from config_utils.py.
>
> Is there any param I can add to triage it further?

I believe it should run the OpenAssistant dataset.
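
For context, the examples/custom_dataset.py referenced in the command above pulls OpenAssistant's oasst1 dataset from the Hugging Face Hub; condensed, its entry point starts roughly like this:

import datasets

def get_custom_dataset(dataset_config, tokenizer, split):
    # Load the OpenAssistant oasst1 dataset; the real example then builds and
    # tokenizes conversation threads, which this sketch leaves out.
    return datasets.load_dataset("OpenAssistant/oasst1", split=split)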