hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

"Cannot find valid samples" when running DPO on llama3-8b #5455

Closed · zky-kf closed this issue 2 months ago

zky-kf commented 2 months ago

System Info

llamafactory==0.9.0, running on Google Colab; the error occurs with both the WebUI and the command line.

Reproduction

My command

!CUDA_VISIBLE_DEVICES=0 llamafactory-cli train \
    --stage dpo \
    --do_train True \
    --model_name_or_path meta-llama/Meta-Llama-3-8B \
    --preprocessing_num_workers 16 \
    --finetuning_type lora \
    --template default \
    --flash_attn auto \
    --dataset_dir /dir-to-preference-data \
    --dataset biochem_train \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --packing False \
    --report_to none \
    --output_dir saves/LLaMA3-8B/lora/lora_dpo_3 \
    --fp16 True \
    --plot_loss True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target q_proj,v_proj \
    --dpo_beta 0.1 \
    --dpo_ftx 0

I built a dummy preference dataset, biochem_preference_train_1.json, following the README. The dataset preview in the WebUI rendered fine, so the data path should be correct.

[
  {
    "instruction": "You are an assistant for multiple choice tasks.Your output should include ONE correct answer (A or B or C or D) and your explanation.Use the following retrieved context to answer the question.Use three sentences maximum and keep the answer concise. ",
    "input": "Question: 增强子的作用特点是(  )。(A) 只作用于真核细胞中(B) 有严格的专一性(C) 无需与蛋白质因子结合就能增强转录作用(D) 作用无方向性",
    "chosen": "The correct answer is (D) No direction of action.Explanation: The passage states that enhancers can regulate genes at long distances (usually 1 to 4 kb or even 30 kb away), and they do not have a specific direction, meaning their function is not directional. They can still function when the sequence is reversed, unlike promoters which lose their function when the direction is reversed.",
    "rejected": "The correct answer is C) 无需与蛋白质因子结合就能增强转录作用.Explanation: According to the context, enhancers (增强子) do not require protein factors to bind in order to enhance transcription. This is stated in point (5): 需要有启动子才能发挥作用,没有启动子存在,增强子不能表现活性 which translates to Enhancers need a promoter to play their role, and without a promoter, they cannot manifest activity."
  },
  {
    "instruction": "You are an assistant for multiple choice tasks.Your output should include ONE correct answer (A or B or C or D) and your explanation.Use the following retrieved context to answer the question.Use three sentences maximum and keep the answer concise. ",
    "input": "Question: 增强子的作用特点是(  )。(A) 只作用于真核细胞中(B) 有严格的专一性(C) 无需与蛋白质因子结合就能增强转录作用(D) 作用无方向性",
    "chosen": "The correct answer is (D) No direction of action.Explanation: The passage states that enhancers can regulate genes at long distances (usually 1 to 4 kb or even 30 kb away), and they do not have a specific direction, meaning their function is not directional. They can still function when the sequence is reversed, unlike promoters which lose their function when the direction is reversed.",
    "rejected": "The correct answer is C) 无需与蛋白质因子结合就能增强转录作用.Explanation: According to the context, enhancers (增强子) do not require protein factors to bind in order to enhance transcription. This is stated in point (5): 需要有启动子才能发挥作用,没有启动子存在,增强子不能表现活性 which translates to Enhancers need a promoter to play their role, and without a promoter, they cannot manifest activity."
  },
  {
    "instruction": "You are an assistant for multiple choice tasks.Your output should include ONE correct answer (A or B or C or D) and your explanation.Use the following retrieved context to answer the question.Use three sentences maximum and keep the answer concise. ",
    "input": "Question: 增强子的作用特点是(  )。(A) 只作用于真核细胞中(B) 有严格的专一性(C) 无需与蛋白质因子结合就能增强转录作用(D) 作用无方向性",
    "chosen": "The correct answer is (D) No direction of action.Explanation: The passage states that enhancers can regulate genes at long distances (usually 1 to 4 kb or even 30 kb away), and they do not have a specific direction, meaning their function is not directional. They can still function when the sequence is reversed, unlike promoters which lose their function when the direction is reversed.",
    "rejected": "The correct answer is C) 无需与蛋白质因子结合就能增强转录作用.Explanation: According to the context, enhancers (增强子) do not require protein factors to bind in order to enhance transcription. This is stated in point (5): 需要有启动子才能发挥作用,没有启动子存在,增强子不能表现活性 which translates to Enhancers need a promoter to play their role, and without a promoter, they cannot manifest activity."
  },
  {
    "instruction": "You are an assistant for multiple choice tasks.Your output should include ONE correct answer (A or B or C or D) and your explanation.Use the following retrieved context to answer the question.Use three sentences maximum and keep the answer concise. ",
    "input": "Question: 增强子的作用特点是(  )。(A) 只作用于真核细胞中(B) 有严格的专一性(C) 无需与蛋白质因子结合就能增强转录作用(D) 作用无方向性",
    "chosen": "The correct answer is (D) No direction of action.Explanation: The passage states that enhancers can regulate genes at long distances (usually 1 to 4 kb or even 30 kb away), and they do not have a specific direction, meaning their function is not directional. They can still function when the sequence is reversed, unlike promoters which lose their function when the direction is reversed.",
    "rejected": "The correct answer is C) 无需与蛋白质因子结合就能增强转录作用.Explanation: According to the context, enhancers (增强子) do not require protein factors to bind in order to enhance transcription. This is stated in point (5): 需要有启动子才能发挥作用,没有启动子存在,增强子不能表现活性 which translates to Enhancers need a promoter to play their role, and without a promoter, they cannot manifest activity."
  }
]

Inside dataset_info.json:

{
    "biochem_train": {
        "file_name": "biochem_preference_train_1.json",
        "ranking": true,
        "columns": {
            "prompt": "instruction",
            "query": "input",
            "chosen": "chosen",
            "rejected": "rejected"
        }
    }
}
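
As a quick sanity check, every record in the file can be verified against the column mapping above. A minimal sketch (not part of the original report), assuming jq is available in the Colab environment; it should print true:

# Check that every record carries the four columns mapped in dataset_info.json.
jq 'all(.[]; has("instruction") and has("input") and has("chosen") and has("rejected"))' \
  /dir-to-preference-data/biochem_preference_train_1.json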

Expected behavior

Start the DPO training.

Others

However, I got the error below:

2024-09-17 07:06:38.389902: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-09-17 07:06:38.408209: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-17 07:06:38.429936: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-17 07:06:38.436541: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-17 07:06:38.452401: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-09-17 07:06:39.674862: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
09/17/2024 07:06:44 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.float16
[INFO|tokenization_utils_base.py:2269] 2024-09-17 07:06:44,968 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/tokenizer.json
[INFO|tokenization_utils_base.py:2269] 2024-09-17 07:06:44,969 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2269] 2024-09-17 07:06:44,969 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/special_tokens_map.json
[INFO|tokenization_utils_base.py:2269] 2024-09-17 07:06:44,969 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/tokenizer_config.json
[INFO|tokenization_utils_base.py:2513] 2024-09-17 07:06:45,376 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
09/17/2024 07:06:45 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
09/17/2024 07:06:45 - INFO - llmtuner.data.loader - Loading dataset biochem_preference_train_1.json...
Generating train split: 5 examples [00:00, 262.65 examples/s]
num_proc must be <= 5. Reducing num_proc to 5 for dataset of size 5.
Converting format of dataset (num_proc=5): 100% 5/5 [00:00<00:00, 30.75 examples/s]
num_proc must be <= 5. Reducing num_proc to 5 for dataset of size 5.
Running tokenizer on dataset (num_proc=5): 100% 5/5 [00:01<00:00,  3.28 examples/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/llmtuner/data/loader.py", line 174, in get_dataset
    print_function(next(iter(dataset)))
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/llmtuner/cli.py", line 65, in main
    run_exp()
  File "/usr/local/lib/python3.10/dist-packages/llmtuner/train/tuner.py", line 39, in run_exp
    run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/usr/local/lib/python3.10/dist-packages/llmtuner/train/dpo/workflow.py", line 29, in run_dpo
    dataset = get_dataset(model_args, data_args, training_args, stage="rm", **tokenizer_module)
  File "/usr/local/lib/python3.10/dist-packages/llmtuner/data/loader.py", line 176, in get_dataset
    raise RuntimeError("Cannot find valid samples, check `data/README.md` for the data format.")
RuntimeError: Cannot find valid samples, check `data/README.md` for the data format.
hiyouga commented 2 months ago

The dataset should contain more than 20 samples.

zky-kf commented 2 months ago

I see. But after manually increasing this dataset to 22 samples (by repeating the same dummy data), I still get the same error:

2024-09-19 03:58:53.575022: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
09/19/2024 03:58:58 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.float16
tokenizer_config.json: 100% 50.6k/50.6k [00:00<00:00, 92.9MB/s]
tokenizer.json: 100% 9.09M/9.09M [00:00<00:00, 11.9MB/s]
special_tokens_map.json: 100% 73.0/73.0 [00:00<00:00, 533kB/s]
[INFO|tokenization_utils_base.py:2269] 2024-09-19 03:59:00,884 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/tokenizer.json
[INFO|tokenization_utils_base.py:2269] 2024-09-19 03:59:00,884 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2269] 2024-09-19 03:59:00,885 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/special_tokens_map.json
[INFO|tokenization_utils_base.py:2269] 2024-09-19 03:59:00,885 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/tokenizer_config.json
[INFO|tokenization_utils_base.py:2513] 2024-09-19 03:59:01,289 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
09/19/2024 03:59:01 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
09/19/2024 03:59:01 - INFO - llmtuner.data.loader - Loading dataset biochem_preference_train_1.json...
Generating train split: 22 examples [00:00, 1025.38 examples/s]
Converting format of dataset (num_proc=16): 100% 22/22 [00:02<00:00, 10.95 examples/s]
Running tokenizer on dataset (num_proc=16): 100% 22/22 [00:03<00:00,  5.94 examples/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/llmtuner/data/loader.py", line 174, in get_dataset
    print_function(next(iter(dataset)))
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/llmtuner/cli.py", line 65, in main
    run_exp()
  File "/usr/local/lib/python3.10/dist-packages/llmtuner/train/tuner.py", line 39, in run_exp
    run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/usr/local/lib/python3.10/dist-packages/llmtuner/train/dpo/workflow.py", line 29, in run_dpo
    dataset = get_dataset(model_args, data_args, training_args, stage="rm", **tokenizer_module)
  File "/usr/local/lib/python3.10/dist-packages/llmtuner/data/loader.py", line 176, in get_dataset
    raise RuntimeError("Cannot find valid samples, check `data/README.md` for the data format.")
RuntimeError: Cannot find valid samples, check `data/README.md` for the data format.

Actually, I first tried my original dataset of 1,000 samples and hit the same error. That is why I replaced it with this dummy dataset, to check whether the problem was caused by invalid characters or similar issues. But it turns out the dummy set still does not work.

hiyouga commented 2 months ago

Please update llamafactory to the latest version; you are using a very obsolete version of llamafactory.
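
A minimal upgrade sketch, following the installation steps in the project README (the extras in brackets may differ for your setup):

# Install the latest LLaMA-Factory from source.
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"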