Closed · zky-kf closed 2 months ago
The number of samples must be larger than 20.
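A minimal sketch of how a dummy set can be padded past that threshold by repeating its records (the file path is an assumption; any valid records work):

```python
import json

PATH = "data/biochem_preference_train_1.json"  # assumed location

with open(PATH, encoding="utf-8") as f:
    records = json.load(f)  # the file is a top-level JSON array

# Repeat the existing dummy records until the 20-sample minimum is cleared.
padded = (records * (22 // len(records) + 1))[:22]

with open(PATH, "w", encoding="utf-8") as f:
    json.dump(padded, f, ensure_ascii=False, indent=2)

print(f"wrote {len(padded)} samples")
```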
I see. But after increasing this dataset manually to 22 samples (by repeating the same dummy data), I still get the same error:
```
2024-09-19 03:58:53.575022: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
09/19/2024 03:58:58 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.float16
tokenizer_config.json: 100% 50.6k/50.6k [00:00<00:00, 92.9MB/s]
tokenizer.json: 100% 9.09M/9.09M [00:00<00:00, 11.9MB/s]
special_tokens_map.json: 100% 73.0/73.0 [00:00<00:00, 533kB/s]
[INFO|tokenization_utils_base.py:2269] 2024-09-19 03:59:00,884 >> loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/tokenizer.json
[INFO|tokenization_utils_base.py:2269] 2024-09-19 03:59:00,884 >> loading file added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:2269] 2024-09-19 03:59:00,885 >> loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/special_tokens_map.json
[INFO|tokenization_utils_base.py:2269] 2024-09-19 03:59:00,885 >> loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/tokenizer_config.json
[INFO|tokenization_utils_base.py:2513] 2024-09-19 03:59:01,289 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
09/19/2024 03:59:01 - INFO - llmtuner.data.template - Add pad token: <|end_of_text|>
09/19/2024 03:59:01 - INFO - llmtuner.data.loader - Loading dataset biochem_preference_train_1.json...
Generating train split: 22 examples [00:00, 1025.38 examples/s]
Converting format of dataset (num_proc=16): 100% 22/22 [00:02<00:00, 10.95 examples/s]
Running tokenizer on dataset (num_proc=16): 100% 22/22 [00:03<00:00, 5.94 examples/s]
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/llmtuner/data/loader.py", line 174, in get_dataset
print_function(next(iter(dataset)))
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/llamafactory-cli", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/llmtuner/cli.py", line 65, in main
run_exp()
File "/usr/local/lib/python3.10/dist-packages/llmtuner/train/tuner.py", line 39, in run_exp
run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
File "/usr/local/lib/python3.10/dist-packages/llmtuner/train/dpo/workflow.py", line 29, in run_dpo
dataset = get_dataset(model_args, data_args, training_args, stage="rm", **tokenizer_module)
File "/usr/local/lib/python3.10/dist-packages/llmtuner/data/loader.py", line 176, in get_dataset
raise RuntimeError("Cannot find valid samples, check `data/README.md` for the data format.")
RuntimeError: Cannot find valid samples, check `data/README.md` for the data format.
```
Actually, I tried my original dataset of 1000 samples and it hit the same error. That is why I replaced it with this dummy set, to check whether the problem was an invalid character or something similar. But it turns out the dummy set still does not work TT
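For reference, the `StopIteration` in the traceback comes from calling `next(iter(dataset))` on the processed dataset, i.e. every sample was filtered out during preprocessing, not just some of them. A quick sanity check one could run on the raw file (a sketch; the alpaca-style pairwise layout with `output = [chosen, rejected]` is an assumption based on `data/README.md` for this version):

```python
import json

with open("data/biochem_preference_train_1.json", encoding="utf-8") as f:
    records = json.load(f)

for i, rec in enumerate(records):
    out = rec.get("output")
    # Pairwise data needs a non-empty prompt and exactly two
    # non-empty responses: [chosen, rejected].
    ok = bool(
        isinstance(rec.get("instruction"), str) and rec["instruction"].strip()
        and isinstance(out, list) and len(out) == 2
        and all(isinstance(o, str) and o.strip() for o in out)
    )
    if not ok:
        print(f"record {i} looks malformed: {rec}")
```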
Please update LLaMA-Factory to the latest version; the traceback still imports the old `llmtuner` package, so you are running a very obsolete build. Upgrading (e.g. `pip install -U llamafactory`) or reinstalling from the current source should fix this.
System Info
llamafactory==0.9.0 on Google Colab; the error occurs both from the WebUI and from the command line.
Reproduction
My command
I formulated a dummy preference dataset, biochem_preference_train_1.json, according to the README. The dataset preview in the WebUI was fine, meaning the data path inside dataset_info.json should be alright.
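For illustration, a sketch of one pairwise record and its dataset_info.json entry, assuming the alpaca-style ranking layout from `data/README.md` (the field values are placeholders, not the actual data):

```python
import json

# One hypothetical pairwise record: "output" holds [chosen, rejected].
record = {
    "instruction": "Name one amino acid.",
    "input": "",
    "output": [
        "Glycine is one of the twenty standard amino acids.",  # chosen
        "Iron.",                                               # rejected
    ],
}

with open("data/biochem_preference_train_1.json", "w", encoding="utf-8") as f:
    json.dump([record] * 22, f, ensure_ascii=False, indent=2)

# Matching dataset_info.json entry; "ranking": true marks the file as
# pairwise preference data for the rm/dpo stages.
print(json.dumps({
    "biochem_preference_train_1": {
        "file_name": "biochem_preference_train_1.json",
        "ranking": True,
    }
}, indent=2))
```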
Expected behavior
Start the DPO training.
Others
However, I got the `RuntimeError: Cannot find valid samples` error; the full traceback is in the comments above.