Open LordSyd opened 1 month ago
if you using chat template, data should be in chat template format as jsonl file. see this dataset's messages column: https://huggingface.co/datasets/HuggingFaceH4/no_robots
Ok, so could me using csv be the problem here? I just spotted that the error seems to point at the comma ( invalid syntax (, line 1) ). Also thank you for the link!
not really. you can use both csv and jsonl formats. it would be difficult to correctly use csv containing jsons so its not recommended.
the dataset format you have, doesnt require any chat template. you can train it without any chat template. however, most models have special tags and tokens and if you dont apply them, you might not get desired results.
Ok, that last info might also be relevant. I got infinite generation using this dataset's trained model, so the missing tokens might also be the problem. That's why I wanted to use a chat template on the dataset.
I will reformat the JSON I have with the training data to fit the chat template and convert it to JSONL to try it again. So this seems to be my error.
Still, the documentation is kind of misleading regarding the file format. On this page: https://huggingface.co/docs/autotrain/v0.8.4/en/llm_finetuning
Up top it tells you that CSV is accepted, but down at the bottom of the page is mentioned that it accepts CSV and JSONL, which is quite confusing. That's why I converted everything to CSV, to fit it into the mentioned format.
sorry about that. it should say both csv and jsonl are accepted for most tasks. ill fix it
This issue is stale because it has been open for 30 days with no activity.
Prerequisites
Backend
Hugging Face Space/Endpoints
Interface Used
UI
CLI Command
No response
UI Screenshots & Parameters
I also tried "tokenizer", no difference
Error Logs
ERROR | 2024-07-19 09:48:57 | autotrain.trainers.common:wrapper:120 - train has failed due to an exception: Traceback (most recent call last): File "/app/env/lib/python3.10/site-packages/autotrain/trainers/common.py", line 117, in wrapper return func(*args, kwargs) File "/app/env/lib/python3.10/site-packages/autotrain/trainers/clm/main.py", line 28, in train train_sft(config) File "/app/env/lib/python3.10/site-packages/autotrain/trainers/clm/train_clm_sft.py", line 17, in train train_data, valid_data = utils.process_data_with_chat_template(config, tokenizer, train_data, valid_data) File "/app/env/lib/python3.10/site-packages/autotrain/trainers/clm/utils.py", line 448, in process_data_with_chat_template train_data = train_data.map( File "/app/env/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 602, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, *kwargs) File "/app/env/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 567, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, args, kwargs) File "/app/env/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3156, in map for rank, done, content in Dataset._map_single(dataset_kwargs): File "/app/env/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3517, in _map_single example = apply_function_on_filtered_inputs(example, i, offset=offset) File "/app/env/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs processed_inputs = function(fn_args, additional_args, fn_kwargs) File "/app/env/lib/python3.10/site-packages/autotrain/trainers/clm/utils.py", line 238, in apply_chat_template messages = ast.literal_eval(messages) File "/app/env/lib/python3.10/ast.py", line 64, in literal_eval node_or_string = parse(node_or_string.lstrip(" \t"), mode='eval') File "/app/env/lib/python3.10/ast.py", line 50, in parse return compile(source, filename, mode, flags, File "", line 1
ah ****, du brauchst kommunikativere Freunde xD
^^^^
SyntaxError: invalid syntax
ERROR | 2024-07-19 09:48:57 | autotrain.trainers.common:wrapper:121 - invalid syntax (, line 1)
Additional Information
I censored the swear word in the error log myself.
This is how the csv is structured: (just the first line as an example)
I am not really sure what format the data should be in if I use a chat template in Autotrain.
Does it still need {text: "text"} as format? Or should it be in the format the chattemplate would expect?
Thanks for your assistance