huggingface / autotrain-advanced

🤗 AutoTrain Advanced
https://huggingface.co/autotrain
Apache License 2.0
3.76k stars 458 forks source link

[BUG] Syntax Error when using chat template #707

Open LordSyd opened 1 month ago

LordSyd commented 1 month ago

Prerequisites

Backend

Hugging Face Space/Endpoints

Interface Used

UI

CLI Command

No response

UI Screenshots & Parameters

image I also tried "tokenizer", no difference

Error Logs

ERROR | 2024-07-19 09:48:57 | autotrain.trainers.common:wrapper:120 - train has failed due to an exception: Traceback (most recent call last): File "/app/env/lib/python3.10/site-packages/autotrain/trainers/common.py", line 117, in wrapper return func(*args, kwargs) File "/app/env/lib/python3.10/site-packages/autotrain/trainers/clm/main.py", line 28, in train train_sft(config) File "/app/env/lib/python3.10/site-packages/autotrain/trainers/clm/train_clm_sft.py", line 17, in train train_data, valid_data = utils.process_data_with_chat_template(config, tokenizer, train_data, valid_data) File "/app/env/lib/python3.10/site-packages/autotrain/trainers/clm/utils.py", line 448, in process_data_with_chat_template train_data = train_data.map( File "/app/env/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 602, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, *kwargs) File "/app/env/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 567, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, args, kwargs) File "/app/env/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3156, in map for rank, done, content in Dataset._map_single(dataset_kwargs): File "/app/env/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3517, in _map_single example = apply_function_on_filtered_inputs(example, i, offset=offset) File "/app/env/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs processed_inputs = function(fn_args, additional_args, fn_kwargs) File "/app/env/lib/python3.10/site-packages/autotrain/trainers/clm/utils.py", line 238, in apply_chat_template messages = ast.literal_eval(messages) File "/app/env/lib/python3.10/ast.py", line 64, in literal_eval node_or_string = parse(node_or_string.lstrip(" \t"), mode='eval') File "/app/env/lib/python3.10/ast.py", line 50, in parse return compile(source, filename, mode, flags, File "", line 1 ah ****, du brauchst kommunikativere Freunde xD ^^^^ SyntaxError: invalid syntax

ERROR | 2024-07-19 09:48:57 | autotrain.trainers.common:wrapper:121 - invalid syntax (, line 1)

Additional Information

I censored the swear word in the error log myself.

This is how the csv is structured: (just the first line as an example)

id,text
1,"Human: ah ****, du brauchst kommunikativere Freunde xD Assistant: Ja xD"

I am not really sure what format the data should be in if I use a chat template in Autotrain.

Does it still need {text: "text"} as format? Or should it be in the format the chattemplate would expect?

Thanks for your assistance

abhishekkrthakur commented 1 month ago

if you using chat template, data should be in chat template format as jsonl file. see this dataset's messages column: https://huggingface.co/datasets/HuggingFaceH4/no_robots

LordSyd commented 1 month ago

Ok, so could me using csv be the problem here? I just spotted that the error seems to point at the comma ( invalid syntax (, line 1) ). Also thank you for the link!

abhishekkrthakur commented 1 month ago

not really. you can use both csv and jsonl formats. it would be difficult to correctly use csv containing jsons so its not recommended.

the dataset format you have, doesnt require any chat template. you can train it without any chat template. however, most models have special tags and tokens and if you dont apply them, you might not get desired results.

LordSyd commented 1 month ago

Ok, that last info might also be relevant. I got infinite generation using this dataset's trained model, so the missing tokens might also be the problem. That's why I wanted to use a chat template on the dataset.

I will reformat the JSON I have with the training data to fit the chat template and convert it to JSONL to try it again. So this seems to be my error.

Still, the documentation is kind of misleading regarding the file format. On this page: https://huggingface.co/docs/autotrain/v0.8.4/en/llm_finetuning

Up top it tells you that CSV is accepted, but down at the bottom of the page is mentioned that it accepts CSV and JSONL, which is quite confusing. That's why I converted everything to CSV, to fit it into the mentioned format.

abhishekkrthakur commented 1 month ago

sorry about that. it should say both csv and jsonl are accepted for most tasks. ill fix it

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 30 days with no activity.