sambanova / generative_data_prep


Prompt_prefix not interpreted correctly #84

Open snova-bol opened 8 months ago

snova-bol commented 8 months ago

When tokenizing with a prompt_prefix that contains `\n`, it is not tokenized correctly by the Llama tokenizer. Somehow each `\n` ends up as a literal `\\n` in the tokenized output.
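A minimal way to see the mismatch (a hypothetical repro, assuming a Hugging Face Llama-family tokenizer such as `meta-llama/Llama-2-7b-hf`, not this repo's code):

```python
from transformers import AutoTokenizer

# Hypothetical repro: any Llama-family tokenizer shows the same mismatch.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

real_newline = "\n<|user|>\n"           # what the user intends
literal_backslash_n = "\\n<|user|>\\n"  # backslash + "n", two literal characters

# The two strings tokenize to different IDs, so the model never sees the
# intended newline-delimited template.
print(tok.encode(real_newline, add_special_tokens=False))
print(tok.encode(literal_backslash_n, add_special_tokens=False))
```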

I added `--prompt_prefix "\n<|user|>\n" --prompt_postfix "</s>\n<|assistant|>\n"` to my script, but the decoded data looks like this: `</s> \\n<|assistant|>\\n`
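If the cause is shell escaping (an assumption: the shell delivers the two characters `\` and `n`, and argparse does not unescape them), one workaround is to unescape the flag values before tokenizing. A minimal sketch; `unescape_cli_arg` is a hypothetical helper, not part of generative_data_prep:

```python
import codecs

def unescape_cli_arg(arg: str) -> str:
    """Turn literal escape sequences like backslash + n into real control chars.

    Hypothetical helper, not part of generative_data_prep. unicode_escape
    round-trips the string through latin-1, so it is only safe for ASCII
    prompt templates like the ones above.
    """
    return codecs.decode(arg, "unicode_escape")

# argparse hands us backslash + n (two characters), not a newline:
raw_prefix = "\\n<|user|>\\n"
print(repr(unescape_cli_arg(raw_prefix)))  # -> '\n<|user|>\n'
```

Alternatively, bash's ANSI-C quoting passes real newlines without any code change: `--prompt_prefix $'\n<|user|>\n' --prompt_postfix $'</s>\n<|assistant|>\n'`.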