Closed Gnurro closed 6 months ago
The assumption that encoded == decoded is not always right. Many things can come into play, and specifically for fast tokenizers there is a known discrepancy around added tokens, see #26455. Would recommend comparing the ids, not the string. 🤗
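To illustrate why comparing decoded strings is fragile, here is a minimal sketch with a toy tokenizer (not the real transformers API): decoding inserts spaces around an added special token, so the round-tripped string differs from the input even though the token ids are perfectly stable.

```python
# Toy tokenizer sketch: encode() prepends a BOS token and decode() joins
# tokens with spaces, mimicking the kind of added-token discrepancy fast
# tokenizers can exhibit. The ids are the reliable thing to compare.

class ToyTokenizer:
    def __init__(self):
        self.vocab = {"<s>": 0, "hello": 1, "world": 2}
        self.inv = {i: t for t, i in self.vocab.items()}

    def encode(self, text):
        # Prepend the BOS special token, as many chat templates do.
        return [0] + [self.vocab[w] for w in text.split()]

    def decode(self, ids):
        # Joining with spaces means "<s>" is followed by a space that
        # was never in the original string.
        return " ".join(self.inv[i] for i in ids)

tok = ToyTokenizer()
text = "hello world"          # stand-in for the templated string
ids = tok.encode(text)        # [0, 1, 2]
roundtrip = tok.decode(ids)   # "<s> hello world" -- not equal to text

print(ids)
print(roundtrip == text)      # False: the string round-trip diverges
```

The same id list always decodes to the same string, so comparing `ids` from both code paths sidesteps the spacing differences entirely.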
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
The string returned by apply_chat_template() with tokenize=False does not match the string obtained by encoding with apply_chat_template() and then decoding, when using the Llama 2 chat tokenizer.