joytianya closed this issue 1 year ago
This is indeed a mistake on our side: we had misconfigured the tokenizer to remove repeated spaces. I've updated that configuration, and the tokenizer should now preserve all spaces. Please try the updated tokenizer.
https://huggingface.co/openlm-research/open_llama_7b/blob/main/tokenizer.model Do I just need to update this file?
Yeah
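For anyone who wants to confirm that a tokenizer preserves repeated spaces, here is a minimal round-trip check. It is only a sketch: `preserves_whitespace`, `toy_decode`, `preserving_encode`, and `collapsing_encode` are hypothetical names made up for illustration, and the two toy encoders are character-level stand-ins, not the real OpenLLaMA tokenizer. The collapsing one just mimics the misconfiguration described above.

```python
import re

# Indented sample; the four-space indent is what the misconfiguration destroyed.
SAMPLE = "def f(x):\n    return x + 1\n"


def preserves_whitespace(encode, decode, text):
    """Return True if decoding the encoded text reproduces it exactly."""
    return decode(encode(text)) == text


def toy_decode(tokens):
    # Hypothetical decoder for the toy encoders below: join characters back.
    return "".join(tokens)


def preserving_encode(text):
    # Toy character-level encoder that keeps every space.
    return list(text)


def collapsing_encode(text):
    # Toy encoder that collapses runs of spaces into a single space,
    # imitating a tokenizer configured to remove repeated spaces.
    return list(re.sub(r" {2,}", " ", text))


if __name__ == "__main__":
    print("preserving:", preserves_whitespace(preserving_encode, toy_decode, SAMPLE))
    print("collapsing:", preserves_whitespace(collapsing_encode, toy_decode, SAMPLE))
```

To try this on the real model file, one option (assuming the `sentencepiece` package is installed and `tokenizer.model` has been downloaded locally) is to build `sp = sentencepiece.SentencePieceProcessor(model_file="tokenizer.model")` and pass `sp.encode` and `sp.decode` to `preserves_whitespace`. An exact round trip can also fail for reasons unrelated to this issue, so treat a mismatch as a prompt to compare the two strings rather than as proof.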
When fine-tuning on code data downstream with https://github.com/young-geng/EasyLM/tree/main, this causes significant problems. Code usually uses spaces for indentation, so the result is that the indentation disappears. Is there any way to solve this?
the code without indentation such as