openlm-research / open_llama

OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset
Apache License 2.0
7.29k stars 372 forks

the code indentations disappear. Is there any way to solve it? #43

Closed joytianya closed 1 year ago

joytianya commented 1 year ago

When fine-tuning on code data downstream with https://github.com/young-geng/EasyLM/tree/main, there is a significant issue. Code usually uses spaces for indentation, and the result is that the indentation disappears. Is there any way to solve this?

The code comes out without indentation, for example:

def bubble_sort(arr):
 n = len(arr)
 for i in range(n-1):
 for j in range(n-i-1):
 if arr[j] > arr[j+1]:
 arr[j], arr[j+1] = arr[j+1], arr[j]
 return arr
danielhanchen commented 1 year ago

This is a duplicate of #40 :) Can you check what has already been posted there? :)

young-geng commented 1 year ago

This is indeed a mistake on our side: we misconfigured the tokenizer to remove repeated spaces. I've updated that configuration, and the tokenizer should now preserve all spaces. Please try the updated tokenizer.
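One way to check whether a tokenizer preserves repeated spaces is to encode a snippet of indented code, decode it again, and compare the result with the original. The sketch below is only an illustration: `preserves_whitespace`, `collapsing_encode`, `preserving_encode`, and `toy_decode` are hypothetical names, and the two toy encoders are character-level stand-ins that mimic the misconfigured and the fixed behavior. They are not the OpenLLaMA tokenizer.

```python
import re
from typing import Callable, List

# Indented sample; the repeated spaces are what the check is about.
SAMPLE = "def f(x):\n    if x:\n        return 1\n    return 0\n"


def preserves_whitespace(
    encode: Callable[[str], List[str]],
    decode: Callable[[List[str]], str],
    text: str,
) -> bool:
    """Return True if encoding and then decoding reproduces the text exactly."""
    return decode(encode(text)) == text


def collapsing_encode(text: str) -> List[str]:
    # Toy stand-in (assumption): collapses runs of spaces into a single
    # space before splitting into characters, like the misconfiguration.
    return list(re.sub(r" {2,}", " ", text))


def preserving_encode(text: str) -> List[str]:
    # Toy stand-in (assumption): keeps every character, including spaces.
    return list(text)


def toy_decode(tokens: List[str]) -> str:
    # Inverse of the character-level toy encoders above.
    return "".join(tokens)


if __name__ == "__main__":
    print(preserves_whitespace(collapsing_encode, toy_decode, SAMPLE))
    print(preserves_whitespace(preserving_encode, toy_decode, SAMPLE))
```

With a real tokenizer, the same round-trip check could be run by passing that tokenizer's own encode and decode callables instead of the toy ones. Whether an exact round trip is expected also depends on how the tokenizer normalizes text, so a mismatch is a hint to inspect the output rather than proof of this particular bug.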

joytianya commented 1 year ago

https://huggingface.co/openlm-research/open_llama_7b/blob/main/tokenizer.model Do I just need to update this file?

young-geng commented 1 year ago

Yeah