Open diodiogod opened 3 weeks ago
Searching in Notepad++ for [\x84\x93\x94] or [\x82\x91\x92] didn't return any results across all 1196 of my .txt files. =(
Just got this error last night when training a LoRA, hadn't happened with the LoRAs I'd trained prior. Search on Google revealed that it's the result of one of the prompt files being encoded in something other than UTF-8. The only file which had been encoded differently had the phrase "The Wizard's", and changing the file back to UTF-8 changed the ' into an invalid character, and since that change training seems to be fine again.
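If it helps anyone hunting for the offending file, here is a minimal sketch that lists the .txt files in a folder that are not valid UTF-8, along with the first bad byte. The function name `find_non_utf8_files` is made up for illustration, and it assumes the captions sit directly in one folder.

```python
from pathlib import Path


def find_non_utf8_files(folder):
    """Return (path, byte_offset, bad_byte) for each .txt file that fails strict UTF-8 decoding."""
    problems = []
    for path in sorted(Path(folder).glob("*.txt")):
        data = path.read_bytes()
        try:
            data.decode("utf-8")  # strict by default: raises on the first invalid byte
        except UnicodeDecodeError as err:
            problems.append((path, err.start, data[err.start]))
    return problems
```

Calling `find_non_utf8_files(".")` and printing each tuple should point to the file and position to inspect. A Windows-1252 smart apostrophe would show up as byte 0x92.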
A quick fix is to use the find and replace function in Notepad++ on the folder containing your training data and replace ' with nothing. As for changing the encoding back to UTF-8, I'm not entirely sure how to do that automatically, but there's probably a way.
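One way the re-encoding could be automated is sketched below. It assumes that any file which is not valid UTF-8 was saved as Windows-1252 (cp1252). That is a guess based on this thread, not something the script can verify. The name `reencode_to_utf8` and the `fallback` parameter are hypothetical.

```python
from pathlib import Path


def reencode_to_utf8(path, fallback="cp1252"):
    """Rewrite one file as UTF-8. Returns True if the file had to be converted."""
    path = Path(path)
    data = path.read_bytes()
    try:
        data.decode("utf-8")
        return False  # already valid UTF-8, leave the file untouched
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 files are Windows-1252. Bytes that cp1252
        # does not define become U+FFFD instead of raising an error.
        text = data.decode(fallback, errors="replace")
    path.write_bytes(text.encode("utf-8"))  # write bytes to keep line endings as-is
    return True
```

Run it on a backup copy first, for example by looping over `Path(".").glob("*.txt")`. Unlike deleting the apostrophe, this keeps it as a proper UTF-8 ’ character.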
I got this error when using Llama 3.1 7b as a captioner with a Joy Caption script. It was outputting the ' (apostrophe) in a non-UTF-8 encoding. After I removed them, training went fine...
Yes, sometimes the file encoding gets reset, depending on whether you opened the file with another program and closed it.
You guys are right. Someone (user:Think) on Discord suggested these two scripts; I ran them on the caption folder and it worked.
I still think this counts as a bug. As a suggestion, the trainer could handle this better in the future.
(Note: these scripts run on the current directory, so run them on a backup copy to avoid messing up your dataset.)
# Script 1: replace common typographic and accented characters with ASCII equivalents.
import os
import string


# Function to replace special characters with basic equivalents
def replace_special_characters(text):
    replacements = {
        '’': "'",
        '‘': "'",
        '“': '"',
        '”': '"',
        '–': '-',
        '—': '-',
        '…': '...',
        'é': 'e',
        'è': 'e',
        'ê': 'e',
        'á': 'a',
        'à': 'a',
        'â': 'a',
        'ó': 'o',
        'ò': 'o',
        'ô': 'o',
        'ú': 'u',
        'ù': 'u',
        'û': 'u',
        'í': 'i',
        'ì': 'i',
        'î': 'i',
        'ç': 'c',
        'ñ': 'n',
        'ß': 'ss',
        'ü': 'u',
        'ö': 'o',
        'ä': 'a',
        'ø': 'o',
        'æ': 'ae',
        # Add more replacements as needed
    }
    # Swap any non-printable-ASCII character that has an entry in the table.
    # Characters without an entry are left unchanged (nothing is removed here).
    printable = set(string.printable)
    return ''.join(c if c in printable else replacements.get(c, c) for c in text)


# Function to process all .txt files in the current directory
def process_text_files():
    for filename in os.listdir('.'):
        if filename.endswith('.txt'):
            # errors='ignore' silently drops any bytes that are not valid UTF-8
            with open(filename, 'r', encoding='utf-8', errors='ignore') as file:
                content = file.read()
            # Replace special characters
            cleaned_content = replace_special_characters(content)
            # Write cleaned content back to the file
            with open(filename, 'w', encoding='utf-8') as file:
                file.write(cleaned_content)


if __name__ == "__main__":
    process_text_files()
# Script 2: strip every remaining non-ASCII character.
import os


def remove_special_characters(text):
    # Keep only ASCII characters (characters with ordinal values from 0 to 127)
    return ''.join(c for c in text if ord(c) < 128)


def process_text_files():
    for filename in os.listdir('.'):
        if filename.endswith('.txt'):
            # errors='ignore' silently drops any bytes that are not valid UTF-8
            with open(filename, 'r', encoding='utf-8', errors='ignore') as file:
                content = file.read()
            # Remove non-ASCII characters
            cleaned_content = remove_special_characters(content)
            # Write cleaned content back to the file
            with open(filename, 'w', encoding='utf-8') as file:
                file.write(cleaned_content)


if __name__ == "__main__":
    process_text_files()
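A note on why the scripts above get rid of the error: both open files with errors='ignore', so any byte that is not valid UTF-8 is dropped while reading, before the replacement step ever sees it. The short sketch below illustrates this with the "The Wizard's" example from earlier in the thread. The variable names are only for illustration.

```python
# The same text stored two ways: Windows-1252 vs. UTF-8.
raw_cp1252 = "The Wizard\u2019s".encode("cp1252")  # smart quote is the single byte 0x92
raw_utf8 = "The Wizard\u2019s".encode("utf-8")     # smart quote is three bytes

# Reading the cp1252 bytes as UTF-8 with errors='ignore' drops the quote entirely.
dropped = raw_cp1252.decode("utf-8", errors="ignore")

# Reading real UTF-8 keeps the quote, so script 1 can map it to a plain apostrophe.
kept = raw_utf8.decode("utf-8", errors="ignore")

print(repr(dropped))  # 'The Wizards'
print(repr(kept))     # 'The Wizard’s'
```

So for a file saved as Windows-1252 the captions end up as "Wizards" rather than "Wizard's". That is fine for training, but worth knowing if the apostrophe matters to you.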
Open Windows Settings > Time & Language > Language & Region > Administrative language settings > Change system locale, then check "Beta: Use Unicode UTF-8 for worldwide language support". This works for me.
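A possible reason this setting helps, assuming the trainer (or a library it calls) opens caption files without an explicit encoding: Python then falls back to the locale's preferred encoding, which on Windows is usually a legacy code page such as cp1252 unless this option or Python's UTF-8 mode is enabled. Here is a small sketch to check what a given machine uses. The function name `describe_default_encodings` is made up for illustration.

```python
import locale
import sys


def describe_default_encodings():
    """Report the encodings Python falls back to when none is given explicitly."""
    return {
        # Used by open() when no encoding= argument is passed
        "locale_preferred": locale.getpreferredencoding(False),
        # Used for file names
        "filesystem": sys.getfilesystemencoding(),
        # True when Python runs in UTF-8 mode (-X utf8 or PYTHONUTF8=1)
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


print(describe_default_encodings())
```

If "locale_preferred" reports something like cp1252, files read without encoding="utf-8" are decoded as Windows-1252 on that machine.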
This is for bugs only
Did you already ask in the discord?
Yes
You verified that this is a bug and not a feature request or question by asking in the discord?
Yes
Describe the bug
I'm getting this error in the middle of training: once at step 399, and a second time at step 1265. ChatGPT says it's related to a single quotation mark (').
Maybe it's a character in the config folder, or in the name of a file or caption? Edit: further googling suggests it's related to the Windows-1252 smart quote (’). I just don't know how to find and replace it...
For a previous LoRA of a person named "Loïc", I used the name as a trigger word and got errors related to the ï character. I had to change it everywhere in the config file, but I left it as it was in the captions and it worked. I also think this is a bug: file names and prompts in the config file should allow this character.
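As a workaround for names like this, here is a rough sketch for deriving an ASCII-only trigger word using Python's standard unicodedata module. The helper name `to_ascii_trigger` is hypothetical and not part of the trainer.

```python
import unicodedata


def to_ascii_trigger(word):
    """Strip diacritics via Unicode decomposition, then drop anything still non-ASCII."""
    # NFKD splits "ï" into "i" followed by a combining diaeresis
    decomposed = unicodedata.normalize("NFKD", word)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.encode("ascii", errors="ignore").decode("ascii")


print(to_ascii_trigger("Loïc"))  # Loic
```

Letters with no decomposition (ß, ø, æ and similar) are simply dropped by this approach, so they would still need explicit replacements.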
Anyway, this is the problem I'm having now (unrelated to Loïc; it's a different LoRA).