ostris / ai-toolkit

Various AI scripts. Mostly Stable Diffusion stuff.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 975: invalid start byte #113

Open · diodiogod opened this issue 3 weeks ago

diodiogod commented 3 weeks ago

This is for bugs only

Did you already ask in the discord?

Yes

You verified that this is a bug and not a feature request or question by asking in the discord?

Yes

Describe the bug

I'm getting this error in the middle of training. Once at step 399, the second time at step 1265. ChatGPT says it's related to a single quotation mark (').

Maybe it's a character in the config folder, or in the name of a file or a caption? Edit: after further googling, it's related to the smart quote (’) from Windows-1252. I just don't know how to find and replace it...

My previous LoRA, for a person named "Loïc" whose name I used as the trigger word, had errors related to the ï character. I had to change it everywhere in the config file, but in the captions I left it as it was and it worked. I also think this is a bug: the file name and the prompt in the config file should allow this character to be used.

Anyway, this is the problem I'm having now (not related to Loïc); it's a different LoRA:

Traceback (most recent call last):
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\run.py", line 90, in <module>
    main()
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\run.py", line 86, in main
    raise e
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\run.py", line 78, in main
    job.run()
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\jobs\ExtensionJob.py", line 22, in run
    process.run()
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\jobs\process\BaseSDTrainProcess.py", line 1667, in run
    batch = next(dataloader_iterator)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\venv\Lib\site-packages\torch\utils\data\dataloader.py", line 630, in __next__    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\venv\Lib\site-packages\torch\utils\data\dataloader.py", line 673, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\venv\Lib\site-packages\torch\utils\data\_utils\fetch.py", line 54, in fetch
    data = self.dataset[possibly_batched_index]
           ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\venv\Lib\site-packages\torch\utils\data\dataset.py", line 350, in __getitem__    return self.datasets[dataset_idx][sample_idx]
           ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\toolkit\data_loader.py", line 539, in __getitem__
    return [self._get_single_item(idx) for idx in idx_list]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\toolkit\data_loader.py", line 527, in _get_single_item
    file_item.load_caption(self.caption_dict)
  File "J:\Aitools\Ostris_tools\2\ai-toolkit\toolkit\dataloader_mixins.py", line 305, in load_caption
    prompt = f.read()
             ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 975: invalid start byte
diodiogod commented 3 weeks ago

Searching in Notepad++ for [\x84\x93\x94] or [\x82\x91\x92] didn't give me any results across all 1196 of my txt files. =(

setothegreat commented 3 weeks ago

Just got this error last night when training a LoRA; it hadn't happened with the LoRAs I'd trained prior. A Google search revealed that it's the result of one of the prompt files being encoded in something other than UTF-8. The only file encoded differently contained the phrase "The Wizard's"; changing the file back to UTF-8 turned the ' into an invalid character, and since that change, training seems to be fine again.

A quick way to fix it is to use the find-and-replace function in Notepad++ on the folder containing your training data, replacing ' with nothing. As for changing the encoding back to UTF-8, I'm not entirely sure how to do that automatically, but there's probably a way.

futureflix87 commented 3 weeks ago

I got this error when using llama 3.1 7b as a captioner with a JoyCaption script. It was printing a ' (apostrophe) in a non-UTF-8 format. After I removed them, training went fine...

WarAnakin commented 3 weeks ago

Yes, sometimes the file's encoding gets reset, depending on whether you opened the file with another program and closed it.

diodiogod commented 3 weeks ago

You guys are right. Someone (user: Think) on Discord suggested it, and I ran the two scripts below on the caption folder and it worked.

I still think this falls into the bug category; as a suggestion, the trainer could handle it better in the future.

("note these scripts would run on the current directory, so run them on a backup copy to risk messing up your dataset.")

import os
import string

# Function to replace special characters with basic equivalents
def replace_special_characters(text):
    replacements = {
        '’': "'",
        '‘': "'",
        '“': '"',
        '”': '"',
        '–': '-',
        '—': '-',
        '…': '...',
        'é': 'e',
        'è': 'e',
        'ê': 'e',
        'á': 'a',
        'à': 'a',
        'â': 'a',
        'ó': 'o',
        'ò': 'o',
        'ô': 'o',
        'ú': 'u',
        'ù': 'u',
        'û': 'u',
        'í': 'i',
        'ì': 'i',
        'î': 'i',
        'ç': 'c',
        'ñ': 'n',
        'ß': 'ss',
        'ü': 'u',
        'ö': 'o',
        'ä': 'a',
        'ø': 'o',
        'æ': 'ae',
        # Add more replacements as needed
    }

    # Swap characters found in the replacements dictionary; printable ASCII
    # passes through unchanged, and any other character is kept as-is.
    # (Bytes that were not valid UTF-8 are already dropped by the
    # errors='ignore' read in process_text_files, so none reach this point.)
    printable = set(string.printable)
    result = ''.join(replacements.get(c, c) if c not in printable else c for c in text)

    return result

# Function to process all .txt files in the current directory
def process_text_files():
    for filename in os.listdir('.'):
        if filename.endswith('.txt'):
            with open(filename, 'r', encoding='utf-8', errors='ignore') as file:
                content = file.read()

            # Replace special characters
            cleaned_content = replace_special_characters(content)

            # Write cleaned content back to the file
            with open(filename, 'w', encoding='utf-8') as file:
                file.write(cleaned_content)

if __name__ == "__main__":
    process_text_files()

And the second script, which strips any remaining non-ASCII characters:

import os

def remove_special_characters(text):
    # Keep only ASCII characters (characters with ordinal values from 0 to 127)
    return ''.join(c if ord(c) < 128 else '' for c in text)

def process_text_files():
    for filename in os.listdir('.'):
        if filename.endswith('.txt'):
            with open(filename, 'r', encoding='utf-8', errors='ignore') as file:
                content = file.read()

            # Remove non-ASCII characters
            cleaned_content = remove_special_characters(content)

            # Write cleaned content back to the file
            with open(filename, 'w', encoding='utf-8') as file:
                file.write(cleaned_content)

if __name__ == "__main__":
    process_text_files()
airobinnet commented 2 weeks ago

#128 should fix this

xFoolery commented 2 weeks ago

Open Windows Settings, go to Time & Language > Language & Region > Administrative Language Settings > Change System Locale, and check "Beta: Use Unicode UTF-8 for worldwide language support". This works for me.