ocean-data-factory-sweden / kso

Notebooks to upload/download marine footage, connect to a citizen science project, train machine learning models and publish marine biological observations.
GNU General Public License v3.0

Notebook 5: locale.getpreferredencoding() gets changed during training, so the notebook cannot train again or run the evaluation part. #148

Closed Diewertje11 closed 1 year ago

Diewertje11 commented 1 year ago

When you run Notebook 5 and request the preferred encoding at the beginning, or just before the cell that calls train.run(...), you get 'UTF-8' (using the code below):

import locale
locale.getpreferredencoding()

However, when you run the same thing after the training cell, it returns 'ANSI_X3.4-1968' (which is ASCII). So somewhere during the training performed by the YOLOv5 code, this default gets changed. This causes an error when reading the names in the train.txt or valid.txt file when you train again or run the validation, since these files contain Swedish letters in the case of the template project:

Exception: train: Error loading data from /content/koster_yolov4/tutorials/ml-template-data/train.txt: 'ascii' codec can't decode byte 0xc3 in position 31: ordinal not in range(128)

This comes from line 470 in /content/koster_yolov4/yolov5/utils/dataloaders.py where the text file is opened with open(). This open() function uses the default encoding, and ASCII cannot read the ä.
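Since the failure comes from open() falling back to the locale's preferred encoding, one possible workaround is to pass the encoding explicitly wherever the list files are read. The sketch below is not the YOLOv5 code; read_lines and the demo file name are hypothetical and only illustrate the difference between decoding the same bytes as ASCII and as UTF-8.

```python
import tempfile
from pathlib import Path


def read_lines(path, encoding="utf-8"):
    """Read a list file with an explicit encoding instead of the locale default."""
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]


# Hypothetical demo file containing a Swedish letter, written as UTF-8.
tmp_dir = Path(tempfile.mkdtemp())
list_file = tmp_dir / "train.txt"
list_file.write_text("images/räka_001.jpg\n", encoding="utf-8")

# Decoding as ASCII fails on the first byte of the UTF-8 encoded "ä".
try:
    read_lines(list_file, encoding="ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding as UTF-8 works no matter what the locale reports.
utf8_lines = read_lines(list_file)
print("ASCII decode succeeded:", ascii_ok)
print("Lines read as UTF-8:", len(utf8_lines))
```

Applying this inside the notebook would mean patching the open() call in the data loader, which is a change to third-party code and would need to be re-applied after updates.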

We have not located exactly where this change in locale is made. We could not find anything in the YOLOv5 code when searching with git grep for ANSI, locale, encoding, ASCII, coding. Only the file utils/mws/mime.sh does something with ASCII, but we do not think this file gets used.

Solutions would be either to prevent this change, if we can locate where it is made, or to set the encoding back to the correct default each time. However, we have not yet found a command that can set it back. We have tried the following:

So it seems like there are two different encoding settings: a system-wide one that stays at UTF-8 and is not changed, and a locale one that gets changed. However, trying to change this back gives an error:

NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968
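The two settings described above can be inspected side by side with a small diagnostic. This is only a sketch: the issue does not say which calls were used to read the system-wide value, so sys.getdefaultencoding() and sys.getfilesystemencoding() are assumptions here, and encoding_report is a hypothetical helper name.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python reports from different sources."""
    return {
        # Default encoding for str.encode()/bytes.decode(); always 'utf-8' in Python 3.
        "sys.getdefaultencoding": sys.getdefaultencoding(),
        # Encoding used for file names passed to the operating system.
        "sys.getfilesystemencoding": sys.getfilesystemencoding(),
        # Locale-based encoding; False means "do not call setlocale() first".
        "locale.getpreferredencoding": locale.getpreferredencoding(False),
    }


report = encoding_report()
for name, value in report.items():
    print(f"{name}: {value}")
```

Running this once before and once after the training cell should show which of the values actually changes.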

The ways we have tried to set it back:

The code below seems to set it back, but it does not solve the issue when training or validating, so it apparently only changes what the function returns rather than the encoding that is actually used.

import locale

def getpreferredencoding(do_setlocale=True):
    # Always report UTF-8, regardless of the real locale setting.
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding

locale.getpreferredencoding()

To have the template project working for the workshop on 02-03-2023, we simply changed the names of the files so that they do not contain any ä or other Swedish letters.
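A renaming workaround like this could be scripted roughly as follows. This is a hedged sketch and not the procedure that was actually used for the workshop; to_ascii_name and the example file name are hypothetical.

```python
import unicodedata


def to_ascii_name(name):
    """Return an ASCII-only version of a file name.

    Accented letters are decomposed into a base letter plus a combining
    mark (for example "ä" becomes "a" plus a diaeresis), and everything
    that is not ASCII is then dropped. Letters without a decomposition
    are removed entirely, so results should be checked by hand.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Hypothetical file name containing Swedish letters.
original = "räka_å_ö.jpg"
renamed = to_ascii_name(original)
print(renamed.isascii())
```

The paths listed in train.txt and valid.txt would have to be rewritten with the same function so they still match the renamed files.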

Diewertje11 commented 1 year ago

The bot had closed this issue, but I tested it, and the same problem occurs. So it is not solved yet.

jannesgg commented 2 days ago

Close for now. Re-open if necessary. YOLOV5 code.