ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0
11.21k stars, 1.19k forks

[BUGFIX] Guard against UnicodeEncodeError when saving validation results in Google Colab environment #3875

Closed by alexsherstinsky 10 months ago

alexsherstinsky commented 10 months ago

Scope

If the textual data in the validation dataset contains non-ASCII characters, saving the results to storage may fail with an error such as:

UnicodeEncodeError: 'ascii' codec can't encode character '\u2260' in position 2699: ordinal not in range(128)

The fix is to defensively encode the contents as UTF-8.
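A minimal sketch of the failure mode and the guard, not the actual change in this PR. It assumes the environment's default text encoding is ASCII, which is simulated here by passing `encoding="ascii"` explicitly; `write_text` and the file name `results.txt` are hypothetical.

```python
import os
import tempfile

# Sample text containing U+2260 (NOT EQUAL TO), the character from the traceback.
text = "a \u2260 b"


def write_text(path, contents, encoding):
    """Hypothetical helper: write text to disk using the given encoding."""
    with open(path, "w", encoding=encoding) as f:
        f.write(contents)


path = os.path.join(tempfile.mkdtemp(), "results.txt")

# Simulate an environment whose default text encoding is ASCII (assumption).
try:
    write_text(path, text, encoding="ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Specifying UTF-8 explicitly makes the write independent of the environment.
write_text(path, text, encoding="utf-8")
with open(path, encoding="utf-8") as f:
    round_trip = f.read()
```

Passing the encoding explicitly means the result no longer depends on the locale of the machine that runs the code.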

Remark

This error was not observed in isolated GPU environments, such as EC2 instances outside of Colab, but this change ensures stable behavior in all environments.

github-actions[bot] commented 10 months ago

Unit Test Results

6 files, 6 suites, 14m 17s
12 tests: 9 passed, 3 skipped, 0 failed
60 runs: 42 passed, 18 skipped, 0 failed

Results for commit 150265e9.

justinxzhao commented 10 months ago

It's probably worth specifying the encoding everywhere we are writing files to disk!
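One way this suggestion could be applied, as a sketch only: route file writes through small helpers that always pass `encoding="utf-8"`. The helper names `save_json_utf8` and `load_json_utf8`, the sample record, and the file name are hypothetical and are not part of Ludwig.

```python
import json
import os
import tempfile


def save_json_utf8(path, data):
    """Hypothetical helper: write JSON with an explicit UTF-8 encoding."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        # instead of escaping them as \uXXXX sequences.
        json.dump(data, f, ensure_ascii=False)


def load_json_utf8(path):
    """Hypothetical helper: read JSON back with the same explicit encoding."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


record = {"prediction": "x \u2260 y"}
path = os.path.join(tempfile.mkdtemp(), "validation_results.json")
save_json_utf8(path, record)
loaded = load_json_utf8(path)

# Inspect the raw bytes to confirm the character was stored as UTF-8.
with open(path, "rb") as f:
    raw = f.read()
```

Centralizing the `open` calls this way keeps the encoding choice in one place, so individual call sites cannot silently fall back to the platform default.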