trapexit / mergerfs-tools

Optional tools to help manage data in a mergerfs pool
ISC License
385 stars 44 forks source link

UnicodeEncodeError in mergerfs.balance #106

Closed cryzed closed 4 years ago

cryzed commented 4 years ago

Hey, I have seen that this was an issue before and that you tried to solve it by replacing sys.stdout and sys.stderr with a TextIOWrapper that hopefully just surrogateescapes encoding errors... but this didn't work for me for some files, causing me to just wrap the print lines in try-except UnicodeEncodeError blocks.

When I took a closer look, I realized there's something strange going on. You shouldn't need the TextIOWrapper workaround at all. Every path that find_a_file (os.walk) returns should be a regular string (Unicode codepoints), i.e. it was already successfully decoded by Python at this point. If you then try to re-encode them through the TextIOWrapper instance that uses locale.getpreferredencoding(False) as its encoding, the encoding might actually fail, because that's not necessarily an encoding that the filesystem itself uses (I think). What is strange however is, that my result for locale.getprefferencoding(False) definitely returns UTF-8, so I'm confused how the encoding error could have arisen in the first place (I definitely don't have filenames/paths that fall outside the UTF-8 table). It feels like something is messing with the set locale while the script is running... rsync maybe? I saw in its source code that it likes to set LC_TYPE to an empty string, but I'm not sure if this would result in a different locale.getprefferencoding(False) output.

My suggestion would be to just force UTF-8 encoding in the TextIOWrapper instances, which IMHO should be able to encode every string you throw at it into bytes -- even if the output in the console might look a bit funny then (because of a locale mismatch), it's better than the script crashing. Maybe this isn't a good idea, I'm not sure. I unfortunately can't test this super well, because I'm already "past" the files that caused errors, and didn't take note of their names. Maybe you have some input on this?

trapexit commented 4 years ago

Funny this should come up for you as someone else ran into similar with my other tool https://github.com/trapexit/scorch/issues/38

And yeah... I addressed it by forcing utf8 in the textiowrapper. It appears to work in their case as far as I think I reproduced it. Been waiting for confirmation before adding it to all my python tools.

cryzed commented 4 years ago

Yes it seems really likely now that something is messing with the locale settings during the runtime of mergerfs.balance, causing locale.getprefferencoding(False) to return "ASCII" instead of the actually configured locale. A possibly workaround might be to store the output of locale.getprefferencoding(False) before you call into other code or run any subprocess and then use the output value from that for TextIOWrappers encoding argument... or maybe just forcing UTF-8 is smarter. You could also not use surrogateescape and use backslashreplace instead. It's a bit confusing what's happening here.

trapexit commented 4 years ago

I'm not sure why I went with surrogateescape over backslashreplace. Might have just been what I found while searching.

trapexit commented 4 years ago

https://raw.githubusercontent.com/trapexit/mergerfs-tools/encoding/src/mergerfs.balance

cryzed commented 4 years ago

Thanks, this seems to work -- but as I said, I'm currently not re-balancing the same files I was re-balancing earlier. I'll keep an eye on it for now.