ponylang / ponyc

Pony is an open-source, actor-model, capabilities-secure, high performance programming language
http://www.ponylang.io
BSD 2-Clause "Simplified" License

ponyc/packages/format formats wrong width for 2 (or more) bytes utf characters #3125

Open WoodyAtHome opened 5 years ago

WoodyAtHome commented 5 years ago

Format("u" where width = 4) results in a String with len = 4, "u   ", which is OK. Format("ü" where width = 4) results in a String with len = 3, "ü  " on a UTF-8 encoded system, which is not OK.

I realize that's because non-ASCII characters are two bytes long and Format counts bytes rather than characters, but the result still doesn't look very nice.
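To illustrate the byte-vs-character mismatch described above, here is a minimal sketch in Python (not Pony; the names `pad_by_bytes` and `pad_by_chars` are hypothetical helpers, not part of any library). It shows why padding to a byte width leaves multi-byte UTF-8 characters under-padded:

```python
# "ü" is 1 character but 2 bytes in UTF-8, so byte-based padding adds
# one space too few compared to character-based padding.

def pad_by_bytes(s: str, width: int) -> str:
    """Right-pad so the UTF-8 *byte* length reaches `width` (byte counting)."""
    missing = width - len(s.encode("utf-8"))
    return s + " " * max(missing, 0)

def pad_by_chars(s: str, width: int) -> str:
    """Right-pad so the *character* count reaches `width` (what the reporter expects)."""
    return s + " " * max(width - len(s), 0)

print(repr(pad_by_bytes("u", 4)))  # 'u   '  (4 visible characters, OK)
print(repr(pad_by_bytes("ü", 4)))  # 'ü  '   (only 3 visible characters)
print(repr(pad_by_chars("ü", 4)))  # 'ü   '  (4 visible characters)
```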

SeanTAllen commented 5 years ago

What is the result you were expecting?

WoodyAtHome commented 5 years ago

Hi Sean, I expect it to count characters, not bytes. I know that's not easy for all the different encodings. Maybe there could be a fix for UTF-8 first, and later for UTF-16 and so on. Especially for Format, it makes absolutely no sense to rely on bytes.

You can see it here: https://playground.ponylang.io/?gist=e7c5aadcfe82da26a7df67da7d6a6c9c

I would expect

|1234|
|u   |
|ü   |

Priority is not high, of course. But in the longer term we should have a solution.

EpicEric commented 5 years ago

Using the String method codepoints() instead of size() in the Format package would work for this particular case (for "ü", it returns 1 instead of 2) and for other special characters represented by multiple bytes. However, it would still break for Unicode characters with a display width other than one (zero-width spaces, combining diacritics, etc.)

SeanTAllen commented 5 years ago

So, @WoodyAtHome, you would expect the length of each to be "1"? Based on your initial comment and the second one, I'm unclear on what you were expecting.

WoodyAtHome commented 5 years ago

Yes, I would expect the length of each character to be 1. Forget my first comment if it was confusing. @EpicEric got it: Format should use codepoints.

mfelsche commented 5 years ago

As @EpicEric mentioned, simply using codepoints for formatting might lead to unexpected results in languages other than English. If we want to do it properly, we would use grapheme clusters and apply the rules given here to determine the actual formatting width of a given string: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

This is a major effort and requires handling and parsing Unicode data files, as Elixir does, for example: https://github.com/elixir-lang/elixir/tree/master/lib/elixir/unicode

Given that effort, I would be fine with changing the format package to simply use String.codepoints(), provided it clearly and visibly documents this limitation in the face of more complex Unicode constructs.
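As a middle ground between raw codepoint counting and full UAX #29 grapheme-cluster segmentation, a rough width heuristic can skip codepoints that render with zero width. This Python sketch (the function `approx_width` is a hypothetical illustration, not an existing API, and deliberately ignores wide East Asian characters and full grapheme rules) shows the idea:

```python
import unicodedata

def approx_width(s: str) -> int:
    """Rough display width: count codepoints, skipping combining marks and a
    few common zero-width codepoints. NOT a full UAX #29 implementation --
    only a stopgap heuristic in the spirit of the documented limitation."""
    width = 0
    for ch in s:
        if unicodedata.combining(ch):  # combining diacritics render on the base char
            continue
        if ch in ("\u200b", "\u200c", "\u200d", "\ufeff"):  # common zero-width codepoints
            continue
        width += 1
    return width

print(approx_width("\u00fc"))    # 1: precomposed "ü"
print(approx_width("e\u0301"))   # 1: "e" plus combining acute accent
print(approx_width("a\u200bb"))  # 2: zero-width space is skipped
```

A full solution would still need the Unicode data files mentioned above, but a heuristic like this already handles the diacritic and zero-width cases raised earlier in the thread.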