Open WoodyAtHome opened 5 years ago
What is the result you were expecting?
Hi Sean, I expect characters to be counted, not bytes. I know that's not easy across all the different encodings. Maybe there could first be a fix for UTF-8, and later for UTF-16 and so on. Especially for "Format", it makes absolutely no sense to rely on bytes.
You can see it here: https://playground.ponylang.io/?gist=e7c5aadcfe82da26a7df67da7d6a6c9c
I would expect
|1234|
|u   |
|ü   |
Priority is not high, of course. But in the longer term we should have a solution.
Using the String method codepoints() instead of size() in the Format package would work for this particular case (for ü, it returns 1 instead of 2) and for other special characters represented by multiple bytes. However, it would still break for Unicode characters whose display width is not one (zero-width space, diacritics, etc.).
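To make the difference concrete, here is a small sketch in Python (not Pony; used only because the same byte-versus-code-point distinction is easy to show there). The helpers byte_len and codepoint_len are hypothetical names that stand in for what size() and codepoints() are described as returning above.

```python
# Illustration only: Python stand-ins for a byte count and a code point count.
# byte_len / codepoint_len are hypothetical helper names, not Pony APIs.

def byte_len(s: str) -> int:
    """Number of bytes in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))


def codepoint_len(s: str) -> int:
    """Number of Unicode code points in s (Python's len on str)."""
    return len(s)


samples = {
    "plain u": "u",
    "precomposed u-umlaut": "\u00fc",          # one code point, two UTF-8 bytes
    "u + combining diaeresis": "u\u0308",      # two code points, one visible character
    "a, zero-width space, b": "a\u200bb",      # three code points, two visible characters
}

for label, text in samples.items():
    print(f"{label}: bytes={byte_len(text)} codepoints={codepoint_len(text)}")
```

The last two samples are the cases where counting code points still disagrees with what is visible on screen.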
So, @WoodyAtHome, you would expect the length of each to be "1"? Based on your initial comment and the second one, I'm unclear about what you were expecting.
Yes, I would expect the length of each character to be 1. Forget my first comment, if you don't understand.
@EpicEric got it, format should use codepoints.
As @EpicEric mentioned, simply using codepoints for formatting might lead to unexpected results with languages other than English. If we want to do it properly, we'd use grapheme clusters and apply the rules given here to determine the actual formatting width of a given string: https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
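As a rough idea of what goes beyond code point counting, the following Python sketch approximates display width from the character properties in Python's standard unicodedata module. It is not an implementation of the grapheme cluster rules in the linked report, and approx_width is a hypothetical helper name. It only skips combining marks and format characters and counts East Asian wide characters as two columns.

```python
import unicodedata


def approx_width(s: str) -> int:
    """Rough display width of s in terminal columns.

    Assumption: combining marks (Mn, Me) and format characters (Cf, which
    includes the zero-width space) take no column, East Asian wide or
    fullwidth characters take two, and everything else takes one.
    This is a heuristic, not the Unicode grapheme cluster algorithm.
    """
    width = 0
    for ch in s:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # zero-width: contributes no column
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


for text in ["u", "\u00fc", "u\u0308", "a\u200bb", "\u65e5\u672c"]:
    print(repr(text), "codepoints:", len(text), "approx width:", approx_width(text))
```

Even this heuristic needs Unicode property tables, which hints at why a complete solution means shipping and parsing Unicode data.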
This is a major effort and requires handling and parsing Unicode data files, as e.g. Elixir does: https://github.com/elixir-lang/elixir/tree/master/lib/elixir/unicode
Given that effort, I would be fine with changing the format package to simply use String.codepoints(), provided it clearly and visibly documents this limitation in the face of more complex Unicode constructs.
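A minimal sketch of that compromise, again in Python rather than Pony: pad to a width measured in code points and state the limitation in the documentation. pad_by_codepoints is a hypothetical name, and this is not how the format package is implemented.

```python
def pad_by_codepoints(s: str, width: int, fill: str = " ") -> str:
    """Left-align s in a field of `width` code points.

    Limitation (documented on purpose): width is measured in Unicode code
    points, not in displayed columns. Strings containing combining marks,
    zero-width characters, or wide characters may still look misaligned.
    """
    missing = max(0, width - len(s))  # len() on a Python str counts code points
    return s + fill * missing


print("|" + pad_by_codepoints("u", 4) + "|")
print("|" + pad_by_codepoints("\u00fc", 4) + "|")        # precomposed: aligns
print("|" + pad_by_codepoints("u\u0308", 4) + "|")       # combining mark: one column short
```

The third line shows the documented limitation: the string is four code points long but occupies only three columns.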
Format("u" where width = 4) results in a String with len = 4, "u   ". That is OK.
Format("ü" where width = 4) results in a String with len = 3, "ü  ", on a UTF-8 encoded system. That is not OK.
I realize that's because a non-ASCII character such as ü takes 2 bytes in UTF-8 and Format counts bytes rather than characters, but the end result doesn't look very nice.
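The reported behavior can be modeled in a few lines of Python. This is only a model of byte-based padding under the assumption described above (width compared against the UTF-8 byte count); pad_by_bytes is a hypothetical name, not the actual Format code.

```python
def pad_by_bytes(s: str, width: int, fill: str = " ") -> str:
    """Left-align s, measuring its length in UTF-8 bytes (the reported behavior)."""
    missing = max(0, width - len(s.encode("utf-8")))
    return s + fill * missing


for text in ["u", "\u00fc"]:
    padded = pad_by_bytes(text, 4)
    # Character count of the padded result: 4 for "u", 3 for the u-umlaut.
    print("|" + padded + "|", "characters:", len(padded))
```

Because ü already counts as two of the four units, only two fill characters are appended, which gives the three-character result described in the report.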