Multi-width unicode characters are not supported

muesli4 commented 4 years ago

Unicode has some characters that have different widths even in a monospace font (multiples of the base width is my guess). In that case, cell formatting comes out wrong because it assumes that all characters have the same width.

It is unclear how one could determine the width of a unicode character. Sometimes it even seems to depend on the locale.

This can be fixed solely within the Cell type class because the algorithms rely only on that. Cutting within a character is an issue. In that case it is possible to replace it with spaces (in the drop functions). Unfortunately, all operations now require linear time.
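A minimal sketch of the idea, not the library's actual Cell instance: measure in columns instead of characters, and when a cut falls inside a wide character, emit spaces for the columns that remain. The names charWidth, visibleLength and dropLeftCols are hypothetical, and the width table is a rough approximation covering only a few well-known wide ranges.

```haskell
-- Hypothetical, heavily simplified width table (an assumption for this
-- sketch). A real table would be generated from Unicode data.
charWidth :: Char -> Int
charWidth c
  | c >= '\x1100' && c <= '\x115F' = 2 -- Hangul Jamo initial consonants
  | c >= '\x4E00' && c <= '\x9FFF' = 2 -- CJK Unified Ideographs
  | c >= '\xAC00' && c <= '\xD7A3' = 2 -- Hangul syllables
  | c >= '\xFF01' && c <= '\xFF60' = 2 -- fullwidth forms
  | otherwise = 1

-- Width in terminal columns; needs a full pass over the string.
visibleLength :: String -> Int
visibleLength = sum . map charWidth

-- Drop n columns from the left. If the cut lands inside a wide
-- character, that character is replaced by spaces for the leftover columns.
dropLeftCols :: Int -> String -> String
dropLeftCols n s | n <= 0 = s
dropLeftCols _ [] = []
dropLeftCols n (c : cs)
  | w <= n    = dropLeftCols (n - w) cs
  | otherwise = replicate (w - n) ' ' ++ cs
  where w = charWidth c
```

Both functions walk the string character by character, which is where the linear cost mentioned above comes from.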

ony commented 4 years ago

@simonmichael says in this comment:

As it says: "From Pandoc." I guess Pandoc gets it from the official Unicode standard.

Looks like Pandoc extracted charWidth to doclayout but still uses hard-coded values without providing information about their origin. It should be the Unicode standard, but which version?

@simonmichael, I think both libraries (this and hledger) can benefit from doclayout, but it is not yet on Stackage.

muesli4 commented 4 years ago

What I read is that those things are not completely standardized. With different locales there is ambiguity for some characters and it also depends on the font.

Another solution would be to use https://github.com/JuliaStrings/utf8proc/blob/20672dba69bf463be22f6c9c216d858c9d116bb6/utf8proc.h#L646 but that adds utf8proc as dependency.

ony commented 4 years ago

Another solution would be to use https://github.com/JuliaStrings/utf8proc/blob/20672dba69bf463be22f6c9c216d858c9d116bb6/utf8proc.h#L646 but that adds utf8proc as dependency.

This falls under the same category: "parse the Unicode report". First they convert EastAsianWidth.txt to CharWidths.txt with this parser, and then they generate a table in C code.
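To illustrate what "parse the Unicode report" amounts to, here is a rough sketch that reads single data lines in the EastAsianWidth.txt style (a code point or a lo..hi range, a semicolon, a width class, then an optional # comment). It is not the utf8proc parser. Entry, parseLine and columnsFor are hypothetical names, and treating only W and F as two columns (with ambiguous A as narrow) is an assumption of this sketch.

```haskell
import Data.Char (isSpace)
import Numeric (readHex)

-- One entry: inclusive code point range and its East Asian Width class.
data Entry = Entry { rangeLo :: Int, rangeHi :: Int, eaw :: String }
  deriving (Eq, Show)

-- Parse one data line such as "1100..115F;W  # ..." or "3000 ; F # ...".
-- Returns Nothing for comments, blank lines and malformed input.
parseLine :: String -> Maybe Entry
parseLine raw =
  case break (== ';') (takeWhile (/= '#') raw) of
    (range, ';' : cls) -> do
      (lo, hi) <- parseRange (trim range)
      let c = trim cls
      if null c then Nothing else Just (Entry lo hi c)
    _ -> Nothing

-- Either a single code point "3000" or a range "1100..115F".
parseRange :: String -> Maybe (Int, Int)
parseRange s =
  case break (== '.') s of
    (a, '.' : '.' : b) -> (,) <$> hex a <*> hex b
    (a, "")            -> (\x -> (x, x)) <$> hex a
    _                  -> Nothing

-- Parse a complete hexadecimal number, rejecting trailing garbage.
hex :: String -> Maybe Int
hex s = case readHex s of
  [(n, "")] -> Just n
  _         -> Nothing

trim :: String -> String
trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

-- Assumption: wide (W) and fullwidth (F) take two columns, the rest one.
columnsFor :: String -> Int
columnsFor cls
  | cls `elem` ["W", "F"] = 2
  | otherwise             = 1
```

A generated lookup table would then be built by mapping parseLine over the file's lines and keeping the Just results.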

ony commented 4 years ago

What I read is that those things are not completely standardized. With different locales there is ambiguity for some characters and it also depends on the font.

This is "self-driving car"... When many applications do not adhere to a standard, this is called absence of a standard. This way applications start to add quirks at the tangent points between each other instead of following a common interface. By giving up on conforming to Unicode recommendations you contribute to that.

hasufell commented 3 years ago

I guess I have this problem in ghcup:

[image: ghcup-table-layout]

code is here: https://gitlab.haskell.org/haskell/ghcup-hs/-/blob/master/app/ghcup/Main.hs#L1411-1458

muesli4 commented 3 years ago

@hasufell Are you using multi-width characters? Because it doesn't seem that way. It seems this is caused by the backend-specific control characters (see #4). Please have a look at the documentation of the Formatted type. You may be able to write an instance of System.Console.Pretty that uses Formatted. Unfortunately, in the implementation the format instructions are not separate from the text. But this is necessary to measure the text width. However, it should be relatively easy to refactor this in the library.

If your problem is not related to multi-width characters and the Formatted type does not solve your problem, would you be so kind as to open a new issue?

edit: I just noticed that there is #11 which may be relevant to your use-case. You could also put the values on different lines, then the per-cell color is not an issue.

muesli4 commented 3 years ago

What I read is that those things are not completely standardized. With different locales there is ambiguity for some characters and it also depends on the font.

This is "self-driving car"... When many applications do not adhere to a standard, this is called absence of a standard. This way applications start to add quirks at the tangent points between each other instead of following a common interface. By giving up on conforming to Unicode recommendations you contribute to that.

@ony Trust me, I want to adhere to the standard as much as possible. In fact, that is the main reason why I do not want to adopt it yet for the default instance. If you can show me that there is a standardized way to determine the character width of unicode characters, I will be the first to accept it. That doesn't mean we can't write an instance at all in the meantime. Contributions are welcome and I'm happy to work on this together or provide any support that is necessary.

hasufell commented 3 years ago

Please have a look at the documentation of the Formatted type. You may be able to write an instance of System.Console.Pretty that uses Formatted. Unfortunately, in the implementation the format instructions are not separate from the text. But this is necessary to measure the text width. However, it should be relatively easy to refactor this in the library.

Sorry, I can't really follow this or how to fix it.

You could also put the values on different lines, then the per-cell color is not an issue.

That's not a possibility

simonmichael commented 3 years ago

@muesli4, FWIW: there are some helpers (charWidth, strWidth, textWidth, stripAnsi) in hledger-lib which could give inspiration for this and #11.
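For the control-character side of the problem, the general idea can be sketched as follows. This is not hledger's implementation, only an illustration that reuses the name stripAnsi. It assumes that only CSI sequences occur (ESC, then '[', then parameters, then a final byte between '@' and '~'), which covers the usual colour codes. visibleWidth is a hypothetical helper.

```haskell
-- Remove CSI escape sequences such as "\ESC[31m" so that they do not
-- count towards the measured width. Other escape kinds are left untouched.
stripAnsi :: String -> String
stripAnsi ('\ESC' : '[' : rest) =
  -- Skip parameter and intermediate bytes, then the final byte itself.
  stripAnsi (drop 1 (dropWhile (not . isFinal) rest))
  where
    isFinal c = c >= '@' && c <= '~'
stripAnsi (c : cs) = c : stripAnsi cs
stripAnsi []       = []

-- Width of the text as displayed, ignoring colour codes. This counts one
-- column per remaining character, so wide characters need extra handling.
visibleWidth :: String -> Int
visibleWidth = length . stripAnsi
```

Padding would then be computed from visibleWidth while the original string, escape codes included, is what gets printed.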

hasufell commented 3 years ago

Yes, the functions @simonmichael describes work well. I dropped my use of table-layout and reimplemented simple row-column padding with that: https://gitlab.haskell.org/haskell/ghcup-hs/-/commit/40a1cc98c6ea7eb06eeca7a37915a5075451420b#c84b8cca7fc11e84e49df98e5e56e35d46791361_1560_1558

ony commented 3 years ago

@ony .... If you can show me that there is a standardized way to determine the character width of unicode characters, I will be the first to accept it. ...

I think I gave some links already in my comment to https://github.com/simonmichael/hledger/pull/905 . Related standards:

Unicode Standard Annex #11 for East Asian Width, which defines relative width in fonts, and the associated table EastAsianWidth.txt.
The C function wcwidth, which is part of the POSIX.1-2001 and POSIX.1-2008 standards and gives some meaning to the relation between "width" and terminal columns. It should be available on any system that promises that, including Linux, FreeBSD, Windows, macOS. For Haskell we have the unmaintained wcwidth package and, as far as I know, no other Haskell standard libraries that provide this information.

I agree that there is no clear standard "Unicode for terminals".
But if you look into my comment where I traced the origins of code similar to what @hasufell adopted from hledger, which in turn adopted it from pandoc, you'll see how others try to adhere to Unicode and bypass wcwidth. Since this library is quite generic and needs to pad with spaces to align to specific columns, I thought it might want to implement it properly, or better, spin off a dependency that provides a terminal-specific interpretation of Unicode, or help in reviving the Haskell bindings to wcwidth.

P.S. This cross-repo thread started with cheese 🧀 (part of Unicode 8) in someone's financial report.
P.P.S. My memory tells me that I also went through the Julia language, but it is not mentioned in my comment :( . Anyway, check https://github.com/JuliaStrings/utf8proc/issues/114 for an example where they refer to EastAsianWidth.txt.
P.P.P.S. @hasufell, tight/tough world of Haskell developers/enthusiasts using Exherbo Linux.

ony commented 3 years ago

There is more hardship with "ZERO WIDTH JOINER", which may turn 5 "characters" into a single glyph if both the terminal and the font support it. To make it predictable we may want to strip ZWJ.
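Stripping could look roughly like this (stripZWJ is a hypothetical name, not an existing library function). U+200D is ZERO WIDTH JOINER. A sequence such as U+1F468 ZWJ U+1F469 ZWJ U+1F467 has five code points, and after stripping only the three emoji remain, each of which is then rendered and measured on its own.

```haskell
-- ZERO WIDTH JOINER, U+200D.
zwj :: Char
zwj = '\x200D'

-- Remove every ZWJ so that joined sequences fall apart into their
-- individual characters, whose widths can be summed predictably.
stripZWJ :: String -> String
stripZWJ = filter (/= zwj)
```

The trade-off is that a terminal which would have drawn one joined glyph now shows the separate emoji instead.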

Xitian9 commented 2 years ago

Would you be open to a PR incorporating the functionality @ony and @simonmichael suggested above? Is the blocker here just the work needed to do it, or are there other considerations?

muesli4 commented 2 years ago

I am sorry for the delay with this issue, I simply do not have a lot of time at the moment. However, I created a type class Cell that was intended to be used for this purpose. The functionality may be implemented as a (parametrized) newtype for now either in this library or another one. Perhaps, it is better to provide it as another library, then we do not need to change the dependencies (if there are any).

Would you be open to a PR incorporating the functionality @ony and @simonmichael suggested above?

I am happy to accept pull requests. However, it would be good if you could give an idea of your implementation.

Is the blocker here just the work needed to do it, or are there other considerations?

When I was looking into this I read that there is not really a standard and it seemed more like a hack to me that sometimes works and other times not. But I may be wrong and I don't exactly remember. But then again, if we provide this as an opt-in feature I see no problem at all.

muesli4 commented 2 years ago

I manually added the changes from the pull request.

muesli4 / table-layout

Multi-width unicode characters are not supported #8