Line length should consider Unicode character width

uranusjr commented 4 years ago

Is your feature request related to a problem? Please describe. A clear and concise

In some cases one would want to put non-ASCII characters in source. This is most common for strings, but in Python 3 you can use non-ASCII identifiers as well.

But consider the following example:

parser.add_argument(
    "poem", help="Poem to process",
    default="ちはやふる神代もきかず竜田川からくれなゐに水くくるとは",
)

The default line is visually long, and I’d like to put it on its own line. But Black would insist formatting it into

parser.add_argument(
    "poem", help="Poem to process", default="ちはやふる神代もきかず竜田川からくれなゐに水くくるとは"
)

since it thinks there is enough space (indeed, the formatted line is only 73 characters long).

Describe the solution you'd like

When calculating line lengths, Black should consult the width attribute of each character, and count double width for characters with EAW F or W.

Describe alternatives you've considered

Having strictly single-width characters in source, and pull in double-width strings at runtime (via i18n or \u sequences. The translation system is usually an overkill cases (e.g. developing an in-house CLI where all users speak a common East Asian language), and replacing all strings with \u make the source illegible to humans.

Usgin fmt: on/off becomes tedious and hard to the eye quickly when you need a lot of string scattered around (e.g. for logging purposes).

Additional context Add any other context or screenshots about the feature request here.

Screenshot 2019-12-09 at 15 00 35

mk-pmb commented 4 years ago

Black should consult the width attribute of each character, and count double width for characters with EAW F or W.

:+1: and I'd be willing to try and implement it if maintainers agree they want the feature.

mk-pmb commented 4 years ago

Related @ maintainers: Would you also want zero-width characters to not add to line length?

JelleZijlstra commented 4 years ago

This makes sense to me in general, but I'm not familiar with the Unicode subtleties involved here.

I think the relevant question for computing line length is how good, Unicode-aware code editors display these characters. So if zero-width characters are actually normally displayed as zero-width, we should count them as zero-width.

I'm also concerned about whether this change could affect performance for the common case of having only normal-width characters.

mk-pmb commented 4 years ago

for the common case of having only normal-width characters.

This is partially a chicken-and-egg problem: In a culture where it would make sense to use lots of fullwidth characters, a lack of tooling support might nudge people to making their code base more "common" in the meaning of "more American".

mk-pmb commented 4 years ago

Still it doesn't change the fact, even in an asian python code base, there will be lots of lines with just American characters. We could, as a first mitigation, scan the line for whether there are any multibyte chars in them at all. Maybe I can even optimize it so only the multibyte chars are checked, not sure whether that will really be faster. Another optimization could be to not actually check each char but only larger chunks so we could make a regexp range. I'll go research Unicode to be able to give better feedback on this.

mk-pmb commented 4 years ago

(Sorry to anyone annoyed by my multi-replies; I do it so people can vote on them separately.)

how good, Unicode-aware code editors display these characters

That differs widely. As I understand the black philosophy, I assume the black answer would be "Anyone who uses such characters should just get editors that can deal with them properly."

However, I personally tend to have more of a lenient "let them configure stuff" attitude. Sometimes all the editors that manage to get the one feature right, are lacking lots of other features. And support for asian-specific bugs is (in my view) very lacking in lots of free software projects, so I think it would not be helpful to present users with yet another case of "your problem almost could have been solved, but…".

uranusjr commented 4 years ago

Thanks for the discussion. Some of my own observations:

I think the relevant question for computing line length is how good, Unicode-aware code editors display these characters. So if zero-width characters are actually normally displayed as zero-width, we should count them as zero-width.

The support for EAW (i.e. single vs double width) is generally quite good. All GUI (including web) ones are very good at this. CLI editors generally rely on the terminal’s support, which is also quite universal these days, since it affects emojis as well.

Support is variant for characters with ambigious or unspecified length. with most engines displaying them as single-width (even zero-width characters). But IMO this is not a problem from a practical standpoint.

People tend to care less (at least I do) when the line wraps too early (i.e. break at 86 instead of 88 characters), and it’s unlikely for a source have a lot of zero-width characters in your string[]. EAW characters, however, make the line wrap too late*, with extremely long lines that cause line overflow or soft wrapping, both very noticible and with significant usability issues.

[*]: Non-printing characters are not a good idea anyway since they are inherently hard to spot, causing difficulties in maintainance. It is more advised to use escape code instead. Combining characters should be counted as zero-width when they are combining in an ideal world, but they are nore difficult since the length depends on context. And again, this is less of a problem in practice since it only causes early wraps, not late.

I'm also concerned about whether this change could affect performance for the common case of having only normal-width characters.

I understand what you’re trying to say, but this expression is slightly, um, annoying (avoiding the r word here). Hopefully you don’t really mean our native languages are not as “normal” as English? 🙂

mk-pmb commented 4 years ago

not a problem from a practical standpoint.

If we simplify the problem to the subset relevant for real-life use cases, and…

People tend to care less (at least I do) when the line wraps too early

… simplify further to accept some amount of too-early line break, I guess we'll end up with a relatively small set of character range expressions and an almost trivial implementation as regexp or filtered list comprehension, whichever will turn out more performant.

Next we'd need to decide which real-life codebases we want to use for the performance testing.

rbanffy commented 3 years ago

I was just bitten by this issue with composing characters. I assume this happens with languages that have accents that don't have corresponding unified accented symbols. In this case, while looking to implement missing symbols from the IBM 3270 terminal, I implemented numbers with dotted overlines that are sometimes displayed in their status line with the "combining four dots above" symbol (u+20dc). In the example below, the line is considered too long because every combined character is counted as two.

composed_glyphs = "Composed glyphs: ÀÉI͂o̓N̈́AͅB̊͆Ȍ͇U͈D̈ẢB̊A̋ĎA̍J̎Ȁ1⃜2⃜3⃜4⃜5⃜6⃜7⃜8⃜9⃜0⃜\n"

I solved it by breaking the line, but I wouldn't say it's pretty.

https://github.com/rbanffy/3270font/blob/710719fb18705137a6440b8cc0c4d0368f690846/test_font_rendering.py#L16-L19

This change to my code is in commit https://github.com/rbanffy/3270font/commit/710719fb18705137a6440b8cc0c4d0368f690846

I'd love to help fixing this. Is anyone working on it?

mk-pmb commented 3 years ago

In case anyone hoped for me, because I was very interested back then – nope, I ditched black meanwhile.

hirmiura commented 1 year ago

bump. I would like this feature too. black日本語

dahlia commented 1 year ago

I tried to address this (partly): https://github.com/psf/black/pull/3445. However, I am not sure if my approach is appropriate and I implemented things properly. I would appreciate if someone take a look at my patch. Thanks!

psf / black

Line length should consider Unicode character width #1197