Open RayDillinger opened 2 years ago
As a CJK user, monospace has never really meant monospace for our fonts. It's "this will align on a grid, usually but not always according to what wcwidth() wants to say".
The actual problem is beyond the font: |*| Some by-design problems of Unicode [ 1 facet: https://github.com/MasterInQuestion/Markup/blob/main/AAA.htm ] |*| Monospace-incompatible nature of certain scripts: e.g. Arabic cursive [ Demo: https://github.com/microsoft/vscode/issues/11770#issuecomment-1320735785 ]
The comb-char problem is, well, not related to this issue. Normalization exists. wcwidth does not report combining characters as having width.
Bidi also doesn't affect width. How combining widths in combine-y scripts act are written is well-established.
For an actually useful discussion about the whole fontconfig "what's a mono" thing, see https://gitlab.freedesktop.org/fontconfig/fontconfig/-/issues/176.
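To illustrate the point about combining characters and normalization, here is a minimal sketch in plain JavaScript. It does not use `wcwidth()` itself; `roughColumns` is a hypothetical helper that only assumes marks in Unicode general category M take no column, and it ignores wide (CJK) characters entirely.

```javascript
// Rough column estimate: combining marks (General_Category M*) add no width.
// Simplification: every other code point counts as one column.
function roughColumns(text) {
  let columns = 0;
  for (const ch of text) {               // iterate by code point
    if (!/\p{M}/u.test(ch)) columns += 1;
  }
  return columns;
}

const decomposed = 'e\u0301';                  // "e" + COMBINING ACUTE ACCENT
const composed = decomposed.normalize('NFC');  // single code point U+00E9

console.log(roughColumns(decomposed));   // 1
console.log(roughColumns(composed));     // 1
console.log([...decomposed].length, [...composed].length); // 2 1
```

Both spellings occupy one column under this estimate, even though they differ in code point count.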
If I understand correctly, the mentioned post means: making the width of every character a multiple of the typical character width (1 ch). This should work for most characters, but applying it to cursive scripts (e.g. Arabic, Devanagari) seems to be troublesome.
[ Quote Artoria2e5 @ CE 2023-02-20 10:30:24 UTC: https://github.com/notofonts/latin-greek-cyrillic/issues/234#issuecomment-1436711673 The comb-char problem is, well, not related to this issue. ] <^> [ Quote RayDillinger @ CE 2022-06-24 07:46:01 UTC: https://github.com/notofonts/latin-greek-cyrillic/issues/234#issue-1285558279 Monospace is not a joke for programmers. ... And it's one more whack-a-mole to be on guard against for people trying to sneak deceptive code into our systems using, among other things, lookalike identifiers that are not the same. Please don't promise monospace meaning certain deceptions CAN'T happen, and then make them happen anyway. Ligatures need to look different from their compatibility decompositions because compilers will read them differently. ] <.> Indeed not much related to monospace. But related with the issue described.
[ Quote Artoria2e5 @ CE 2023-02-20 10:30:24 UTC: https://github.com/notofonts/latin-greek-cyrillic/issues/234#issuecomment-1436711673 Bidi also doesn't affect width. ] <^> What affects width is cursive property.
[ Quote (previous): How combining widths in combine-y scripts act are written is well-established. ] <^> What's the meaning of this line? What is "combine-y script"? [ Y-Combining (vertically combining) characters (e.g. "é́́")? ]
combine-y, which is "combine" with the adjective suffix "-y". Better with "-ish" or "-ly", I suppose.
Anyways, your cursive problem is more about scripts with mandatory ligatures. Guess what? Fonts handle it just fine, you just need to find the right one.
Using the widecharwidth.js script (with a syntax fix in PR), we run:
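// Split the sample text into code points (spread iterates by code point).
const lit = [...'كوكب مونو'];

// Column width of one character, as reported by widecharwidth.js.
function xwidth(c) {
  let x = widechar_wcwidth(c);
  if (c === ' ') x = 1; // count the space as one column
  return x < 0 ? 0 : x; // treat negative results as zero columns
}

console.log(lit.reduce((total, ch) => total + xwidth(ch), 0));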
This predicts a size of 9 columns. With a working Arabic monospace font, Kawkab Mono, we measure the width in Notepad:
9 Indeed!
Why are we talking about this? Noto LGC is by definition not Arabic.
Thanks for your participation. Noto LGC is not, Noto Mono not necessarily so.
Probably not fine under extensive usage. Anyway, any non-LR-TB (non: Left to Right, Top to Bottom) script already faces a much more serious problem in general information exchange. [1]
Troublesome, not necessarily impossible (haven't thoroughly verified): the appearance of such characters depends on the surrounding characters [1], and the exact rules determining their look are elusive.
[1] Displayed in aforementioned demo.
Noto Mono is LGC, not Arabic. I'm starting to lose the thread of this discussion...
Yet so. Provision for the future. Previous information contained the caveats for font design, and mostly inclined for monospace.
The primary thread is actually "Monospace not monospace" and the AΑА Αbsurdity problem of Unicode.
Don't call this "absurdity". Confusable characters that belong to very different alphabets, that also sort differently in multilingual sequences (with primary differences), and that were disunified everywhere (including in ISO 15924) make real sense.
Also, confusability can occur with various characters of many scripts. Not just letters, but also dot-like and hyphen-like characters that may represent something other than plain letters and must be treated differently. It is still possible in a document to render different scripts with different styles (e.g. plain vs. italic). For cases where this is not possible in plain-text protocols, there are whole specifications about how to detect confusables (e.g. for internationalized domain names, with standardized ways for browsers and domain registrars to restrict or forbid mixed labels, or to alert users if this is a security concern because someone tweaks the scripts).
Unicode/ISO/IEC 10646 does not encode glyphs, but characters for their semantics and fundamental properties. Don't be misled by modern uses; those alphabets do not necessarily support the same variations of glyph styles.
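As a rough illustration of detecting mixed-script labels, the sketch below uses JavaScript's Unicode script property escapes. It is a simplification, not the confusable-detection procedure from any specification; the `SCRIPTS` table and the `scriptsIn` helper are hypothetical names introduced here, and only three scripts are checked.

```javascript
// Scripts to look for. This short list is an assumption made for the demo.
const SCRIPTS = {
  Latin: /\p{Script=Latin}/u,
  Greek: /\p{Script=Greek}/u,
  Cyrillic: /\p{Script=Cyrillic}/u,
};

// Return the names of the listed scripts that occur in `label`.
function scriptsIn(label) {
  const found = new Set();
  for (const ch of label) {
    for (const [name, pattern] of Object.entries(SCRIPTS)) {
      if (pattern.test(ch)) found.add(name);
    }
  }
  return [...found];
}

const lookalikes = '\u0041\u0391\u0410'; // Latin A, Greek Alpha, Cyrillic A
console.log(scriptsIn(lookalikes));          // [ 'Latin', 'Greek', 'Cyrillic' ]
console.log(scriptsIn('paypal').length > 1); // false: single script
```

The three letters may render identically, yet each belongs to a different script, which is what a mixed-label check can flag.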
Further discussion on AΑА Αbsurdity of Unicode: https://github.com/MasterInQuestion/talk/discussions/4
Look at the reply I've made there. You are confusing URIs with IDNs (they are specified completely independently, even if some "URI schemes" allow embedding one or more IDNs in a single URI).
URIs use an encoding scheme (with a restricted character set specified for each URI part) that is completely independent of the encoding scheme used in IDNs. The latter is specified using Punycode, based on DNS restrictions, plus registrar-specific restrictions on the permitted subset: each registrar, for each specific domain or subdomain, may forbid or permit confusable characters in labels. It is not the registrar's job to "canonicalize" characters it considers equivalent, and registrars are not required to perform such remapping, even if they apply restrictions under which all confusable domain labels are reserved at the same time. They are also not required to implement CNAME aliasing. Using CNAME is generally a bad practice, notably for secured domain names (almost mandatory now for URLs on the web, as more and more web browsers block or drop support for unsecured HTTP and FTP), unless the PKI certificate for a given domain enumerates all the possible CNAME aliases implemented by the registry, or uses wildcard entries, which are themselves known to be security issues; that is why DNS providers disable wildcards by default for all the domains they host.
Please read the RFCs carefully; this is not something you can guess from insufficient assumptions. This is absolutely not "absurdity", and all of it is completely independent of the Unicode/ISO/IEC 10646 standards. What would be "absurd" would be to make false assumptions because you understand those topics only superficially. These restrictions and specifications are very important for all users everywhere in the world and for all apps, for interoperability and security.
And AAAA records in DNS are not specific to Unicode; they are for IPv6 (absolutely needed today by many millions of users). But IPv4 and IPv6 are also fully independent of the character encodings used in labels for domain names: the same domain name, whether its labels are internationalized using Punycode or restricted to the small ASCII subset with only Basic Latin letters, single dashes or digits, can be registered to provide resolution for both IPv4 and IPv6 with A and AAAA records.
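The independence of the two encodings can be seen with the `URL` class available in Node.js and browsers. This is only a sketch: that class follows the WHATWG URL Standard rather than the URI RFCs directly, and the host name below is a made-up example.

```javascript
// WHATWG URL parser (a global in Node.js and browsers).
const url = new URL('https://münchen.example/straße?q=ü');

console.log(url.hostname); // xn--mnchen-3ya.example  (IDN label in Punycode)
console.log(url.pathname); // /stra%C3%9Fe            (UTF-8 percent-encoding)
console.log(url.search);   // ?q=%C3%BC
```

The host label is converted to its Punycode form, while the path and query are percent-encoded as UTF-8 bytes; neither transformation depends on the other.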
As said in issue 30, this causes a Qt app to complain at launch that the font isn't really mono-spaced.
And Noto Sans Mono is the default mono-spaced font in fontconfig, so it will affect most Linux distributions.
I have suggested removing the font from fontconfig till this bug is resolved, here.
@es20490446e, my (and if I recall correctly, the fontconfig people's) opinion is this: Qt is wrong. If you have read the previous fontconfig issue at https://gitlab.freedesktop.org/fontconfig/fontconfig/-/issues/176, you should know that already.
How is Qt wrong, exactly? See https://bugreports.qt.io/browse/QTBUG-67612, also mentioned in fc issue 176.
The legacy term "monospaced" is misleading. What it actually means is that the font is designed to be used with its characters (or clusters and ligatures) aligned with the boundaries of a grid, each of them making the best use of one or several cells that form a rectangular area.
This is especially important for East Asian typography, not just for display on basic computer terminals, but also for rendering at small sizes, e.g. for billing tickets, or for (electro)mechanical displays on roads or in severe conditions where lit electronic displays are not usable or must continue working without a power source (needed only for changing the displayed content).
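The grid definition above can be sketched as a simple check: an advance width fits the grid if it is a whole number of cells. The function name `cellsFor`, the 500-unit cell, and the sample advance widths are all hypothetical values chosen for illustration, not data read from any font.

```javascript
// Number of grid cells an advance width occupies, or null if it is off-grid.
// A zero advance (e.g. a combining mark) counts as zero cells.
function cellsFor(advance, cellWidth) {
  const cells = advance / cellWidth;
  return Number.isInteger(cells) && cells >= 0 ? cells : null;
}

// Hypothetical advance widths in font units; the cell is 500 units wide.
const CELL = 500;
const advances = { a: 500, wide: 1000, off: 620 };

for (const [name, advance] of Object.entries(advances)) {
  console.log(name, cellsFor(advance, CELL));
}
// a 1
// wide 2
// off null
```

Under this reading, a glyph that takes two cells is still grid-aligned; only the 620-unit advance breaks the grid.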
Thanks for the info.
It seems like a Qt bug, I made a note there.
Defect Report
Noto Mono is no longer a monospaced font.
Composed ligatures such as Ǆ (U+01C4), ǋ (U+01CB) and Ǉ (U+01C7) now occupy 1000 units of horizontal space when they should occupy 500 like everything else. If you are looking at this message in Noto Mono and can't tell those one-character examples from DŽ, Nj and LJ respectively, then you have already reproduced the problem.
Monospace is not a joke for programmers. I do in fact want to know how many characters there are in a string - for real, even if one of them is a ligature - by looking at where it lines up under the column index at the top edge of my editing window. I don't want column-aligned code to wind up out of line (or delete anything important) because my editor deleting a few characters on each line for alignment purposes unexpectedly zaps two columns instead of one.
And it's one more whack-a-mole to be on guard against for people trying to sneak deceptive code into our systems using, among other things, lookalike identifiers that are not the same. Please don't promise monospace meaning certain deceptions CAN'T happen, and then make them happen anyway. Ligatures need to look different from their compatibility decompositions because compilers will read them differently.
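One way to guard against such lookalikes without relying on the font is to flag characters that change under NFKC normalization. The sketch below is only an illustration; `charactersChangedByNfkc` is a hypothetical helper, and it also reports characters with canonical (not just compatibility) mappings.

```javascript
// List code points whose NFKC form differs from the character itself.
function charactersChangedByNfkc(text) {
  const hits = [];
  for (const ch of text) {
    const folded = ch.normalize('NFKC');
    if (folded !== ch) {
      const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
      hits.push({ codePoint: 'U+' + hex, foldsTo: folded });
    }
  }
  return hits;
}

const identifier = '\u01CBegos'; // starts with U+01CB, which renders like "Nj"
console.log(charactersChangedByNfkc(identifier));
// [ { codePoint: 'U+01CB', foldsTo: 'Nj' } ]
console.log(identifier === 'Njegos');                   // false
console.log(identifier.normalize('NFKC') === 'Njegos'); // true
```

The two identifiers compare unequal as raw strings but become equal after NFKC, which is the kind of mismatch a review tool could surface.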
I believe this stems from a "fix" for an earlier bug where correctly monospaced ligatures were being incorrectly substituted for input characters not intended as ligatures. These double wide ligatures fix the visual bug, but do so by breaking the monospace property rather than by fixing the unintended substitution.