microsoft / terminal

The new Windows Terminal and the original Windows console host, all in the same place!
MIT License
95.37k stars 8.3k forks source link

[Epic] Text Buffer rewrite #8000

Closed DHowett closed 6 months ago

DHowett commented 4 years ago

This is the issue tracking the great buffer rewrite of 202x.

Aims

Notes

Surrogate Pairs

Work will be required to teach WriteCharsLegacy to measure UTF-16 codepoints in aggregate, rather than individual code units.

I have done a small amount of work in WriteCharsLegacy. It is slow going.

Motivation

8689 (IRM) requires us to be able to shift buffer contents rightward. I implemented it in a hacky way, but then realized that UnicodeStorage would need to be rekeyed.

Implementation

The buffer is currently stored as a vector (small_vector) of CharRowCell, each of which contains a DbcsAttribute and a wchar_t. Each cell takes 3 bytes (plus padding, if required.)

In the common case (all narrow text), this is terribly wasteful.

To better support codepoints requiring one or more code units representing a character, we are going to move to a single wchar string combined with a column count table. The column count table will be stored compressed by way of til::rle (#8741).

Simple case - all glyphs narrow
 CHAR    A    B    C    D
UNITS 0041 0042 0043 0044
 COLS    1    1    1    1

Simple case - all glyphs wide
 CHAR   カ   タ   カ   ナ
UNITS 30ab 30bf 30ab 30ca
 COLS    2    2    2    2

Surrogate pair case - glyphs narrow
 CHAR         🕴        🕴        🕴
UNITS d83d dd74 d83d dd74 d83d dd74
 COLS    1    0    1    0    1    0

Surrogate pair case - glyphs wide
 CHAR        🥶        🥶        🥶
UNITS d83e dd76 d83e dd76 d83e dd76
 COLS    2    0    2    0    2    0

Representative complicated case
 CHAR        🥶    A    B         🕴
UNITS d83e dd76 0041 0042 d83d dd74
 COLS    2    0    1    1    1    0

Representative complicated case (huge character)
[FUTURE WORK]
 CHAR ﷽
UNITS         fdfd
 COLS           12

Representative complicated case (Emoji with skin tone variation)
[FUTURE WORK]
 CHAR 👍🏼
UNITS d83d dc31 200d d83d dc64
 COLS    2    0    0    0    0

A column count of zero indicates a code unit that is a continuation of an existing glyph.

Since there is one column width for each code unit, it is trivial to match column offsets with character string indices by summation.

Work Log

Other issues that might just be fixed by this

DHowett commented 3 years ago

Alright, I've claimed this issue for the text buffer updates.

KalleOlaviNiemitalo commented 3 years ago

Representative complicated case (huge character)

TAB characters in https://github.com/microsoft/terminal/issues/7810 could also be stored this way.

DHowett commented 3 years ago

@KalleOlaviNiemitalo Indeed! I'd thought about this when designing the new advances-based buffer, but I was a little worried about the implications. We don't have good coverage for what happens when the cursor moves into a column covered by a wide glyph, and I think nailing that down will be important for getting tab-in-buffer working.

DHowett commented 3 years ago

Actually, on second thought ... this isn't as bad as I'd expected. We already move into multi-cell glyphs, and we don't damage them properly when they're overstruck. We don't render the cursor with the correct width when we're inside such a glyph either...

When the new damage and column-to-region mapping algorithms, this may be a breeze.

naikel commented 3 years ago

Will this rewrite support a TextBuffer >= 32767? Currently ROW._id is a SHORT and TextBuffer size is a COORD, which is also composed of SHORTs.

Changing TextBuffer COORD references to til::point mean a major rewrite of half the code, but it's needed if you want to support Infinite Scrollback (with files).

j4james commented 3 years ago

@DHowett I don't think this is as simple as you think. Lets say you write out the string ABCDEFGHIJ, move the cursor back to column 3 (on the character C), and then output a TAB. This moves the cursor to column 9 (on the character I), but there's no visible change to the content. In what way does the buffer now change to reflect that TAB?

Things like a zero-width-joiner are even more problematic. Let's say you write out a waving white flag (U+1F3F3), a space, and a rainbow (U+1F308), then go back and replace that space with a zero-width-joiner. The buffer sequence is now flag-zwj-rainbow, which Unicode defines as a rainbow flag. How would you alter the display to reflect those 3 characters collapsing into 1?

And whatever you do has to be compatible with the way other terminal emulators behave, and more importantly, the way apps expect the terminals to behave, otherwise they just won't work correctly in Windows Terminal. I suspect some of the things you're envisioning here are simply not possible.

DHowett commented 3 years ago

@j4james I'm not terribly concerned about tab - it's an auxiliary consideration for this spec. If we wanted to replicate iTerm's behavior here, it could only replace up to the next $TABSTOP characters with a N-column \t if the moved-over portion was otherwise stored as spaces.

ZWJ is another "future consideration". I'm hoping that this is more robust for now, rather than fully scalable for the future. N:M glyph mapping, inserting ZWJs, etc. is more impossible with what we have than this specification.

I wouldn't expect overstriking a non-joining space with a joining one to ever join adjacent columns once they're frozen in place in the text buffer. The behavior would, rather, be "ZWJ only attaches characters if they are written consecutively, and a cursor move breaks a character-in-composition" or "ZWJ is zero-width and always attaches to the character stored in the prior column." Either of those would be more possible than what we're working with today. It doesn't necessarily mean they're right, it just means that our horizon expands to better consider them.

All that, though, is not in scope right now. Incremental progress :smile: while retaining compatibility with what applications (using both the Win32 Console APIs and VT) do today are my goal.

DHowett commented 3 years ago

@naikel breaking any dependency we have on COORD internally is a significant bonus.

naikel commented 3 years ago

@DHowett I tried once to extend TextBuffer >= 32767 changing ROW._id from SHORT to size_t and TextBuffer COORD to til::point and after three days failed miserably. I think it's easier to just code a whole new terminal from scratch.

EDIT: I really hope you can do this!

DHowett commented 3 years ago

Thoughts for replacing the global function pointer that determines glyph width (which the Renderer uses to answer for ambiguous glyphs).

struct IMeasurer {
    // Measures one codepoint, stored as UTF-16 code units
    virtual CodepointWidth MeasureCodepoint(std::wstring_view) = 0;

    // Measures multiple codepoints, stored as a string of UTF-16 code units.
    // This function should (?) return 0-width code units for combining characters
    // if and only if they would actually combine. Use Renzhi's new measuring algorithm here.
    virtual std::vector<CodepointWidth> MeasureString(std::wstring_view) = 0;
}
struct UnicodeUCDMeasurer : public IMeasurer {
    // implements it using raw compiled-in unicode UCD, never asks anyone else, totally static
}
struct RendererFallbackMeasurer : public IMeasurer {
    IMeasurer* rootMeasurer;
    RenderEngine* renderEngine;

    // IFF rootMeasurer returns Ambiguous widths for any individual codepoints, ask the render engine
    // This is only required for conhost, where we **must** maintain compatibility with "font dictates display width"
}
j4james commented 3 years ago
    // This function should (?) return 0-width code units for combining characters
    // if and only if they would actually combine. Use Renzhi's new measuring algorithm here.

Can I ask what Renzhi's algorithm is? My concern is if it doesn't match the behaviour of the wcwidth routines that Linux apps often use, then those apps are just not going to work correctly in WT, no matter how brilliantly we render zero-width joiners.

Also how does this relate to what is stored in the buffer? Is the intention to measure before deciding how something gets stored, or is the measuring to determine how the buffer is rendered? The reason I ask, is because if element N in the row buffer doesn't map to column N on the display, then some VT operations will also not work correctly.

This can even be a problem for client-side operations, like search and selection, although that at least is something we could maybe compensate for. And I know the DX renderer already has issues with the buffer to screen mapping, but I'd rather it didn't get worse.

skyline75489 commented 3 years ago

Renzhi is a MS employee like me who is moonlighting this project. He is an expert on font, Unicode and layout. He has been working with us internally to improve the overall Unicode support by providing an algorithm for general text width measurement.

获取 Outlook for iOShttps://aka.ms/o0ukef

j4james commented 3 years ago

OK, thanks for info. As long as you guys are confident that this is going work out OK, feel free to ignore my questions - I don't need to know all the details. The main thing was I wanted to make sure you weren't overlooking something in the design that might later turn out to be a problem.

zadjii-msft commented 3 years ago

since we said his name 3 times: @reli-msft

DHowett commented 3 years ago

@j4james I can try to sum up his core suggestion. Sorry -- this will be terse and mostly incorrect.

Using only primitive knowledge of codepoints (no font info), we can approximate the column count and combining behavior of any string of text. Renzhi proposes an algorithm (to be submitted for public standard review at a later date, and consideration by other terminal emulators much in the same way as Egmont's BiDi spec) and a reference implementation (which applications could eventually use for measuring) and an optional font feature that a font author can use to enforce specific constraints (e.g. "the Terminal Width 1 calculator would measure this to be 2 columns; applying font feature tw01 will ensure that this does measure two columns when rendered"). The font stuff aside, though, the primitives are such that we'd store this in the buffer and maintain the invariants that preexisting applications (windows and VT) are expecting. :smile:

It also comes with some suggestions about an "editing zone"/"active zone", which dictates where the cursor is/how appending combining glyphs works/how destruction of multi-column glyphs works.

My primary intent is twofold:

  1. Refactor the buffer to make column counts more explicit, rather than imputing them based on "is the character double-width and does it happen to be actually doubled in the backing store and have we set the flags appropriately"
  2. Have somewhere to store additional info (like cluster counts, bidi paragraph info, etc.)

We're in the always-weird situation where wcwidth is insufficient for things that are more than one wchar in code unit length, and wcswidth doesn't appear to have been holistically designed for this. Hoping that by offering a library and a public spec that others could buy off on we can build some community energy around text layout.

DHowett commented 3 years ago

With the current test implementation, element N in the Row buffer will not map to column N -- simply because we want to to support one code unit taking up two columns (U+30AB KATAKANA LETTER KA) and two code units taking up one column (U+1F574 MAN IN BUSINESS SUIT LEVITATING). Like the DirectWrite clustering algorithm, we need an aside table that lets us figure out what code unit range covers a specific column (or column range). That's transparent, though. The interface for Row exposes NewMagicGlyphAt (name TBD), which will return the entire codepoint (all code units) in a specific column and the original starting column & column extent in case the requested column was partway through a multi-column glyph.

DHowett commented 3 years ago

More typing, more e-mails for everyone.

We can easily keep the invariant that Row::bufffer[N] maps to column N in the DBCS case for U+30AB (simply double or triple the character), but I cannot see how we will keep the invariant for U+1F574 (since it will always be two code units) unless we go to UTF-32, which will certainly double our memory usage in the common case.

We already don't totally have that mapping today... the UnicodeStorage is weird (it's a map keyed on row # + column #) and stores any codepoint that requires more than one code unit. I expect that it is very expensive in the uncommon case, and I have found to date three places were we don't actually manipulate it properly.

Storing everything inline with a running advance counter seems much more . . . civil?

reli-msft commented 3 years ago

@j4james I am expecting that, the measuring algorithm that I am working on, will return the same results as wcwidth if the string doesn't involve proper complex scripts (like Devanagari, or Emoji that used ZWJ).

However when dealing proper complex scripts, that purposed algorithm will take the context into consideration and return more accurate results for them. For example, the purposed measuring algorithm will be able to handle Halants (Viramas), which is frequently used in Indic scripts that (in many cases) turn consonant letters around it into combining forms.

j4james commented 3 years ago

I understand why you want to do this, but if your proposed algorithm is going to break existing software (which I assume will be the case), then all I ask is that we make it opt-in via a private mode sequence. That way, apps that rely on existing terminal behavior will still work, and those that want to make use of the new improved measuring algorithm can still enable it explicitly.

Also, if you haven't already done so, I'd also encourage you to read up on some of the previous proposals for addressing wcwidth limitations, and the problems that arise from trying to make something like this work in a terminal emulator.

DHowett commented 3 years ago

if your proposed algorithm is going to break existing software (which I assume will be the case) (JH)

maintain the invariants that preexisting applications (windows and VT) are expecting (DH)

...will return the same results as wcwidth if the string doesn't involve proper complex scripts... (RL)

I realize that this is caveated (third quote), but existing software that is aware of 1- or 2- column widths shouldn't see a change in behavior.

Legitimate question though- are there conceivably applications that would be relying on terminal emulators to poorly compose combining characters in such a way that they would be broken if composition were fixed? Is bad composition the right thing to keep? I suspect applications just avoid it...

I'd previously forwarded @reli-msft the discussions on the VT working group & gnome-terminal trackers about wcwidth and addressing its limitations. We're not working in a vacuum :smile:

j4james commented 3 years ago

Legitimate question though- are there conceivably applications that would be relying on terminal emulators to poorly compose combining characters in such a way that they would be broken if composition were fixed?

Any application that deals with generic text that it needs to wrap or clip to fit an area of the screen. Text editors, email clients, web browsers, irc/chat clients, multiplexers. If the text turns out to be longer than the app expected, that can potentially end up breaking the entire layout of the page.

Then there are things like cursor positioning, highlighting of search terms, and link text. If the app expects a certain word to appear at a particular offset on the line, but it's actually somewhere else because our measurements aren't in sync, then they're going to be highlighting the wrong text, or placing the cursor in the wrong position.

So yeah, it's nice if you can get your combining characters to render perfectly, but not if that breaks everything else.

reli-msft commented 3 years ago

@j4james

I am expecting that for LGC and CJK writing systems, most symbols and pictograms, alongside "simple" Emoji set, the purposed measurer (TMe1) will be compatible with wcwidth.

For the rest, like Devanagari or Telugu, their usage in CLI applications is very limited, and both terminals and CLI applications' current behavior around them is already severely broken. Therefore, it is not a big problem to create a better but non-compatible algorithm for these writing systems. There will also be a termcap for complex-script-awareness.

If strict wcwidth compatibility is really a strong need, the purposed complex script support architecture is also capable to introduce multiple measurers, so we could make either the old or new measurer opt-in.

skyline75489 commented 3 years ago

I’d like to point out that the current TextBuffer implementation is also a huge blocker that stands in the way of high IO throughout & overall performance. The inability to batch read/write a large range of characters in the TextBuffer is disastrous in terms of IO throughput.

If the team decides to improve the ConPTY & WT performance and take it to the next level, this issue is likely the way to go.

Update: An experimental PR #10702 to demonstrate the performance benefit of text-buffer rewrite.

rkeithhill commented 3 years ago

@DHowett Do you think this PowerShell issue is actually a Windows Terminal issue? Note the console width is odd (125). The unknown char glyphs don't show up when the console width is even.

image

We can repro this with WSL/Bash in Windows Terminal (no PowerShell). But we can't repro in other terms (iTerm) using PowerShell.

lhecker commented 1 year ago

Thinking about this issue some more, I realized that almost all of our remaining problems can be expressed with just a single goal in mind: The IsGlyphFullWidth function needs to be removed.

Of course it has to remain in some places for the terminal to function (TextBuffer primarily), but in most places it needs to go. This is for two reasons:

Additionally OutputCellIterator should be merged into TextBuffer and ROW and it's 3 calls to IsGlyphFullWidth be replaced, so that we can add grapheme cluster segmentation there, which can't be trivially represented using an iterator like OutputCellIterator.

This requires the following changes:

Anutrix commented 1 year ago

Since 2022 is close to an end, any timeline on this?

https://github.com/microsoft/terminal/issues/1947 (issue that I am facing)+ https://github.com/microsoft/terminal/issues/1860 is making it almost impossible to use when SSHing into lot of devices/VMs.

And this ticket seems to be a blocker for #1947's potential MR.

zadjii-msft commented 1 year ago

Hmm. We got rid of UnicodeStorage in #13626. IRM might be easier now, now that we don't need to re-key all that. We may want to revisit.

ofek commented 1 year ago

IRM might be easier now, now that we don't need to re-key all that. We may want to revisit.

Did we end up revisiting this?

j4james commented 1 year ago

@ofek Support for IRM was added in Windows Terminal Preview v1.17.1023.

mominshaikhdevs commented 1 year ago

Screenshot 2023-05-02 161212 why this task is marked as completed when the linked issue is open?

lhecker commented 1 year ago

The linked issue (#1472) is just a reference what that bullet point is for and not the task itself. Thinking about it some more, I realize that this issue (#8000) is actually kind of done already. I rewrote the buffer in #13626 and the other commits I've done since aren't really related to the "Text Buffer" anymore and they're more related to #1472, hmm... Eh, I guess it's not really that important.

mominshaikhdevs commented 1 year ago

can someone update the issue description? #30 seems to be the only remaining open issue at this point.

lhecker commented 6 months ago

With only a few places left that rely on the old grid-based buffer layout, all of which require some long term, but focused work, I'll be going ahead and close this more general issue.