migueldeicaza / XtermSharp

XTerm emulator as a .NET library
MIT License

some remarks regarding the parser #42

Open jerch opened 4 years ago

jerch commented 4 years ago

xterm.js' parser was written with JS strings (UTF-16 code units) in mind. Please note that it will not work correctly with raw UTF-8 bytes (for any byte with the 8th bit set). For this to work you would have to rewrite Parse in a way that correctly decodes multibyte chars on the fly. Another and probably easier approach is to simply decode the whole byte chunk to UTF-16 or UTF-32 beforehand; this way the parser can stay mostly untouched. In xterm.js we went with UTF-32 after some lengthy testing, mainly for performance reasons (with a small memory sacrifice compared to UTF-16).
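For illustration, the decode-beforehand approach could look roughly like the following C# sketch (Utf32Decoder and the chunk handling are assumptions made for the example, not XtermSharp's actual code):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Sketch: turn a raw UTF-8 byte chunk into UTF-32 codepoints before parsing,
// so Parse never sees partial multibyte sequences. The stateful Decoder keeps
// leftover bytes when a character is split across chunk boundaries.
static class Utf32Decoder
{
    static readonly Decoder utf8 = Encoding.UTF8.GetDecoder ();

    public static int[] Decode (byte[] chunk, int length)
    {
        // First decode to UTF-16 (the Decoder carries state across calls) ...
        var chars = new char[utf8.GetCharCount (chunk, 0, length, flush: false)];
        utf8.GetChars (chunk, 0, length, chars, 0, flush: false);

        // ... then widen surrogate pairs into single UTF-32 codepoints.
        var codepoints = new List<int> (chars.Length);
        for (int i = 0; i < chars.Length; i++) {
            if (char.IsHighSurrogate (chars [i]) && i + 1 < chars.Length && char.IsLowSurrogate (chars [i + 1])) {
                codepoints.Add (char.ConvertToUtf32 (chars [i], chars [i + 1]));
                i++;
            } else {
                codepoints.Add (chars [i]);
            }
        }
        return codepoints.ToArray ();
    }
}
```

The parser would then iterate over codepoints instead of raw bytes.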

Furthermore, the older version of the parser has some loose ends that are already fixed upstream:

Last but not least, the parser has seen some code/perf improvements, but those are mostly JS engine related and less likely to show a big impact in C#. On a side note, the somewhat complicated print handling that aggregates chunk slices shows a major perf improvement in JS (5 to 10 times compared to single-char handling), which is not true for the original C parser. In C, jumping into print for every single char is much faster, probably due to heavy inlining/unrolling (JS JITs are still lacking in this regard). Not sure how C# would do here; a sketch of the aggregation idea follows below. In general the perf of the parser should not be an issue; in JS it handles typical sequences at 80-150 MB/s (the C variant only being twice as fast).
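For reference, the aggregation boils down to something like this C# sketch (names and signatures are hypothetical, only meant to show the shape of the optimization):

```csharp
using System;

// Sketch of the "aggregate print slices" idea (hypothetical names, not the real
// parser API): while in the ground state, scan ahead to the end of the printable
// run and emit a single Print call for the whole slice instead of one per char.
static class PrintAggregation
{
    public static void ParseGround (int[] data, ref int pos, Action<int[], int, int> print)
    {
        int start = pos;
        // C0 controls (< 0x20) and DEL (0x7f) end the printable run.
        while (pos < data.Length && data [pos] >= 0x20 && data [pos] != 0x7f)
            pos++;
        if (pos > start)
            print (data, start, pos - start); // one handler call covering the slice
    }
}
```

Whether one call per run beats one call per char in C# would need measuring, as noted above.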

migueldeicaza commented 4 years ago

Hello,

Thanks for sharing these important updates to xterm.js with me. I will try to bring some of them over, and I will keep this ticket open until I sort them out.

I do not think that a terminal should be implemented with UTF-16, especially not to recover from bad data (whether from cat /dev/random, noise on the line, or noise from background processes). So the terminal emulator should work without assuming that a correct mapping from the byte stream to Unicode codepoints exists.

My Swift port (incomplete, which is why it is not public, but I can make the partial port public if there is interest) further shows the difference between the low-level parsing implemented with bytes and the surfacing to grapheme clusters (which are better than codepoints, which is what XtermSharp uses in the form of Rune). The Swift port further cemented my understanding of this, and has a couple of improvements to the parser to make this distinction more obvious and more resilient. I am convinced that the parser should never deal with Unicode, and should only convert to Unicode the blocks that have been successfully parsed.
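A rough sketch of that separation, assuming a parser that buffers the raw bytes of a printable block and decodes them in one step afterwards (the names are illustrative, and System.Text.Rune stands in here for XtermSharp's own Rune type):

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.Text;

// The state machine only ever sees bytes; a block that has been parsed as
// printable text is decoded to runes afterwards, so malformed bytes never
// reach the parser as "characters".
static class ByteLevelPrint
{
    public static List<Rune> DecodePrintBlock (byte[] block)
    {
        var runes = new List<Rune> ();
        int i = 0;
        while (i < block.Length) {
            var status = Rune.DecodeFromUtf8 (block.AsSpan (i), out var rune, out var consumed);
            // Invalid or truncated sequences become U+FFFD instead of derailing the parser.
            runes.Add (status == OperationStatus.Done ? rune : Rune.ReplacementChar);
            i += consumed;
        }
        return runes;
    }
}
```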

Thanks for linking to the upstream changes; I should try to bring those in.

Miguel.

jerch commented 4 years ago

> I do not think that a terminal should be implemented with UTF-16, especially not to recover from bad data (whether from cat /dev/random, noise on the line, or noise from background processes). So the terminal emulator should work without assuming that a correct mapping from the byte stream to Unicode codepoints exists.

The emulator cannot work reliably without that notion. This is also the reason why all emulators nowadays treat Unicode in terminal streams as a "transport encoding", where the whole stream gets treated as being of encoding XY, not only string parts.
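A minimal illustration of the transport-encoding point, assuming UTF-8 as the stream encoding: one stateful decoder is applied to the whole stream, so a character still comes out intact even when it arrives split across two chunks.

```csharp
using System;
using System.Text;

// "€" (E2 82 AC) arrives split across two reads, as it can on a pty; a single
// stateful decoder reassembles it instead of emitting replacement characters.
class TransportEncodingDemo
{
    static void Main ()
    {
        var decoder = Encoding.UTF8.GetDecoder ();
        byte[][] chunks = { new byte[] { 0xE2, 0x82 }, new byte[] { 0xAC } };

        var sb = new StringBuilder ();
        foreach (var chunk in chunks) {
            var chars = new char[decoder.GetCharCount (chunk, 0, chunk.Length, flush: false)];
            decoder.GetChars (chunk, 0, chunk.Length, chars, 0, flush: false);
            sb.Append (chars);
        }
        Console.WriteLine (sb.ToString ()); // prints "€"
    }
}
```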

Grapheme support is an annoying issue that gets more and more pressing with the Unicode 9+ releases. Currently none of the common emulators/terminal libs handle graphemes; we are basically stuck with (slightly wrong) wcwidth tables. Also, Unicode has introduced some terminal-unfriendly rules (all the rules that depend on renderer capabilities); we definitely need some formal specification in this field for terminal usage.
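As a small illustration of the gap (the example string is only an assumption): a ZWJ emoji sequence is one grapheme cluster but several codepoints, so a per-codepoint wcwidth lookup cannot give the right cell width.

```csharp
using System;
using System.Globalization;

// One grapheme cluster, several codepoints: per-codepoint width tables fall short here.
class GraphemeDemo
{
    static void Main ()
    {
        var family = "👨‍👩‍👧"; // man + ZWJ + woman + ZWJ + girl

        int codepoints = 0;
        foreach (var rune in family.EnumerateRunes ())
            codepoints++;

        Console.WriteLine (codepoints);                                   // 5 codepoints
        Console.WriteLine (new StringInfo (family).LengthInTextElements); // 1 grapheme cluster on .NET 5+
    }
}
```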