microsoft / terminal

The new Windows Terminal and the original Windows console host, all in the same place!
MIT License

Fine-grained DWrite text analysis based on text complexity #9156

Closed skyline75489 closed 1 year ago

skyline75489 commented 3 years ago

Description of the new feature/enhancement

Inspired by https://github.com/microsoft/cascadia-code/issues/411: depending on the font being used, certain ASCII characters can break the simplicity of the entire text. The current implementation skips DWrite analysis only when the entire text is simple:

if (!_isEntireTextSimple)
{
    // Call each of the analyzers in sequence, recording their results.
    RETURN_IF_FAILED(_fontRenderData->Analyzer()->AnalyzeLineBreakpoints(this, 0, textLength, this));
    RETURN_IF_FAILED(_fontRenderData->Analyzer()->AnalyzeBidi(this, 0, textLength, this));
    RETURN_IF_FAILED(_fontRenderData->Analyzer()->AnalyzeScript(this, 0, textLength, this));
    RETURN_IF_FAILED(_fontRenderData->Analyzer()->AnalyzeNumberSubstitution(this, 0, textLength, this));
    // Perform our custom font fallback analyzer that mimics the pattern of the real analyzers.
    RETURN_IF_FAILED(_AnalyzeFontFallback(this, 0, textLength));
}

With Fira Code, for example, the optimization in most cases only applies to lines of 120 spaces, which is not good.

Proposed technical implementation details (optional)

GetTextComplexity can provide a breakdown report of the text, showing which specific ranges of the text are simple. We should be able to utilize it like this:

for (const auto& range : complexRanges)
{
    // Call each of the analyzers in sequence, recording their results.
    RETURN_IF_FAILED(_fontRenderData->Analyzer()->AnalyzeLineBreakpoints(this, range, this));
    RETURN_IF_FAILED(_fontRenderData->Analyzer()->AnalyzeBidi(this, range, this));
    RETURN_IF_FAILED(_fontRenderData->Analyzer()->AnalyzeScript(this, range, this));
    RETURN_IF_FAILED(_fontRenderData->Analyzer()->AnalyzeNumberSubstitution(this, range, this));
    // Perform our custom font fallback analyzer that mimics the pattern of the real analyzers.
    RETURN_IF_FAILED(_AnalyzeFontFallback(this, range));
}

See #6695 for the introduction of text complexity analysis.
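As a rough illustration of how a list like complexRanges might be built, here is a minimal sketch. It assumes a query callback that behaves like IDWriteTextAnalyzer1::GetTextComplexity in one respect only: for the text it is handed, it reports whether the leading span is simple and how many characters that verdict covers. ComplexityRange, ComplexityQuery, CollectComplexityRanges and AsciiIsSimple are hypothetical names made up for this example and are not part of the Terminal code base or of DirectWrite.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical: a contiguous span of characters sharing one complexity verdict.
struct ComplexityRange
{
    std::uint32_t position;
    std::uint32_t length;
    bool isSimple;
};

// Hypothetical stand-in for a complexity query. Given the remaining text, it
// reports whether the leading span is simple and how many characters it read.
using ComplexityQuery = std::function<void(const wchar_t* text,
                                           std::uint32_t length,
                                           bool& isSimple,
                                           std::uint32_t& lengthRead)>;

// Walks the text front to back, asking the query about each remaining tail,
// and records one range per verdict. Adjacent ranges with the same verdict
// are merged so the result stays as coarse as possible.
std::vector<ComplexityRange> CollectComplexityRanges(const std::wstring& text,
                                                     const ComplexityQuery& query)
{
    std::vector<ComplexityRange> ranges;
    const auto total = static_cast<std::uint32_t>(text.size());
    std::uint32_t pos = 0;
    while (pos < total)
    {
        bool isSimple = false;
        std::uint32_t read = 0;
        query(text.data() + pos, total - pos, isSimple, read);
        // Guard against a query that reports nothing or too much, so the
        // loop always makes progress and never runs past the end.
        if (read == 0 || read > total - pos)
        {
            read = total - pos;
        }
        if (!ranges.empty() && ranges.back().isSimple == isSimple)
        {
            ranges.back().length += read;
        }
        else
        {
            ranges.push_back({ pos, read, isSimple });
        }
        pos += read;
    }
    return ranges;
}

// Toy query for illustration only: treats ASCII as simple and everything else
// as complex. Real results depend on the font, as this issue points out.
void AsciiIsSimple(const wchar_t* text, std::uint32_t length,
                   bool& isSimple, std::uint32_t& lengthRead)
{
    isSimple = text[0] < 0x80;
    lengthRead = 1;
    while (lengthRead < length && (text[lengthRead] < 0x80) == isSimple)
    {
        ++lengthRead;
    }
}
```

Calling CollectComplexityRanges with AsciiIsSimple on a string of four CJK characters followed by " (C)" yields two ranges, a complex one of length 4 and a simple one of length 4. Only the complex ranges would then need to go through the analyzers.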

skyline75489 commented 3 years ago

This should also help users of non-English locales. For example, it avoids having to analyze the entire text of a line like this:

版权所有 (C) Microsoft Corporation。保留所有权利。

skyline75489 commented 3 years ago

/cc @miniksa for both sanity & technical check

skyline75489 commented 3 years ago

I did some experiments and found that text complexity is not the same as run splitting. For example, with the following text:

版权所有 (C) Microsoft Corporation。保留所有权利。

The text complexity analysis reports (each a, b is a position, length pair):

The run analysis split it into the following runs:

We might also need some sort of RLE implementation to find out whether a run is entirely simple, and then optimize the shaping process for that run.
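One way to picture that check is a minimal sketch that takes the complexity results as a run-length-encoded list of ranges and asks whether a given run lies wholly inside simple ranges. ComplexityRange and IsRunEntirelySimple are hypothetical names for this example. The sketch assumes the ranges are sorted by position and do not overlap, and it treats any uncovered gap as not simple.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical: one entry of a run-length-encoded complexity report.
struct ComplexityRange
{
    std::uint32_t position;
    std::uint32_t length;
    bool isSimple;
};

// Returns true when every character in [runPos, runPos + runLen) is covered
// by a simple range. Assumes `ranges` is sorted by position and
// non-overlapping. An empty run is considered simple.
bool IsRunEntirelySimple(const std::vector<ComplexityRange>& ranges,
                         std::uint32_t runPos,
                         std::uint32_t runLen)
{
    if (runLen == 0)
    {
        return true;
    }
    const std::uint32_t runEnd = runPos + runLen;
    std::uint32_t covered = runPos; // first position not yet proven simple
    for (const auto& range : ranges)
    {
        const std::uint32_t rangeEnd = range.position + range.length;
        if (rangeEnd <= covered)
        {
            continue; // range ends before the part we still need to check
        }
        if (range.position >= runEnd)
        {
            break; // range starts after the run
        }
        if (range.position > covered || !range.isSimple)
        {
            return false; // uncovered gap, or a complex span inside the run
        }
        covered = rangeEnd;
        if (covered >= runEnd)
        {
            return true;
        }
    }
    return false; // ran out of ranges before covering the whole run
}
```

With ranges [0,4) complex and [4,8) simple, a run covering [0,6) is not entirely simple, while runs covering [4,6) or [6,8) are.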

miniksa commented 3 years ago

I agree that we should make use of the additional analysis information to improve performance in this way.

I do think that we could just split the Runs further and give each one an additional simple-or-not parameter (bool) during the initial _AnalyzeTextComplexity. That flag would then be picked up during _AnalyzeRuns to decide between full analysis and skipping it, and again during _ShapeGlyphRuns to decide between quick-mapping and slow-mapping to glyphs. In lieu of the whole thing being simple, a Run would be simple or not.

I'm not quite sure why your example maps as it does. Are some of those characters UTF-16 surrogate pairs?

skyline75489 commented 3 years ago

Those are just normal Chinese characters. Originally I thought text complexity analysis would split the text the same way as run splitting. I just wanted to add an example to show that it doesn’t.

a Run would be simple or not

This is likely undetermined. In the example above:

“版权所有 (”

This is a Run. But according to text complexity, the first 4 characters are complex and the last 2 characters are simple. This is what frustrates me. We can’t easily tell whether a Run is simple and optimize based on that.

miniksa commented 3 years ago

Yeah but what I'm saying is that we can just call _SetCurrentRun and _SplitCurrentRun inside of _AnalyzeTextComplexity when we start listening to the length of the complexity and add the additional data.

So then you have a [0,4) complex run. [6,8) simple run. [8, 26) simple run. etc. etc.
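The splitting idea can be sketched independently of the renderer. The following is a simplified illustration, not the actual _SetCurrentRun / _SplitCurrentRun logic: Run, ComplexityRange and SplitRunsByComplexity are hypothetical names, and a real Run carries much more state (script, bidi level, font face) that would have to be copied to both halves of a split. The sketch assumes the complexity ranges are sorted by position and non-overlapping, and it conservatively marks any uncovered gap as complex.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical: one span of a complexity report.
struct ComplexityRange
{
    std::uint32_t position;
    std::uint32_t length;
    bool isSimple;
};

// Hypothetical, heavily simplified run: just a span plus the extra flag.
struct Run
{
    std::uint32_t position;
    std::uint32_t length;
    bool isSimple;
};

// Splits each run at complexity boundaries so that every resulting run is
// either wholly simple or wholly complex.
std::vector<Run> SplitRunsByComplexity(const std::vector<Run>& runs,
                                       const std::vector<ComplexityRange>& ranges)
{
    std::vector<Run> result;
    for (const auto& run : runs)
    {
        const std::uint32_t runEnd = run.position + run.length;
        std::uint32_t cursor = run.position; // start of the part not yet emitted
        for (const auto& range : ranges)
        {
            const std::uint32_t begin = std::max(cursor, range.position);
            const std::uint32_t end = std::min(runEnd, range.position + range.length);
            if (begin >= end)
            {
                continue; // this range does not overlap the rest of the run
            }
            if (begin > cursor)
            {
                // No complexity information for this gap: treat it as complex.
                result.push_back({ cursor, begin - cursor, false });
            }
            result.push_back({ begin, end - begin, range.isSimple });
            cursor = end;
        }
        if (cursor < runEnd)
        {
            result.push_back({ cursor, runEnd - cursor, false });
        }
    }
    return result;
}
```

Given runs [0,6) and [6,8) and complexity ranges [0,4) complex and [4,8) simple, this produces [0,4) complex, [4,6) simple and [6,8) simple. The later stages could then branch on the flag per run. Whether the extra fragmentation affects line breaking or script analysis is a separate question that this sketch does not address.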

skyline75489 commented 3 years ago

Doesn’t that bring more fragmentation into the process? Will it affect the line breaking and script analysis result? I need to dig more into this...

θŽ·ε– Outlook for iOShttps://aka.ms/o0ukef

miniksa commented 3 years ago

To your questions: oh probably. It's worth a try though to see if it just works. Sometimes the simple answer is "good enough". If it turns out to not be, we can refine further from there. Feel free to try/dig!

skyline75489 commented 3 years ago

Can we reopen this? #9202 was reverted.

#10036 is an unsuccessful attempt to patch #9202.

lhecker commented 1 year ago

AtlasEngine does this! 💖