Closed wmjordan closed 7 years ago
Perhaps... sounds like you might be in a good position to offer some suggestions. :)
I have little knowledge of Scintilla.
I saw you tried to determine whether the content being styled contained Unicode characters, and the content I was trying to style did contain Unicode characters.
How does Scintilla store characters internally? If we know how characters are stored, we can use `GetTextRange` to extract the text to be styled, compute the byte count on our own side, and then call `SCI_SETSTYLING` with that byte count. That would reduce multiple P/Invoke calls to one and hopefully improve performance.
So Unicode is the real issue. Internally, Scintilla stores text as UTF-8 bytes, so there is not a one-to-one mapping of characters to bytes; we therefore need to make a few P/Invoke calls to `SCI_LINELENGTH`, `SCI_POSITIONFROMLINE` and `SCI_POSITIONRELATIVE` (among others) and do some calculations. All of this translation happens in the `LineCollection` class, and that is where you will find the helper methods you are looking for to convert byte positions to char positions and vice versa. There are already some significant optimizations in `LineCollection` (caching) which I won't try to explain and which I don't think you'll need to understand to implement the approach you are suggesting.
In other words, I think you are correct to approach this as a problem of 'batching' multiple calls and/or 'caching' expensive position lookups/translations. With some combination of those you can hopefully minimize the number of P/Invoke calls.
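A minimal sketch of the batching idea (the `Token` type and `Styler` class here are hypothetical illustrations, not ScintillaNET APIs): retrieve the styled range once as a .NET string, then compute every token's UTF-8 byte count in managed code instead of per-token native position lookups.

```csharp
using System;
using System.Text;

// Hypothetical token: a start offset and length, in chars, within the range text.
class Token
{
    public int Start;
    public int Length;
}

static class Styler
{
    // Compute every token's UTF-8 byte count from one managed string,
    // so styling can later be issued in byte units without extra native calls.
    public static int[] ByteCounts(string rangeText, Token[] tokens)
    {
        var counts = new int[tokens.Length];
        for (int i = 0; i < tokens.Length; i++)
            counts[i] = Encoding.UTF8.GetByteCount(
                rangeText.Substring(tokens[i].Start, tokens[i].Length));
        return counts;
    }
}
```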
Thank you for your explanation. Unicode was really the problem. I tried another longer code piece, without Unicode characters, styled with the same lexer, and I did not notice there was a performance issue.
Since UTF-8 is used internally by Scintilla, is it possible to use `System.Text.Encoding.UTF8.GetByteCount` to calculate the byte count instead of batches of P/Invoke calls? I'll give it a try if so.
I changed my code to style my tokens with a line like:

```csharp
scintilla.DirectMessage(SCI_SETSTYLING, (IntPtr)token.ByteCount, (IntPtr)(int)token.SyntaxType);
```
The `ByteCount` was the UTF-8 byte count of the token.
The styling was still correct after the modification and the time consumption dropped from seconds to several milliseconds.
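For context, the surrounding calls looked roughly like this (a sketch only; `startBytePos`, `tokens`, and `token.ByteCount` come from my own lexer, while `SCI_STARTSTYLING`/`SCI_SETSTYLING` are the raw Scintilla messages):

```csharp
// Position the styler once at the byte offset where styling begins,
// then emit one SCI_SETSTYLING per token using precomputed UTF-8 byte counts.
scintilla.DirectMessage(NativeMethods.SCI_STARTSTYLING,
    (IntPtr)startBytePos, IntPtr.Zero);
foreach (var token in tokens)
{
    // token.ByteCount == Encoding.UTF8.GetByteCount(token text)
    scintilla.DirectMessage(NativeMethods.SCI_SETSTYLING,
        (IntPtr)token.ByteCount, (IntPtr)(int)token.SyntaxType);
}
```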
However, I found another problem. While creating tokens, I now had to call `GetTextRange` quite a few times, which had the same performance problem, so I had to find a workaround for that as well.
How are you arriving at the UTF-8 byte count?
The reason ScintillaNET does what it does is that a string like `"Hèllo Wòrld"` has a length of 11 in C#/.NET; however, that same string requires 13 bytes to represent in UTF-8 (because of the grave accents). ScintillaNET APIs have been designed so that they work well with C#/.NET strings and developers don't need to make those kinds of conversions manually. If you aren't accounting for this, your approach will fail when you encounter certain Unicode characters.
If you are taking that into account, then ignore the caution.
As for `GetTextRange`, you can 'batch' this in the same way as I suggested for `SetStyling`. For example, instead of enumerating tokens and then calling `GetTextRange` for each, call `GetTextRange` for an entire line of text and then enumerate the tokens in that string. If that isn't enough, scale up to multiple lines (paragraphs), etc. This suggestion comes from the Custom Syntax Highlighting wiki recipe:
For performance or practical reasons you may prefer to get an entire line of text at a time. It's entirely up to you how you go about writing your lexer.
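A sketch of that per-line approach (this assumes ScintillaNET's public `Lines`, `StartStyling`, and `SetStyling` members; `Tokenize` stands in for your own lexer and is hypothetical here):

```csharp
// Fetch each line once as a .NET string, tokenize it in managed code,
// and style it token-by-token; only one text retrieval per line.
scintilla.StartStyling(scintilla.Lines[firstLine].Position);
for (int line = firstLine; line <= lastLine; line++)
{
    string text = scintilla.Lines[line].Text;   // one native round-trip per line
    foreach (var token in Tokenize(text))       // your lexer, in managed code
        scintilla.SetStyling(token.Length, token.Style);
}
```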
Yep, I did the following calculation in my code:

```csharp
var byteCount = System.Text.Encoding.UTF8.GetByteCount("Hèllo Wòrld");
// byteCount == 13
```
I modified my lexer, took your advice to get the entire line of text and calculated the byte count for each token. The performance became acceptable eventually.
BTW, is it possible to recompile Scintilla to use UTF-16 little endian internally instead of UTF-8? If so, we could save a considerable amount of byte-position calculation.
UTF-8 is the only Unicode mode supported by Scintilla.
I checked out the source code of Scintilla and searched the web, and found that I had run into a decade-old issue. In 2006, someone had already submitted a feature request for UTF-16 support in Scintilla, for the sake of interoperating with the huge number of Windows Unicode applications: https://sourceforge.net/p/scintilla/feature-requests/328/ Last month someone posted a "patch" to support UTF-16: https://sourceforge.net/p/scintilla/feature-requests/1175/
Finally, I modified the source code of ScintillaNET to alleviate the Unicode performance issue. It is still quite dirty, but the performance is noticeably improved when a line is long and contains a lot of Unicode characters.
```csharp
internal int CharToBytePosition(int pos)
{
    Debug.Assert(pos >= 0);
    Debug.Assert(pos <= TextLength);

    // Adjust to the nearest line start
    var line = LineFromCharPosition(pos);
    var bytePos = scintilla.DirectMessage(NativeMethods.SCI_POSITIONFROMLINE, new IntPtr(line)).ToInt32();
    pos -= CharPositionFromLine(line);

    // Optimization when the line contains NO multibyte characters
    if (!LineContainsMultibyteChar(line))
        return (bytePos + pos);

    if (pos < 10)
    {
        while (pos > 0)
        {
            // Move char-by-char
            bytePos = scintilla.DirectMessage(NativeMethods.SCI_POSITIONRELATIVE, new IntPtr(bytePos), new IntPtr(1)).ToInt32();
            pos--;
        }
        return bytePos;
    }

    // Decode the whole line once and count the UTF-8 bytes of the first 'pos' chars
    var lineLength = scintilla.DirectMessage(NativeMethods.SCI_LINELENGTH, new IntPtr(line)).ToInt32();
    var ptr = scintilla.DirectMessage(NativeMethods.SCI_GETRANGEPOINTER, new IntPtr(bytePos), new IntPtr(lineLength));
    string s = Helpers.GetString(ptr, lineLength, scintilla.Encoding);
    unsafe
    {
        fixed (char* c = s)
        {
            return bytePos + scintilla.Encoding.GetByteCount(c, pos);
        }
    }
}
```
I applied the aforementioned patch to Scintilla, recompiled ScintillaNET, modified `CharToBytePosition` to `return 2 * pos;`, and set the code page to 1200 when the `Scintilla` control was created.
The performance when processing lines with Unicode characters improved dramatically. However, some method calls appeared to be broken; for instance, the line number margin could not be set, and the control appeared unstable.
Nevertheless, the patch demonstrated that changing Scintilla's internal storage to UTF-16 little endian could significantly improve performance for Unicode applications on Windows.
I'm not surprised that changing the codepage to UTF-16 made the control unstable, because UTF-16 is not supported by Scintilla. Quoting myself:
UTF-8 is the only Unicode mode supported by Scintilla.
The relevant blurb in the Scintilla documentation can be found here: http://www.scintilla.org/ScintillaDoc.html#SCI_SETCODEPAGE.
Not to be a pessimist, but if you search through our issues list you'll find plenty of people who have come before you and wondered the same things -- how to improve Scintilla's performance and/or support alternate encodings. The short answer is that, yes, we might be able to squeeze some more performance out of ScintillaNET's UTF-8 (Scintilla's internal format) to UTF-16 (.NET's internal format) conversion, but there is really no better option for Scintilla's encoding than UTF-8. That is the only form of Unicode Scintilla supports.
If you have suggestions for improving the conversion that don't involve changing the codepage, I'm all ears.
Thanks for your reply. I knew that I was not the first one. The encoding issue is so common that almost every Windows programmer working with Unicode APIs will encounter it.
I commented on the thread there: https://sourceforge.net/p/scintilla/feature-requests/1175/ The author was reluctant to merge or work on the patch, for it was big, buggy and broken, unless someone else wrote better code. He probably also thinks that UTF-8 is the best encoding for Scintilla and that the conversion is trivial stuff.
Recently I also invested some time looking at another powerful, high-performance source code editor control, AkelEdit, the one hidden in the source code of the well-known editor AkelPad. I did not take much time to compare features between AkelEdit and Scintilla, but it does support Unicode, uses UTF-16 as its internal storage, and implements syntax highlighting, code folding, and code completion (via a plug-in). It is programmed in C and exposes Unicode interfaces. It might also be possible to write a wrapper around it and make another editor control.
Closing this as a known issue. We can re-open if necessary.
I used the VS 2017 debugger to monitor the performance of Scintilla and found that the performance of `SetStyling` was not good. It took about 1 second to style 700 tokens (700 calls to the `SetStyling` method), fewer than 5000 characters. No time-consuming operations took place between calls to that method. I examined the code and found that the `SetStyling` method called `DirectMessage` with `SCI_LINELENGTH`, `SCI_POSITIONFROMLINE` and `SCI_POSITIONRELATIVE` substantially, which might be the reason for the slowdown. Is there a way to skip some P/Invoke calls and make `SetStyling` faster?