snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License
757 stars 173 forks source link

Thread-safe version for C# / Moving the state out of the Stemmer instance #146

Open Turnerj opened 3 years ago

Turnerj commented 3 years ago

Currently from the generated code for C#, the stemmer uses an internal state to keep track of position etc. Unfortunately this prevents re-use of the same stemmer instance across multiple threads. Creating a new instance of a stemmer per thread (or in a web application, per request) means the constructor for the stemmer (with all the Among-type arrays generated) occurs frequently, generating the same data, creating additional GC pressure.

In moving the state into the main stem-processing method, it should also allow for the ability to also switch to using Span<char> to do a lot of the work where you have StringBuilder, avoiding further allocations too.

I'm a C# dev, not a C dev, so while I want to help wherever I can, I would be somewhat limited in my capacity to help in the generational side of things.

Turnerj commented 3 years ago

Looking at the compiler more closely, the Among-type declarations could be pushed to a static properties with a static constructor (thus are shared across instances of the same stemmer) which would relieve some pressure from creating new instances of a stemmer.

https://github.com/snowballstem/snowball/blob/8b49140e5cb6675c58bd908dbd4c7e12146a4f72/compiler/generator_csharp.c#L1194

https://github.com/snowballstem/snowball/blob/8b49140e5cb6675c58bd908dbd4c7e12146a4f72/compiler/generator_csharp.c#L1239

ojwb commented 3 years ago

Please open a PR if you want to propose a change - then the CI will test it, and I won't have to ham-fistedly try to interpret what you're suggesting through the lens of my very limited grasp of C#!

Turnerj commented 3 years ago

Hahaha - no problem, I'll give it a shot.

ojwb commented 3 years ago

Cool.

FWIW, making the data structures such as the Among stuff static sounds a good idea to me.

I wonder if you could inline all the snowball routines to give a single C# function so then the cursor, slice ends, limits, etc can just be local variables of that (as indeed could the Snowball variables). Recursion couldn't be handled, but I've never seen a Snowball stemmer which was recursive so in practice that isn't really a limitation. It's a bit ugly to only support a subset of the language for a particular backend though...

Turnerj commented 3 years ago

Yeah, it should be possible - I've seen something similar in someone else's version of the Porter algorithm in C# (theirs was pretty allocate-y as they kept regenerating strings throughout the calculating).

When doing so, it should be possible with maybe 2 or 3 sets of allocations:

Using Span<char> allows us to do slicing to logically remove characters from the start or end of the stemming process without allocations (it sits on the stack and is effectively a pointer with a length property).

In terms of being recursive, that should be possible without issue. We can pass a struct by reference to a recursive version of stem() or whatever function we need to.

ojwb commented 3 years ago

Oh, can you allocate that struct on the stack so it doesn't needs to be GC-ed?

Turnerj commented 3 years ago

Yep! Basically a Span<T> is just a span of memory - a string, an array or a real pointer. It lives on the stack and is specifically designed for high performance/low allocation scenarios.

var myArray = new char[] { 'A', 'B', 'C' };
var mySpan = myArray.AsSpan(); // A, B, C
var frontSlice = mySpan.Slice(0, 2); // A, B
var backSlice = mySpan.Slice(1); // B, C

While the initial array is an allocation that needs to be GC'd, the spans representing the portions of the array are entirely stack-based.

For the stemmer, this is really useful for all trimming logic as you can initially trim off bits you don't need via slicing without creating any allocations. When the stemmer needs to add a character, it would need to create a new char array and copy values to it.

eg.

var mySpan = new char[] { 'A', 'B', 'C' }.AsSpan();
var stem = mySpan.Slice(0, 2);

...

// Add "D" to the span
var tmp = new char[stem.Length + 1];
stem.CopyTo(tmp);
tmp[stem.Length] = 'D';
Turnerj commented 3 years ago

Fun fact though, you could also initialise a character array on the stack too:

Span<char> characters = stackalloc char[3];

This would help avoid even more allocations during processing though you would need to be careful not to allocate too much on the stack.

At the end of the stemmer, you would still have to allocate a final string but that is fine and makes sense.

Turnerj commented 3 years ago

Oh, I should note - the use of Span<T> is a new feature of C# 7 (we're on C# 9 as of last year). With .NET Framework, it requires a dependency on System.Memory (a backwards compat library by Microsoft). With .NET Core/.NET 5, it works out-of-the-box.

Quick background on .NET Framework/.NET Core - basically the future of C# and .NET is .NET Core/.NET 5. The older, slower and more annoying .NET Framework is effectively a legacy platform. It is still "supported" by Microsoft but that is due to it being a component of Windows...

Personally, I think you'll be fine with Span<T> and stackalloc. Anyone including the generated C# stemmer code into their project can easily add the one dependency if they are using .NET Framework.

Turnerj commented 3 years ago

Hmmmm, I looked at making the Among values static though it looks like a few stemmers do something I didn't expect - they have actions that are usually passed in an instance-method.

For example, see this with the Finnish stemmer:

...
                new Among("n", -1, 7),
                new Among("han", 11, 1),
                new Among("den", 11, -1, r_VI),
                new Among("seen", 11, -1, r_LONG),
                new Among("hen", 11, 2),
                new Among("tten", 11, -1, r_VI),
                new Among("hin", 11, 3),
                new Among("siin", 11, -1, r_VI),
                new Among("hon", 11, 4),
                new Among("h\u00E4n", 11, 5),
                new Among("h\u00F6n", 11, 6),
...

Where r_VI and r_LONG are:

        private bool r_LONG()
        {
            if (find_among_b(a_5) == 0)
            {
                return false;
            }
            return true;
        }

        private bool r_VI()
        {
            if (!(eq_s_b("i")))
            {
                return false;
            }
            if (in_grouping_b(g_V2, 97, 246, false) != 0)
            {
                return false;
            }
            return true;
        }

Would require a larger redesign of the generated output to be able to work around that.

ojwb commented 3 years ago

Yes - those are optional conditions on the among cases, used by a few stemmers.

I had an idea for the C stemmer to try generating nested switch statements for among, which I've seen used effectively elsewhere. I don't have a useful patch yet, but this would probably work for C# too, and would mean the among tables would be replaced by generated code at the call site and the Among class wouldn't exist.

ojwb commented 3 years ago

I've merged #147 with a tweak to leave amongs with functions as non-static, which gets us most of the benefit of that change without breaking Finnish, etc.

Turnerj commented 3 years ago

Thanks for the fix for #147 - makes sense to do what you've done. 🙂

I had an idea for the C stemmer to try generating nested switch statements for among, which I've seen used effectively elsewhere. I don't have a useful patch yet, but this would probably work for C# too, and would mean the among tables would be replaced by generated code at the call site and the Among class wouldn't exist.

That would be pretty clever if you work out how to do it and yeah, I don't see why that wouldn't work for C# too.

Might have a look at other optimizations we can do for C#, potentially relating to using nested switch statements - just need to get a bit more of a feel for the data structure of the generator. May need to change some of the Stemmer.cs private methods with some ideas I have. Hopefully with each step like this, we can make it more thread safe while also making it even faster.

ojwb commented 2 months ago

FYI: Snowball's Java generator was recently updated to use a char[] buffer to reduce string allocations and overhead: #195

That might be useful to look at as inspiration for implementing some of the changes discussed above. In particular, as well as providing the existing method to return the result as a string, the underlying buffer can be obtained by the caller instead, which avoids overhead there if they don't particularly want it as a string.