godotengine / godot-proposals

Godot Improvement Proposals (GIPs)
MIT License
1.17k stars 98 forks source link

Optimize `String` with Small String Optimization (SSO) #11241

Open YYF233333 opened 3 days ago

YYF233333 commented 3 days ago

Describe the project you are working on

Godot Editor

Describe the problem or limitation you are having in your project

String usually spend a lot of time doing heap alloc stuff (in resize). This phenomenon can be easily detected by profiling editor workload (add/remove node, undo/redo, etc) since editor utilizes String a lot.

Describe the feature / enhancement and how it helps to overcome the problem or limitation

As what was discussed in #77158, Small String Optimization (SSO) improve string performance by eliminating allocation for small/short strings.

To prove we do have small strings, I hook String::resize to get the distribution of string size, the results are like below:

Collected from opening project manager and opening an empty project. Note that last column represent all size beyond 64.

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

Store string shorter than certain size (e.g. 32 bytes) direct in String struct. Only alloc if it grow large enough. Should take care of the COW semantic.

If this enhancement will not be used often, can it be worked around with a few lines of script?

No, it is core.

Is there a reason why this should be core and not an add-on in the asset library?

It is core.

Calinou commented 3 days ago

Remember that Strings in Godot internally use UTF-32 for its fixed width since 4.0, so the threshold of short string optimization would likely be 8 characters for 32 bytes.

I'm surprised to see such a spike of string resize values at 24 and 25 too. Are these values in bytes or character counts?

YYF233333 commented 3 days ago

It is the value passed to String::resize so character counts. The conclusion should be same though.

Print is so slow I haven't test real workloads, will update the results if I find a way.

Ivorforce commented 3 days ago

I was interested in how string concatenation can lead to such performance problems (as discussed in https://github.com/godotengine/godot/pull/77158), so I had a look at the relevant source code myself. I like both all of what @komugi1211s, @SlugFiller and @YYF233333 had to say on the matter. There are a lot of good thoughts here. I, too, stumbled upon some very curious uses of clock cycles in the code.

One example is the use of strlen in a lot of places. Especially, in StringBuilder and StringBuffer both. Do we interact with 20 years old C code so much that we don't have size knowledge? I tracked down the uses of the code. Here's one innocuous one: shader_gles3.cpp. It's passing in a constant string and immediately computing its size in the append implementation. Yikes! But what's that one line below? It's a reference to p_vertex_code, which is passed down into the function from elsewhere. Where's that string coming from? Why, it's from canvas.glsl.gen.h! The strings in this file make up about 40k characters, which are parsed for length not just once, but twice (_setup and _add_stage)! And then anew for each shader that's created!

There are a few other strange choices, for example the use of 3 different (COW, no less) Vectors in StringBuilder instead of a single one, leading to (almost) thrice the amount of RAM allocations and opening the code up to cache contentions. Not to mention re-allocating these every few calls, leading to even worse caching problems — even though most often, the exact strings to be concatenated are well-known at compile time.

No wonder string handling in Godot is so slow! I feel like if we want to make actual performance improvements to string related code, we should:

I think SSO is a great idea too! It may help with cache contentions and unnecessary allocations. But it may be worth it to check first where all these small strings are allocating from, to see if there aren't more fundamental problems that should be resolved first.

SlugFiller commented 3 days ago

I agree with most of what @Ivorforce says, with one exception:

  • Start using std::variant in StringBuilder to avoid problems with the rest of cases where the number of strings is not known at compile time, to optimize the cache contentions and RAM waste.

This would reduce the number of Vectors, but wouldn't change the fact that you're using a Vector, with reallocations, in the first place, in which case it's better too have a buffer with reallocation instead. For the simple reason that if the number of allocations during append is the same, then the reallocating buffer is actually, over all, one allocation less, because it removes the final allocation for the complete buffer.

Also, a small issue with not using const char* is that C/C++ does not offer a string constant with precalculated length. I'm not sure if it's possible to force the length to be calculated at compile time.

clayjohn commented 3 days ago

I think this is a good idea. In my profiling I found that StringBuffer was still slightly faster than using String directly, even for larger strings like what are used in the shader compiler. I suspect it is due to StringBuffer explicitly Po2 resizing the String ahead of time. When using String directly we do end up doing Po2 allocations since it uses Vector internally, but there is a lot of bookkeeping around it. By manually doing a Po2 resize, we skip that bookkeeping.

Ivorforce commented 3 days ago

This would reduce the number of Vectors, but wouldn't change the fact that you're using a Vector, with reallocations, in the first place, in which case it's better too have a buffer with reallocation instead. For the simple reason that if the number of allocations during append is the same, then the reallocating buffer is actually, over all, one allocation less, because it removes the final allocation for the complete buffer.

I saw your tests, and was impressed by how much of a difference it made! I think it will be less pronounced with the giant ~20k char strings we're dealing with in the shader code though, since every reallocation thereafter would lead to ~20kb RAM alloc and dealloc. But I'm open to test!

Also, a small issue with not using const char* is that C/C++ does not offer a string constant with precalculated length. I'm not sure if it's possible to force the length to be calculated at compile time.

Luckily it is possible! Here's some code for that:

    template <size_t len>
    UnownedString(const char (&p_cstring)[len]) {
        return UnownedString { p_cstring, len };
    }
SlugFiller commented 3 days ago

every reallocation thereafter would lead to ~20kb RAM alloc and dealloc

I've previously explained in the RocketChat, but this only applies for WebAssembly, since it uses a custom memory allocator (provided by Emscripten). Every platform that uses a platform-provided reallocator, or otherwise has access to memory mapping, will not actually deallocate the memory before reallocating.

Instead, it will simply remap all the pages other than the first and last ones into new addresses, to create the illusion of a continuous span of memory addresses, even though it's only continuous in virtual address space.

If a page size is 4kb (the default for Linux), then a ~20kb realloc is actually just 5 page moves, each being hardly more expensive than a function call.

Ivorforce commented 3 days ago

Very interesting! I definitely want to test this. But yes, I agree that means there's all the more reason to question the current setup of the code.

YYF233333 commented 3 days ago
  • Add a template <typename... Args> concatenate_strings(Args... args) { ... } function that uses the stack to join input strings, avoiding unnecessary allocation.

Interesting idea, I think that should be paired with some stack-based string (e.g. std::array)?

Actually, I do have thought about move all short string to stack (and facing the risk of stackoverflow :) ).

But it may be worth it to check first where all these small strings are allocating from, to see if there aren't more fundamental problems that should be resolved first.

There are, actually tons of them. For example just take a look at vformat and String::sprintf. They just lay too deep, changing them you end up with changes everywhere. I do not dare to touch that, but yes I'm willing to see refactors in these place.

Ivorforce commented 3 days ago

Interesting idea, I think that should be paired with some stack-based string (e.g. std::array)?

typename... Args is needed to support mixed-type inputs (String and char* / UnownedString). std::array can only handle same-type elements.

There are, actually tons of them. For example just take a look at vformat and String::sprintf. They just lay too deep, changing them you end up with changes everywhere. I do not dare to touch that, but yes I'm willing to see refactors in these place.

Yeah, I can see how those uses would be intimidating to adjust, 'just' for a bit of String optimization. I do think it would be doable, but it would involve quite a bit work.

YYF233333 commented 3 days ago

By the way, should we change the title? I think the discussion has been far beyond SSO.

Ivorforce commented 3 days ago

We diverged a bit because it is an interesting topic, but I think it would be best to keep this proposal for SSO and branch off to other proposals for other changes. I have opened #11249 for the strlen calls, and we have https://github.com/godotengine/godot/pull/77158 for StringBuilder vs StringBuffer (although it may be worth it to add a proposal for this one as well, as its merge seems to have been stalled for quite a while due to its implications, and may need more data / a clear write-up to convince maintainers for a merge).