Open YYF233333 opened 3 days ago
Remember that Strings in Godot internally use UTF-32 for its fixed width since 4.0, so the threshold of short string optimization would likely be 8 characters for 32 bytes.
I'm surprised to see such a spike of string resize values at 24 and 25 too. Are these values in bytes or character counts?
It is the value passed to String::resize
so character counts. The conclusion should be same though.
Print is so slow I haven't test real workloads, will update the results if I find a way.
I was interested in how string concatenation can lead to such performance problems (as discussed in https://github.com/godotengine/godot/pull/77158), so I had a look at the relevant source code myself. I like both all of what @komugi1211s, @SlugFiller and @YYF233333 had to say on the matter. There are a lot of good thoughts here. I, too, stumbled upon some very curious uses of clock cycles in the code.
One example is the use of strlen
in a lot of places. Especially, in StringBuilder and StringBuffer both. Do we interact with 20 years old C code so much that we don't have size
knowledge? I tracked down the uses of the code. Here's one innocuous one: shader_gles3.cpp. It's passing in a constant string and immediately computing its size in the append
implementation. Yikes!
But what's that one line below? It's a reference to p_vertex_code
, which is passed down into the function from elsewhere. Where's that string coming from? Why, it's from canvas.glsl.gen.h! The strings in this file make up about 40k characters, which are parsed for length not just once, but twice (_setup
and _add_stage
)! And then anew for each shader that's created!
There are a few other strange choices, for example the use of 3 different (COW, no less) Vectors
in StringBuilder
instead of a single one, leading to (almost) thrice the amount of RAM allocations and opening the code up to cache contentions. Not to mention re-allocating these every few calls, leading to even worse caching problems — even though most often, the exact strings to be concatenated are well-known at compile time.
No wonder string handling in Godot is so slow! I feel like if we want to make actual performance improvements to string related code, we should:
char*
(and strlen
especially...) throughout the codebase, replacing with a little struct UnownedString { const char * string, size_t length }
and pass those around.
String
and StringName
too.template <typename... Args> concatenate_strings(Args... args) { ... }
function that uses the stack to join input strings, avoiding unnecessary allocation.std::variant
to linearize memory use in StringBuilder
, avoiding problems with the rest of cases where the number of strings is not known at compile time, to reduce cache contentions and RAM waste.I think SSO
is a great idea too! It may help with cache contentions and unnecessary allocations. But it may be worth it to check first where all these small strings are allocating from, to see if there aren't more fundamental problems that should be resolved first.
I agree with most of what @Ivorforce says, with one exception:
- Start using
std::variant
inStringBuilder
to avoid problems with the rest of cases where the number of strings is not known at compile time, to optimize the cache contentions and RAM waste.
This would reduce the number of Vector
s, but wouldn't change the fact that you're using a Vector
, with reallocations, in the first place, in which case it's better too have a buffer with reallocation instead. For the simple reason that if the number of allocations during append is the same, then the reallocating buffer is actually, over all, one allocation less, because it removes the final allocation for the complete buffer.
Also, a small issue with not using const char*
is that C/C++ does not offer a string constant with precalculated length. I'm not sure if it's possible to force the length to be calculated at compile time.
I think this is a good idea. In my profiling I found that StringBuffer was still slightly faster than using String directly, even for larger strings like what are used in the shader compiler. I suspect it is due to StringBuffer explicitly Po2 resizing the String ahead of time. When using String directly we do end up doing Po2 allocations since it uses Vector internally, but there is a lot of bookkeeping around it. By manually doing a Po2 resize, we skip that bookkeeping.
This would reduce the number of Vectors, but wouldn't change the fact that you're using a Vector, with reallocations, in the first place, in which case it's better too have a buffer with reallocation instead. For the simple reason that if the number of allocations during append is the same, then the reallocating buffer is actually, over all, one allocation less, because it removes the final allocation for the complete buffer.
I saw your tests, and was impressed by how much of a difference it made! I think it will be less pronounced with the giant ~20k char strings we're dealing with in the shader code though, since every reallocation thereafter would lead to ~20kb RAM alloc and dealloc. But I'm open to test!
Also, a small issue with not using const char* is that C/C++ does not offer a string constant with precalculated length. I'm not sure if it's possible to force the length to be calculated at compile time.
Luckily it is possible! Here's some code for that:
template <size_t len>
UnownedString(const char (&p_cstring)[len]) {
return UnownedString { p_cstring, len };
}
every reallocation thereafter would lead to ~20kb RAM alloc and dealloc
I've previously explained in the RocketChat, but this only applies for WebAssembly, since it uses a custom memory allocator (provided by Emscripten). Every platform that uses a platform-provided reallocator, or otherwise has access to memory mapping, will not actually deallocate the memory before reallocating.
Instead, it will simply remap all the pages other than the first and last ones into new addresses, to create the illusion of a continuous span of memory addresses, even though it's only continuous in virtual address space.
If a page size is 4kb (the default for Linux), then a ~20kb realloc is actually just 5 page moves, each being hardly more expensive than a function call.
Very interesting! I definitely want to test this. But yes, I agree that means there's all the more reason to question the current setup of the code.
- Add a
template <typename... Args> concatenate_strings(Args... args) { ... }
function that uses the stack to join input strings, avoiding unnecessary allocation.
Interesting idea, I think that should be paired with some stack-based string (e.g. std::array
)?
Actually, I do have thought about move all short string to stack (and facing the risk of stackoverflow :) ).
But it may be worth it to check first where all these small strings are allocating from, to see if there aren't more fundamental problems that should be resolved first.
There are, actually tons of them. For example just take a look at vformat
and String::sprintf
. They just lay too deep, changing them you end up with changes everywhere. I do not dare to touch that, but yes I'm willing to see refactors in these place.
Interesting idea, I think that should be paired with some stack-based string (e.g.
std::array
)?
typename... Args
is needed to support mixed-type inputs (String
and char*
/ UnownedString
). std::array
can only handle same-type elements.
There are, actually tons of them. For example just take a look at
vformat
andString::sprintf
. They just lay too deep, changing them you end up with changes everywhere. I do not dare to touch that, but yes I'm willing to see refactors in these place.
Yeah, I can see how those uses would be intimidating to adjust, 'just' for a bit of String optimization. I do think it would be doable, but it would involve quite a bit work.
By the way, should we change the title? I think the discussion has been far beyond SSO.
We diverged a bit because it is an interesting topic, but I think it would be best to keep this proposal for SSO
and branch off to other proposals for other changes. I have opened #11249 for the strlen
calls, and we have https://github.com/godotengine/godot/pull/77158 for StringBuilder
vs StringBuffer
(although it may be worth it to add a proposal for this one as well, as its merge seems to have been stalled for quite a while due to its implications, and may need more data / a clear write-up to convince maintainers for a merge).
Describe the project you are working on
Godot Editor
Describe the problem or limitation you are having in your project
String
usually spend a lot of time doing heap alloc stuff (inresize
). This phenomenon can be easily detected by profiling editor workload (add/remove node, undo/redo, etc) since editor utilizesString
a lot.Describe the feature / enhancement and how it helps to overcome the problem or limitation
As what was discussed in #77158, Small String Optimization (SSO) improve string performance by eliminating allocation for small/short strings.
To prove we do have small strings, I hook
String::resize
to get the distribution of string size, the results are like below:Collected from opening project manager and opening an empty project. Note that last column represent all size beyond 64.
Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams
Store string shorter than certain size (e.g. 32 bytes) direct in
String
struct. Only alloc if it grow large enough. Should take care of the COW semantic.If this enhancement will not be used often, can it be worked around with a few lines of script?
No, it is core.
Is there a reason why this should be core and not an add-on in the asset library?
It is core.