push_back / insert vector element into the same vector

lawnjelly commented 5 years ago

This is somewhat of a duplicate of #16961 but I wanted to create a fresh more focused discussion, as that has a lot of distracting info discussing the particular symptoms in one case, but well done to Zylann for identifying the bug. It has remained unfixed for some time for such a potentially important bug.

Also see #29408 #31159 And PR #31609 and #31694

Issue description:

When doing certain operations on vector, push_back, insert, possibly append, problems can occur when the argument is also an element of the vector. This is because the vector may resize, and invalidate the memory location of the argument passed.

i.e.

vector v contains <34, 25>

v.push_back(v[0]);

-> v resizes to e.g. 4, copies 34, 25 to new memory address, frees the old memory
(which may be filled with garbage / written over)
-> copies garbage to v[2]

This has historically been a problem with STL vector as well, there is some discussion here as to whether it is required to be safe as part of the standard, and what implementations do: https://stackoverflow.com/questions/18788780/is-it-safe-to-push-back-an-element-from-the-same-vector

Bear in mind when coming to a solution we want to consider the speed implications in worst case scenarios, i.e. are we paying a cost in situations which don't even copy from within the same vector, because it may be used in tight loops with millions of iterations etc, or for large size data structures, or for complex copying.

With STL they were somewhat constrained by history, they may have had to make it work in all situations. On the other hand given that the godot vector is just used within godot, we have a little more leeway in considering solutions.

Some possibilities off the top of my head (I may well be missing some other obvious ones)

1. When resizing the array, don't delete the old array until the copy has completed

This is a nice solution, it would entail making the resize a two part process, so care would have to be taken in the implementation.

2. Always make a temporary copy of the argument

You are paying the price even when this isn't required

3. Only make a temporary copy of the argument when resizing is needed

A lot cheaper than 2, and compared to the cost of allocation / copying when resizing, the cost of a temporary would be small

4. Detect whether the argument is within the memory of the array, only make a copy if it is

Could be added as a condition to 3

5. Prohibit copying within a vector (leave it to the calling code to make the separation)

The fastest solution (tied with 1). We could for example in debug build of godot put in an assert check to make sure argument was not within the array memory. And use a slower version for calls from scripts. However, the difference in performance versus the ease of use of the other approaches may not be worth it, and 1 would be similar speed and solve the problem.

Note that a lot of these are tied in with the resizing, which profan currently has a PR in for #31694 It may be an idea to fix this as part of the same PR as they are somewhat interlinked.

oeleo1 commented 4 years ago

Is there any forthcoming consensus or work in progress here? Currently we experience random crashes and therefore a solution is needed for 3.2. Maybe a minimal & safe one shall go in if a full featured fix is too demanding for 3.2. PR https://github.com/godotengine/godot/pull/31609 works for me.

lawnjelly commented 4 years ago

Is there any forthcoming consensus or work in progress here? Currently we experience random crashes and therefore a solution is needed for 3.2. Maybe a minimal & safe one shall go in if a full featured fix is too demanding for 3.2. PR #31609 works for me.

I think last time we talked about this area the consensus was that rather than modify the existing Vector too much (in terms of allocation strategies) we should create a new (additional) vector that more closely resembles STL vector in behaviour. I don't know if anyone is currently working on this.

This does however mean that we would need a separate fix for the existing Vector (and any other classes this bug also affects). I could fix this if no one else is working on it, but fair warning it will probably slow down the existing Vector. The way to fix this properly imo without impacting performance would be to use a different allocation strategy. I'll try and do some performance testing of just enforcing a copy or a bounds check on the existing Vector (maybe today?).

If this is too expensive then yes #31609 might do as a stop gap, but it is not ideal as there may have been other similar accesses merged since, and users could also make unsafe calls in scripts / modules etc.

lawnjelly commented 4 years ago

There is also possible a compromise, we could make the existing push_back and other methods safe, then add alternative fast versions that could be used in anything speed critical that could guarantee not to violate the requirement of adding from within.

Then, if we in future did change the allocation strategy we just remove the fast methods, recompile and fix up the compile errors to point to the old method.

oeleo1 commented 4 years ago

Yes, the priorities are 1/ safety, then 2/ speed. Currently there is a big breach in safety.

lawnjelly commented 4 years ago

I've just done some initial performance testing (in place with the existing vector, rather than a testbed, and using a module compiled with -O3). As with all of this kind of testing it is subject to artefacts, but I compared 4 variations of push_back:

old version
always copy by value via a temporary
testing if memory address is within the vector, if within use copy via a temporary, else the old method
RandomShaper's version (similar to 3)

I tried this for both small (4 bytes, 2000000 push_backs) and large (1 megabyte, 20 push_backs) structures, and pushed values from test data that were outside the destination vector. And I used multiple runs for each and summed the results, discarding the first run.

Small allocations

Large allocations

These new measurements now largely match the predictions. The extra testing of memory addresses increases the cost in small allocations, but in large allocations the cost of always making the copy makes approach (2) slower.

oeleo1 commented 4 years ago

@lawnjelly This sounds good, but what exactly 1. old version means ? If this means the current buggy code then approaches (1) and (3) are out. If not and (1) is safe, and assuming the performance is comparable cf. your tests, I'd say code simplicity between (2) and (3) shall prevail.

lawnjelly commented 4 years ago

RandomShaper has already done a fix, it's version (3) from my second list I think from a brief look. :+1:

(3) is fine because it does a check to see whether the push_back argument is within the vector. If it is, it makes a copy (so no problem), and if it isn't then no need to make a copy (so no problem).

Sorry rather confusingly, version (3) from my testing is option (4) in the original list at the top post.

RandomShaper commented 4 years ago

@lawnjelly, I've implemented 4.

Your findings about performance are interesting. I've also found some things in Vector that I don't quite like, like that set(..., get(...)) in a loop when inserting.

Regarding the reallocation behavior, I think most implementations have some allocation intelligence, but we should not rely on that, not being a standard thing. Furthermore, that would involve keeping calling the standard allocation function so it does its bookkeeping, which would be less optimal than having Vector taking some care itself.

oeleo1 commented 4 years ago

@lawnjelly Would it make sense to run your tests with @RandomShaper 's PR https://github.com/godotengine/godot/pull/34266 and confirm/refute performance stability? I'll run my game with it too to check how it goes -- it used to crash systematically without a patch.

lawnjelly commented 4 years ago

Getting a little off topic here now .. after some investigation I'm now not so sure the Godot vector is as bad as I initially feared (I did some quick profiling to compare with stl vector). :+1: One big problem for me personally is the source code is rather sparsely commented, and the interaction between cowdata and vector is far from clear. This could really do with being improved imo given how central vector and cowdata is to the engine.

The other thing is that the responsibility between the 'vector' behaviour of vector and cowdata seems blurred. If I understand it right, most of the vector behaviour is actually occurring in cowdata (the 2x allocations). This may be what you are getting at RandomShaper?

In cowdata resize, somewhat easy to miss (comments?) it seems to do a power of 2 to determine the allocation size to make (which is something you'd more expect in vector), then it relies on the OS / runtime realloc to deal with noticing that the allocation size is identical in most push_backs:

(this is some test logging of cowdata realloc push_back with a vector of 32 bit values)

realloc to p_size 1,    alloc_size 4
realloc to p_size 2,    alloc_size 8
realloc to p_size 3,    alloc_size 16
realloc to p_size 4,    alloc_size 16
realloc to p_size 5,    alloc_size 32
realloc to p_size 6,    alloc_size 32
realloc to p_size 7,    alloc_size 32
realloc to p_size 8,    alloc_size 32
realloc to p_size 9,    alloc_size 64

It also seems to rely on realloc to copy most of the data about if necessary. I couldn't quite decide whether the power of 2 in cowdata was some kind of method to deal with fragmentation / make allocations easier, or whether it was to prevent over frequent allocation from classes that used cowdata. If the latter, I was left wondering why not just have classes that use cowdata use vector instead, and put the power of 2 in vector? Maybe there is good reason though it can be difficult to guess these things.

This is just my interpretation though, it may be wrong. Really I can't guess the intended method, which is where comments can help.

Anyway this suggests that the vector reallocation / copying may not have been the reason for similar timings after all - I'm now thinking towards them being either an artefact, or memory access being the rate limiting step (i.e. the CPU has plenty of time to spare waiting for cache, so can do things like make copies and check memory addresses for free).

lawnjelly commented 4 years ago

@lawnjelly Would it make sense to run your tests with @RandomShaper 's PR #34266 and confirm/refute performance stability? I'll run my game with it too to check how it goes -- it used to crash systematically without a patch.

Yup sure I'll have a look tomorrow. I'm sure they will be fine though. :+1:

oeleo1 commented 4 years ago

PR https://github.com/godotengine/godot/pull/34266 works for me. Good job!

Re: cowdata realloc, the roundup to a power of 2 is not a big surprise. Statistically power of 2 rules for mallocs which aim at being very fast without causing a lot of fragmentation and without having additional knoweldge of the allocation patterns. However in our case, requesting a power of 2 from the underlying allocator makes it effectively allocating a touch more, the additonal bytes being for internal block management/marking. Overall this results in allocating more than the requested power of 2 and obviating all the care the power of 2 is used for at the user level. But since we don't know anything about the underlying allocator and its block overhead needs, better stick to a simple scheme and the roundup to a power of 2 is a natural choice.

lawnjelly commented 4 years ago

I also noted that as a consequence of using realloc, are we implicitly not supporting user-defined copy constructors? Presumably the stl version doesn't use realloc.

http://www.stroustrup.com/bs_faq2.html#renew

However in our case, requesting a power of 2 from the underlying allocator makes it effectively allocating a touch more, the additonal bytes being for internal block management/marking.

The storing of the size as part of the block is an interesting way of keeping the memory use of a blank vector to a minimum, at a slight cost to readability and the possibility of cache misses in some circumstances. The maximum size seems to be kept implicitly in the runtime via the realloc.

lawnjelly commented 4 years ago

Runs fine on mine! Quick performance test, I compared the new RandomShaper PR push_back with:

int s = vec.size();
vec.resize(s+1);
vec.set(s, value);

This should (I hope) roughly emulate the old version. In release_debug, with test module compiled with -03 I'm getting time taken:

New method : 802 
Old method : 734

This is repeated timings of 2000000 push_backs of vector of 32 bit value.

I've realised why the earlier timings I did may well have been invalid and likely why they showed no difference, I will try them again later. My IDE was running the debug IDE when I had been compiling the release IDE doh!!

This does show the new method as being slightly slower, which is what we would more likely expect.

RandomShaper commented 4 years ago

This does show the new method as being slightly slower, which is what we would more likely expect.

Hmm, I didn't expect such a big difference (~10% slower) for that test, since resize just checks the backup parameter and sees it's NULL.

I also noted that as a consequence of using realloc, are we implicitly not supporting user-defined copy constructors?

Good point. While it's true that _copy_on_write checks if the type is trivially copyable or not, the reallocation happens later and I don't think it's correct to move objects if they are not trivially copyable, so resize would need to check again. A reasonable solution would be that _copy_on_write had the option to do it's copy-constructor-aware copy to a differently sized memory region. More cases to be considered, though.

lawnjelly commented 4 years ago

Good point. While it's true that _copy_on_write checks if the type is trivially copyable or not, the reallocation happens later and I don't think it's correct to move objects if they are not trivially copyable, so resize would need to check again. A reasonable solution would be that _copy_on_write had the option to do it's copy-constructor-aware copy to a differently sized memory region. More cases to be considered, though.

I actually feel this is one of those cases that sticking to the letter of the c++ law can bite you. I would be tempted to just disallow such things, as trying to move such non-relocatable objects could be causing hidden performance issues. Some other structure like a linked list might be more appropriate (maybe with a vector index).

Certainly worth mentioning in the comments for the class, or even better flagging a compilation error if this is possible.

lawnjelly commented 4 years ago

This does show the new method as being slightly slower, which is what we would more likely expect.

Hmm, I didn't expect such a big difference (~10% slower) for that test, since resize just checks the backup parameter and sees it's NULL.

It could be the jiggery pokery with the backup and the actual checking of the memory address that is slowing it, in the case of small allocations. Copying 4 bytes via a register is virtually nothing, but copying 64 bit pointers and pointer comparisons may be slower than just copying the value just in case. On the other hand, with large allocations checking the memory address seems to really pay off.

See the updated timing data earlier in the thread. It is giving far more sensible results now.

I'd encourage independently testing this though, because timing alternative methods is notoriously difficult to get right, it often takes many successive attempts to prevent the compiler optimizing out the test (as me and Puthre found when examining inverse square roots), so always treat timing info with a healthy dose of scepticism I think, especially mine! :smile:

I'll try and find a crossover point where testing the address is cheaper, out of interest. And in any case, as @oeleo1 says I think we need to go for safety even if it is slightly slower than the original method.

RandomShaper commented 4 years ago

Good points.

Aside, I think that for 3.x fixing the auto-insert would be enough, especially if it helps fixing some related bugs.

oeleo1 commented 4 years ago

A slowdown of less than 10% for 2M push_backs is perfectly fine considering 1/ the criticality of the issue and 2/ this is the worst case scenario, knowing that push_backs are not a common path for the majority of use cases. I don't think 3.2 deserves any more optimizations and the fix is good as is.

lawnjelly commented 4 years ago

I'm happy with the fix as is too. But it is worth noting that for small allocations it is viable to swap the implementation at compile time, because it's a template, it should get compiled out:

i.e.

const SOME_THRESHOLD = 32;
if (sizeof (T) <= SOME_THRESHOLD)
{push_back_safe_value}
else
{push_back_safe_check_addr}

Also note this testing was with a pod type, if you had something with an expensive user defined copy constructor this might not be a performance win. But then copy constructors aren't supported in the godot vector (see earlier).

pod types

It reminds me of another related point to beware of using vectors (and related) : simple things like constructors are not necessarily free. For instance every time you push_back a Vector3 it potentially first runs the constructor (settings x, y, z to 0) then it copies the data to the Vector3.

In the case of a push_back it is possible that the compiler could optimize out the constructor (I haven't tested this) but in the case of a resize it is unlikely. For this and similar reasons in my own code I tend to use pod types for simple structures and initialize them explicitly when required. This is a safety versus performance thing.

Also note that with a zeroed data type (e.g. in godot Vector3) you can potentially more efficiently resize lots of new elements by reserving the memory, not running the constructor and memsetting to zero. (not with anything with a vtable pointer though)

copy on write

Another improvement particularly that I think RandomShaper referenced with the set / get loop when inserting:

If you are adding a large number of elements, there is no need to do the copy on write check on every set. It makes more sense to do the copy on write once, then have a cheaper set call. Of course with great power comes great responsibility, more opportunity for bugs from the caller if they use such an optimized version. And there are plans to add a non-copy on write vector in Godot 4, which will sidestep the problem, however it still might be worth using a non-cow set from within the old vector itself, which still leaving it safe to outside callers.

In the case of the set/get loop when inserting, there is an even simpler solution: Use memmove.

memmove

As we are already disallowing user defined copy constructors, it is safe to simply copy the memory as a block. memmove is designed to deal with the situation of overlapping buffers (as opposed to memcpy).

lawnjelly commented 4 years ago

Well this has been quite difficult to test, partly because it's inside Godot. I did try removing the relevant parts of core to compile them in a testbed but there are a lot of interdependencies. Also I could only compile with release_debug as the editor doesn't compile with release I think. However the test module (which used the vector template) was compiled with -O3.

There was quite significant variation between runs, maybe partly because the vector does much than just add the value (the allocation, potential copying which are subject to random variation), and maybe other stuff the OS was dealing with at the same time as running. I could remove that first factor in a testbed, but then that could conceivably be important in determining which works best 'in place'.

I also tried the unlikely macro on a suggestion from TMM. This didn't have a massive effect for me, but it may be compiler / system dependent (as are all these results).

I didn't find a massive amount in it between the methods. I think there's some evidence that a copy by value without a check may be slightly faster (< 5% faster ?) with sizeof (T) at around 16 bytes and below (maybe strings?). You could put in a compile time check for this and select the method according to sizeof (T), but it's probably not worth it unless we identify it as a bottleneck later on.

What I did find was that:

RandomShaper's was consistently a 2-3 percent slower than the (very simple) testing I did for the address check, I'm not sure exactly why at this stage, RandomShaper's is doing a fair bit more though.

There is some evidence that the old trick of combining the address checks into 1 is slightly faster than doing 2 independent checks.. having the range below the start wrap to a very high value by unsigned math i.e.

    // test whether p_elem is within the vector
    uintptr_t addr_start = (uintptr_t) _cowdata.ptr();
    uintptr_t addr_size = size() * sizeof (T);
    uintptr_t addr_elem = (uintptr_t) &p_elem;

    addr_elem -= addr_start;

    // most common case, outside vector
    if (addr_elem >= addr_size)
    {
        return push_back_unchecked(p_elem);
    }

    // less common case, copying within the vector, needs an intermediate
    return push_back_safe_value(p_elem);

Again, your mileage may vary. Or I might have made a mistake in that code lol. :smile:

Really though this is nitpicking, there's no massive advantage of any, so imo I think any of these approaches would be fine. :+1:

oeleo1 commented 4 years ago

@lawnjelly Looks like you had quite some fun here showing that at the end of the day further optimizations are in the noise area. Good to know your test results confirm that.

If there is consensus that @RandomShaper PR is safe, simple enough and does not stand in the way for 3.2 and 4.0, which is also my opinion, let’s say we’re done and let @akien-mga get it in pending approval.

RandomShaper commented 4 years ago

I'll add the unlikely to my PR.

akien-mga commented 4 years ago

Some of the most pressing issues should be mitigated by #34618, but many of the concerns raised are planned to be addressed further for 4.0.

lawnjelly commented 4 years ago

Unless anyone objects, closing this for now as it should be fixed by #34618.

oeleo1 commented 4 years ago

No objection. Tested my project which used to crash with beta5 - all is well. Thanks!

godotengine / godot