The data structure problem with active objects

sfan5 commented 6 months ago

There's a closely related older issue (#6453) but I felt like a fresh start here was better.

Problem

tl;dr: optimizing anything else server-side is basically a waste until this is solved, skip to the next section.

The nice people from the Your Land server sent me several profiling traces of their production server and a common theme is that about 40% of time (on the server thread) is spent just iterating the entity list (average, not peak!). Now their mods use get_objects_inside_radius a bunch, but this is not a sole cause and I expect any sizeable server will be plagued with this exact issue.

The issue are spatial queries on the entity list, which happen more often than one might think:

Lua get_objects_inside_radius, get_objects_in_area
Entity collision detection (sometimes)
Raycasts (sometimes)
ServerEnvironment::getAddedActiveObjects

All of these are O(n) with n being the number of entities. When you consider that every entity does collision detection this becomes O(n²) right in the main thread.

Solution requirements

What we need is a new data structure.

It must:

support fast lookup by entity ID
be iterable in the order of entity ID
support deletion and insertion
support fast spatial AABB queries (= getObjectsInArea) on the entity position
efficiently handle entity position updates, as these happen often
position updates must reflect in lookups instantly

external circumstances:

there is almost always opportunity for positions to change between lookups
- this means any kind of "one-off cache" (that needs to be thrown away once an update happens) is unlikely to produce any hits
spatial distance queries (= getObjectsInsideRadius) can be emulated on top of AABB queries and are not an important point
being safe from modification during iteration is also important (#13880) but I think we can tack that on if the rest is adressed

Finally we will also need to figure out how to reliably propagate position updates up to the container. There are many ideas in the form of "just add a cache to <thing>" that come to mind, but keeping it in sync is generally the trickier part than designing the structure.

Real-world example data

(thanks to @Bastrabun) There are 4500 entities. About 1400 move often, 3100 are static.

appgurueu commented 6 months ago

I am currently working on a solution to this. The data structure I'm implementing is a dynamization of k-d-trees, meaning it would store O(log n) many k-d-trees with sizes of powers of two. This means:

Amortized O(log n) insertions and O((log n)²) deletions
O(log(n) * n^(2/3) + k) worst case range queries (assuming distinct object positions) where k is the number of results, average case should be better
- To give an example of how good the performance can be, a near-point-query should have O((log n)²) time complexity.
Raycast queries can get special optimization. The basic "bounding area" optimization also helps a bit.

Josiah and I tried optimizing this by using an off-the-shelf R-tree library a while ago. This failed due to bad constant factors for our medium-size workloads. It's also a bit overkill: It is centered around boxes. It would have made supporting higher variance in box dimensions easier. But currently we don't need that; we can just approximate the relatively small objects as points for now.

I'm optimistic that with this new approach, I have a good chance of getting the constant factors low enough since there are plenty of knobs to tweak:

There are three possible ways to construct k-d-trees;
I can use a vector for up to, say, 32 or 64 entity ID - position pairs, and only use k-d-trees for more entries;
There are various ways of tweaking the "dynamization" data structure. For example we could use only a single k-d-tree, which would worsen updates to O(sqrt n log n), but improve queries to O(sqrt n + k) worst case (and much better in average case). This would require k-d-trees to be able to get "out of shape" though, so storing them in arrays wouldn't be an option anymore; we would need more allocations.

ExeVirus commented 6 months ago

I've had an idea tossing around for a while now on the subject, it would need optimized but generally:

Object ID vector of pointers (std vector) Std Vector of unused entries Std unordered map with three levels: Each level represents an Octree subdivision, With the bottom layer corresponding with the 16,16,16 mapblock boundary (to align with when entities need to be put to sleep or awoken). This bottom layer is two std::vector()s to hold unlimited entities, same on-erase mechanics as the main std vector list of objects. So, you have two lookup methods: by Octree bin (basically x,y,z) and by id in a vector. When insertions occur in the vector, they are mirrored in the unordered maps. When positions updates occur, just need to verify we haven't changed mapblock, if we have, erase our entry in that mapblock vector and insert into the new mapblock vector. So the tradeoff is position updates are more expensive. Finally, the main unordered map stores pointers, at least at the highest level, so as to keep RAM usage down. When I last looked at it, I think it was about 15 MB for the whole optimization stack at startup, with a couple kilobyte allocations when the first entities are inserted to mapblocks. Note: will be updating this comment as I do an actual design on the task - might be able to actually do this one. Update 1: Nice we already have an object manager class using a std::map structure with garbage collecting, so that takes care of my first vector manager. Wonder why we need to use maps? Like, couldn't the vector index be the object id? Update 2: The locations that for loop over all objects at ton during runtime are: ```c++ In activeObejctManager.cpp // Get vector sorted by distance around origin point within max distance ActiveObjectMgr::getActiveObjects(const v3f &origin, f32 max_d, std::vector &dest) // Raycasts ActiveObjectMgr::getActiveSelectableObjects(const core::line3d &shootline) // objects inside radius ActiveObjectMgr::getObjectsInsideRadius(const v3f &pos, float radius, std::vector &result, std::function include_obj_cb) // objects inside area ActiveObjectMgr::getObjectsInArea(const aabb3f &box, std::vector &result, std::function include_obj_cb) // Maintain Inactive/Active Objects ActiveObjectMgr::getAddedActiveObjectsAroundPos(v3f player_pos, f32 radius, f32 player_radius, const std::set current_objects, std::vector &added_objects) ```

sfan5 commented 6 months ago

So I took a walk and here's my idea:

It bases around the "one-off cache" approach I said wouldn't work but uses some tricks to make it workable. It also presumes a reliable mechanism exists by which the container is notified when an object changes position. Keep the `ModifySafeMap` as base data structure. A `cache` map is created as a `std::multimap`. A list/vector/set of object IDs is created called `uncached`. **The gist of it** is that objects exist in either `cached` (fast lookup) or `uncached` (linear lookup as before). Since there is no requirement that the cache is accurate at all times we can cleverly avoid (defer) rebuild costs. When an object is inserted it is put into the `uncached` list. When an object is deleted it is removed from the `uncached` list. When an object changes position it is added to the `uncached` list. The cache has to be (re-)created at some point. This should happen when the `uncached` list gets too large, let's say if it's at least 64 entries and >=10% of all objects. When an AABB query happens the `cache` is consulted (which speeds up the query, albeit only in one dimension). The `uncached` map is also consulted and takes priority over what the cache returned for a given object ID. spatial lookup performance: `O(log(n) + m)` (`n=cached.size(), m=uncached.size()`) the cache rebuild that would need to happen once in a while is `O(n * log n)` (n=all objects) --- let's go through the checklist: * lookup by ID: defer to base structure * iterable in ID-order: defer to base structure * deletion and insertion: no problem, described above * spatial queries: yep, this is what the `cache` is for * efficiently handle updates: virtually no cost * updates apply instantly: yes bonus: * safe from modification during iteration: no iterator needs to be held of the two new structures + the base structure is already safe * bookkeeping does not require knowing *from where* objects have moved, just that they have moved caveats: * doesn't work if too many objects constantly move --- If you read closely you will notice that the requirements for the `cache` structure are very minimal. It literally only needs to: * be constructable * supports spatial queries It can be transparently replaced with any other — better — data structure.

Bastrabun commented 6 months ago

Thanks for addressing this problem. If we can help any further, please let me know - whatever readouts, logs, temporary code compiled into the engine ...

For reference YL bugtracker 3723, 6325, 6326, 5289, administration 188

ExeVirus commented 6 months ago

Sfan: Why is just Z-position enough to cache? Does that speed up any of our lookups by position by enough to matter?

Also, why are position updates no cost?

sfan5 commented 6 months ago

Thanks for addressing this problem. If we can help any further, please let me know - whatever readouts, logs, temporary code compiled into the engine ...

I think the most likely next step when someone has working code is for you to test the performance in a real situation. ~~What would be useful to know right now however is: how many active objects do you have usually?~~

Sfan: Why is just Z-position enough to cache?

It may not be, but see the last section.

Also, why are position updates no cost?

After updating an object the cost is paid during every following spatial query, that's the m component in the big-O notation.

ExeVirus commented 6 months ago

Sfan: Why is just Z-position enough to cache?

It may not be, but see the last section.

Oh, that makes perfect sense. Okay yeah I can see a way to get my idea to fit in your world relatively painlessly.

Just take the int16(x) >> 4 | int16(y) >> 4 | int16(z) >> 4. That will put all entities into mapblock sized bins and easily fit in 64 bits.

Should still yield nice speedups for objects in radius, raycast, and whatnot, just gotta make the algorithms for those lookups. Could be as simple as running the same algorithms at 16x16x16 size first and iterate over those relevant bins instead of the whole active object list like we currently do.

Bin size should be played with, obviously. Because 1x1x1 bins wouldn't yield much value I suspect.

fluxionary commented 6 months ago

i tried to take a stab at a better data structure for this. the result, in lua, is here https://github.com/fluxionary/minetest-futil/blob/main/data_structures/point_search_tree.lua. the biggest issue with it is that it can't be re-balanced when new points are added or removed. i found a paper about that but never got around to implementing it https://arxiv.org/abs/1410.5420

appgurueu commented 6 months ago

I also have a Lua implementation of k-d-trees. I think trying to rebalance k-d-trees is a dead end. I tried for a while, it doesn't seem to really be feasible to me.

But it is possible to build an efficient dynamic data structure out of a forest of static k-d-trees (see "transforming static data structures to dynamic structures"), which is what I'm suggesting here and currently implementing. I can't tell whether this will hold up in real world usage, but implementing and testing it is the only way to find out. I'm optimistic that we can get decent constant factors because there are plenty of knobs to tweak.

ExeVirus commented 6 months ago

Thought through sfan's idea more (worked on it). There's also a remove (when object is no longer active) that is also O(log(n) + m)

appgurueu commented 6 months ago

Alright, I've got a seemingly working (according to a quick randomized test against a naive implementation) dynamic forest of k-d-trees data structure whipped up, feel free to take a peek at my branch. The big O for this should be good (though not quite as good as what I advertised initially; for example deletion is amortized O(log(n)²) rather than O(log n) with this implementation), what remains to be seen is how it holds up in practice (potentially after some optimization), and integration into the active object manager, which I may get to on the weekend.

ExeVirus commented 6 months ago

Will take a look, sfans idea was short enough that I just jumped right into server::active object mgr integration, the only not straight forward thing is how to signal object position updates. In mine I just check if last position is different from the new position after step().

I will probably not write tests for your structure directly, as MT is a much better stress test of complexity than I can imagine haha, but I can at least look for obvious things.

Not quite done with my version, but perhaps ready for integration tests against servers/mods by end of week.

ExeVirus commented 6 months ago

Okay, first draft version working correctly? I'm certain I have missed spots where an object position changes and I need to invalidate the cache, such as object:set_pos(), but I have data to back up that I have a working solution, albeit only 10% optimized.

Here is the branch:

I am testing with a custom mod with devtest, this is the init.lua:

```lua minetest.register_chatcommand("foo", { privs = { interact = true, }, func = function(name, param) for i = 1, 100 do local z = i-50 local obj = minetest.add_entity({x = i-50, y = 20, z = z}, "entity_test:test_entity", nil) obj:set_hp(z+1) end return true end, }) local function getObjectsInAreaTest() local before = minetest.get_us_time() local list = 0 for i = 1, 10 do list = minetest.get_objects_in_area({x=0,y=0,z=-50},{x=50,y=50,z=0}) end local after = minetest.get_us_time() minetest.chat_send_all(#list .. " " .. after-before) minetest.log(#list .. "," .. after-before) end local function callEveryHalfSecond() getObjectsInAreaTest() minetest.after(0.5, callEveryHalfSecond) -- Schedule itself to run again in 0.5 seconds end minetest.register_chatcommand("getList", { privs = { interact = true, }, func = callEveryHalfSecond, }) -- Define the test_entity (optional, replace with your desired entity definition) minetest.register_entity("entity_test:test_entity", { description = "Test Entity", initial_properties = { hp_max = 200, physical = true, collide_with_objects = false, collisionbox = {-0.3, -0.3, -0.3, 0.3, 0.3, 0.3}, visual = "wielditem", visual_size = {x = 0.4, y = 0.4}, textures = {""}, spritediv = {x = 1, y = 1}, initial_sprite_basepos = {x = 0, y = 0}, }, on_step = function(self) local pos = self.object:get_pos() pos.z = -50 + math.fmod(minetest.get_us_time()/30000+self.object:get_hp()-51, 100) self.object:set_pos(pos) end, }) ``` 1. I start new devtest world, flat mapgen 2. Go to 0,10,0 3. `/clearobjects` 4. `/foo` (creates the 100 cycling entities) 5. `/getList` (every half second, spits out the number of entities found and how long it took to find them all 10x in a row)

Here is the gif of what that test mod and integration test looks like in-game:

101 Entity Test

Then, I went ahead and plotted the results:

You'll notice that even though there are 101 Entities flying around, the actual time to getObjectsInArea 10 times is varying significantly based on where they are in relation to the getObjectsInArea's AABB.

Take a look at my implementation, I just always check the mapblocks to that max AABB extents, so it's sub-optimal where we are requesting a spherical radius, because I still just check all mapblocks from -radius to radius in all 3 axis centered around the origin point.

Still, surprised it was only about ~80 Loc for the structure. (also notice I set the mapblock size as my bin size. Perhaps smaller than 16, like 8, might be better. I wouldn't know without field testing (and it'll be game specific, actually)

Desour commented 6 months ago

Some feedback / thoughts:

* No need to be perfect in terms of performance. Any solution is acceptable for now, as it will outperform the current situation anyways. * luatic: You could use a bounding volume hierarchies (BVHs) in place of of k-d-trees. A big benefit would be that you can do refitting, instead of removing and inserting an object whenever it moves a tiny bit. For the BVH construction you can, for example, do the same splitting as for k-d-trees, you just have to also keep track of the bounding boxes. You don't have to be worried about the overlap: With k-d-trees you have to make your query AABB bigger by the upper bound of all object volumes. This makes the k-d-tree nodes effectively BVH nodes that always overlap by this upper bound. Actual BVH nodes can't have more overlap than that. * Do we actually currently have an upper bound for object collision/selection boxes? Because IIRC for selection boxes the answer is no. (Edit: Oh right, we're just worried about collision boxes.) Please make sure to properly support big objects, e.g. in the worst case by inserting them into a separate vector if they're too big. * You can do tree rebuilding in a separate thread, to get it out of the hot path. This would go especially well with sfan5's approach (which, if I didn't misunderstand, is basically a non-hierarchical version of what luatic does (aka what the linked paper describes)). * I do like ExeVirus's approach of using a hash map to mapblock-sized boxes. The good thing is that it makes the run time independent of total number of objects, instead it only depends on the number of objects nearby. However, idk yet how the best way would be to make it work well in arbitrary situations, like very big objects (need to update membership in so many mapblocks), very big query boxes (need to look into so many mapblocks), or one mapblock containing many objects (would be good to have a data structure in there too then, but which?). I'm thinking of something like an octree with a hashmap at each depth. But I can't come up with anything hierarchical / working well in generic cases, that is also *simple*.

ExeVirus commented 6 months ago

DS:

You can't see here, but I've made a lot of progress on the hash map solution for us. I'm only focusing on server side right now, so selection boxes aren't a consideration. That said, everything in Minetest must have a position within 2^16 of 0,0,0 which helps tremendously, allowing me to pack everything into buckets of size 16x16x16 nodes. We can revisit later if we want to make this changeable as a setting.

As for big objects, they just have a plain old single origin position, that's all that's needed right now.

My single unordered multi map, is a single layer Octree, and with 10,000 mostly evenly spaced out entities (current benchmark), it's quite performant, haha.

I agree this is a super dead simple solution and gets us good performance compared to what we have (by far). Will post a PR hopefully by end of day.

Performance of the map can absolutely be optimized later, it's just very flexible of a solution.

ExeVirus commented 6 months ago

Alrighty, there's my current merge^^^

Here's the performance results:

performance

Summary for the lazy:

Exactly the performance gains you'd expect from a spatial map: scales well when getting entities in radius/range (which happens for all collisions, I found out).
Has no effect on object creation - it is dwarfed by the cost to allocate memory for the new object

kno10 commented 3 weeks ago

Depending on your data distribution, either a k-d-tree, gridfile or an octree can be a good choice. Here I would assume that a grid based on the map chunks and then an octree within the map chunks can be very elegant and efficient (because the grid corresponds to loading/unloading of chunks, and this level is somewhat "proven coarse enough for many applications"). But for simplicity, a pure octree is also worth exploring (not that "one octree per chunk" is that much more complicated). I doubt that more fancy things such as R-trees, VP-trees, etc. are beneficial here.

ExeVirus commented 3 weeks ago

@kno10

Agree those are the options. I implemented a single-layer octree using std::multimap aligned at the mapblock (16x16x16) boundaries. That PR has gone through multiple reviewers and test rounds. At this point it is technically ready for merge.

However, one of the newer Core Devs, appgurueu, has been working on a dynamic k-d branch, working on it as of even last week.

We could absolutely just merge mine as is and probably get roughly the same performance, but we are waiting for appgurueu to be (a) done, (b) reviewed and (c) tested. That might take a long time.

I still support his efforts, which is why I have not made a stink about merging my implementation. It's simple, does get performance benefits, but this is a case where I'm willing to let great get in the way of good since I'm not a Core Dev.

kno10 commented 3 weeks ago

Speaking of a k-d-tree, one reason why I came to this old thread is biomes. Biome computations are currently linear in the number of biomes, but because of limitations of the current biome API, we see games try to trick the system by registering sub-biomes and variants (in particular, performing y slicing of biomes). At some point, the number of biomes may affect the performance of the biome lookup. A k-d-tree might be an interesting solution to improve the performance; it would likely be built only once at game start, over (currently) only the heat and humidity points of the biomes (i.e., 2d). Here, a k-d-tree is likely better because humidity and heat are more like normal distributed. Maybe @appgurueu can use some of his code for that, too?

minetest / minetest