mikepound / opencubes

A community improved version of the polycubes project!
MIT License

Optimize Rotations::rotate() more #20

Closed JATothrim closed 1 year ago

JATothrim commented 1 year ago

Bunch of small optimizations I made:

this:

N = 12 || generating new cubes from 2522522 base cubes.
66 sets by shape for N=12
converting to vector
starting 2 threads
done took 98.79 s [ 0, 1261261]
done took 114.17 s [1261261, 2522522]
num cubes: 18598427
./build/cubes -n 12 -t 2  245.05s user 1.88s system 180% cpu 2:16.76 total

versus:

N = 12 || generating new cubes from 2522522 base cubes.
66 sets by shape for N=12
converting to vector
starting 2 threads
done took 119.84 s [ 0, 1261261]
done took 139.10 s [1261261, 2522522]
num cubes: 18598427
./build/cubes -n 12 -t 2  297.87s user 2.04s system 180% cpu 2:46.07 total

N=12 is the biggest that I can reasonably test, and most of the time is still spent sorting the Cube rotations...

My idea was that the fewer memory allocations we do, the better multi-threading will scale, since the threads are not stuck waiting on the memory allocator. I have follow-up changes that require this patch in order to reduce the allocations even further.

NailLegProcessorDivide commented 1 year ago

I haven't looked into the code or done any profiling at all, but this patch seems to significantly decrease performance on my machine for some reason, at both 2 and 24 threads. Your performance seems a lot better somehow.

Measuring the system time of the full run using time ./cubes -n 12 -t 2 gives me:

24t main: 15s
24t patch: 2m21.721s
2t main: 1m16
2t patch: 11m26

JATothrim commented 1 year ago

24t main: 15s
24t patch: 2m21.721s
2t main: 1m16
2t patch: 11m26

Did you build it with the Release config? That looks like a Debug build, or something is very wrong with my code...

NailLegProcessorDivide commented 1 year ago

How do you specify the config in cmake/make? I've been trying to figure it out for 20 minutes, and all it's done is remind me why I never use cmake. I ran cmake .. && make for both, but if you changed the default in your patch then that could be the difference.

bertie2 commented 1 year ago

Hi, yes, good point @NailLegProcessorDivide. The correct way to build Release in cmake is:

cmake .. -DCMAKE_BUILD_TYPE=Release

but this is, in fact, down-in-the-weeds cmake nonsense. I will make Release the default and add instructions for building Debug to the README.
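
For reference, a minimal CMake sketch of making Release the default when no build type is passed, assuming a single-config generator; this is the common idiom, not necessarily the exact change that landed in the repo:

```cmake
# If the user did not pass -DCMAKE_BUILD_TYPE=..., default to Release.
# Only relevant for single-config generators such as Makefiles or Ninja.
if(NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
    set(CMAKE_BUILD_TYPE "Release" CACHE STRING "Build type" FORCE)
    set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS
                 "Debug" "Release" "RelWithDebInfo" "MinSizeRel")
endif()
```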

NailLegProcessorDivide commented 1 year ago

Thanks, I just got it; I was missing the -D. New numbers:

24t origin/main: 14.9s
24t 2e7f9d5: 12.2s
2t origin/main: 1m14
2t 2e7f9d5: 1m1

So I now see a reasonable improvement.

JATothrim commented 1 year ago

@bertie2 Yeah, I goofed earlier with the default build type not being set. Thanks for fixing it.

JATothrim commented 1 year ago

I added more commits to this pull request: these try to reduce the allocations even more by requiring the Cube size to be known at construction. Cube is now reduced to an std::array-like object that has a fixed size set in the constructor.

I'm not too sure if the interface changes are okay, as there is no emplace_back() anymore?

Benchmark results would also be nice.

NailLegProcessorDivide commented 1 year ago

24t d027daa: 12.2s
2t d027daa: 56.9s

Looks like some improvement with fewer threads, but be aware this was run on a desktop with a bunch of random stuff running, using time for timing, so these aren't exactly the most accurate numbers you could get.

nsch0e commented 1 year ago

My idea was that the fewer memory allocations we do, the better multi-threading will scale

@JATothrim Since most of us are on 64-bit systems, pointers and size_t (8 bytes) are relatively big compared to an XYZ (3 bytes). Perhaps a fixed-size array of, say, 20 XYZs as a member of Cube, with only a uint8_t for the size, could dramatically reduce RAM use and cut down on heap allocations. --> It would limit us to N=20, but I think it will be a while before that is a problem... It could also be a compile option/define.
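
A minimal sketch of that layout, assuming XYZ is three int8_t coordinates and CUBES_MAX_N is a compile-time constant; member names are hypothetical:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct XYZ {
    int8_t data[3];
};

constexpr std::size_t CUBES_MAX_N = 20;  // compile-time cap, could become a -D define

// Fixed-capacity Cube: no heap allocation, 3*CUBES_MAX_N + 1 bytes of payload.
class Cube {
    std::array<XYZ, CUBES_MAX_N> points_{};
    uint8_t size_ = 0;                   // how many entries of points_ are in use
public:
    explicit Cube(uint8_t n) : size_(n) {}
    uint8_t size() const { return size_; }
    XYZ* begin() { return points_.data(); }
    XYZ* end() { return points_.data() + size_; }
    const XYZ* begin() const { return points_.data(); }
    const XYZ* end() const { return points_.data() + size_; }
};
```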

NailLegProcessorDivide commented 1 year ago

@nsch0e The -m point-list implementation in the rust version, which is still the fastest and lowest-memory implementation as far as I'm aware, takes this idea slightly further and stores a fixed array of 16 points, with 5-bit x, y, z packed into uint16_ts, which saves a further 50% of the memory at the cost of having a maximum coordinate of 31. The length is then also implicit from the program state and which hashset a cube is in, rather than needing to be stored. I agree that N=20 for your example and 16 for mine is a bit of a concern, but outside of making a custom hashset implementation that is dynamically sized at run time, the best I've seen so far is N=14, and that still took quite a lot of RAM.
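
Not the actual rust code, just a C++ sketch of the packing scheme described; the field order and helper names are assumptions:

```cpp
#include <cstdint>

// Pack x, y, z (each limited to 0..31) into 15 bits of a uint16_t.
constexpr uint16_t packXYZ(uint8_t x, uint8_t y, uint8_t z) {
    return uint16_t(((x & 0x1F) << 10) | ((y & 0x1F) << 5) | (z & 0x1F));
}

constexpr uint8_t unpackX(uint16_t p) { return (p >> 10) & 0x1F; }
constexpr uint8_t unpackY(uint16_t p) { return (p >> 5) & 0x1F; }
constexpr uint8_t unpackZ(uint16_t p) { return p & 0x1F; }

// A 16-point polycube then fits in a fixed 32-byte array; the actual point
// count is implied by which hashset the cube lives in rather than stored here.
using PackedCube = uint16_t[16];
```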

JATothrim commented 1 year ago

My idea was that the fewer memory allocations we do, the better multi-threading will scale

@JATothrim Since most of us are on 64-bit systems, pointers and size_t (8 bytes) are relatively big compared to an XYZ (3 bytes). Perhaps a fixed-size array of, say, 20 XYZs as a member of Cube, with only a uint8_t for the size, could dramatically reduce RAM use and cut down on heap allocations. --> It would limit us to N=20, but I think it will be a while before that is a problem... It could also be a compile option/define.

For multi-threading purposes the amount of memory allocated doesn't matter too much, but the number of malloc/free operations does, as the global heap will always be a contention point. Viewed from this angle, the XYZSet in expand() is horrible: it does an allocation for every XYZ inserted.

I once had the idea to hoist candidates, newCube, lowestHashCube, and rotatedCube out of expand() into their own structure that is re-used for each expand() call. This would pretty much eliminate all allocations in expand(), except for Hashy::insert().
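
A rough sketch of that idea, using stand-in types so it is self-contained; the real expand() signature and Cube type will differ:

```cpp
#include <cstdint>
#include <vector>

struct XYZ { int8_t data[3]; };     // stand-in for the project's XYZ
using Cube = std::vector<XYZ>;      // stand-in for the project's Cube

// Scratch state hoisted out of expand(): one instance per worker thread,
// reset and reused on every call, so the hot loop stops hitting the global
// heap once the vectors have grown to their working capacity.
struct ExpandScratch {
    std::vector<XYZ> candidates;
    Cube newCube, lowestHashCube, rotatedCube;

    void reset() {
        candidates.clear();         // clear() keeps capacity: no free/malloc
        newCube.clear();
        lowestHashCube.clear();
        rotatedCube.clear();
    }
};
```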

Using std::array<XYZ, CUBES_MAX_N> and a uint8_t size for Cube makes sense. This would both get rid of a layer of indirection and reduce memory use (when N is near CUBES_MAX_N). However, the current loop structure relies on Cubes being cheap to move, so I'll have to test this.

Should CUBES_MAX_N be configurable with cmake?

For single-core performance, would somebody mind trying to write a fast (radix?) sort for XYZ arrays? About 30-40% of the runtime (that I see in the perf tool) is spent sorting the Cube contents, so this is a bottleneck.
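
In case it helps, a sketch of an LSD radix sort over the packed 24-bit keys that XYZ already converts to. The function name is hypothetical, it sorts the extracted keys rather than the XYZ objects themselves, and for the very short per-cube arrays an insertion sort might still win, so it would need benchmarking against std::sort:

```cpp
#include <array>
#include <cstdint>
#include <vector>

// LSD radix sort: three stable counting passes, one per byte of the 24-bit key.
void radixSortKeys(std::vector<uint32_t>& keys) {
    std::vector<uint32_t> tmp(keys.size());
    for (int shift = 0; shift < 24; shift += 8) {
        std::array<uint32_t, 257> count{};   // count[b + 1] = number of keys with byte value b
        for (uint32_t k : keys) {
            ++count[((k >> shift) & 0xFFu) + 1];
        }
        for (int i = 0; i < 256; ++i) {
            count[i + 1] += count[i];        // prefix sums: count[b] = first output slot for byte b
        }
        for (uint32_t k : keys) {
            tmp[count[(k >> shift) & 0xFFu]++] = k;
        }
        keys.swap(tmp);
    }
}
```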

nsch0e commented 1 year ago

Should CUBES_MAX_N be configurable with cmake?

I think that would be a good idea, with a sensible default set (~20).

About 30-40% of the runtime (that I see in the perf tool) is spent sorting the Cube contents, so this is a bottleneck.

The sort could be slow because of the conversion to uint32_t for each XYZ compare... (I haven't done any profiling.)

JATothrim commented 1 year ago

@nsch0e Quick question: are the base cubes that are passed into expand() already guaranteed to be sorted? I'm making a patch that replaces the XYZSet with std::vector<XYZ> and some STL algorithms.
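
A minimal sketch of the kind of STL-based candidate generation meant here, assuming both inputs are sorted ranges and the element type is ordered by operator<; the function and parameter names are hypothetical:

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Keep every neighbour point that is not already part of the cube.
// std::set_difference does this in one linear pass over two sorted ranges,
// with no per-insert heap allocation like a hash set would need.
template <typename XYZT>
std::vector<XYZT> makeCandidates(const std::vector<XYZT>& neighbours,  // sorted, deduplicated
                                 const std::vector<XYZT>& cubePoints)  // sorted
{
    std::vector<XYZT> candidates;
    candidates.reserve(neighbours.size());
    std::set_difference(neighbours.begin(), neighbours.end(),
                        cubePoints.begin(), cubePoints.end(),
                        std::back_inserter(candidates));
    return candidates;
}
```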

nsch0e commented 1 year ago

@JATothrim They are expected to be sorted before entering Hashy, so they have to be sorted in the cache.

Answer: yes, they are sorted when passed to expand()!

nsch0e commented 1 year ago

@JATothrim Be aware that candidates have a special case where coordinates can be negative. In the uint32_t representation, negative numbers are cast to uint8_t and so become big values; for example, (0,0,0) < (-1,0,0). I think this should be fixed, as it is unintuitive.

Perhaps candidate creation can be optimized so that most points are already sorted.

JATothrim commented 1 year ago

@JATothrim Be aware that candidates have a special case where coordinates can be negative. In the uint32_t representation, negative numbers are cast to uint8_t and so become big values; for example, (0,0,0) < (-1,0,0). I think this should be fixed, as it is unintuitive.

Perhaps candidate creation can be optimized so that most points are already sorted.

I just looked, and yes, XYZ(-1,0,0) < XYZ(0,0,0) is false. However, the code above should not care, as long as the sorting criterion is the same for both input ranges to std::set_difference(). The only difference should be that the final candidates are processed in a different order than with XYZSet.

The XYZ::operator< should be fixed though.

EDIT: I verified that the new code indeed generates exactly the same set of candidates: I compared the sets produced by the old code and the new code, both ways.

JATothrim commented 1 year ago

@nsch0e I tested changing the XYZ comparison to std::tie(data[0], data[1], data[2]) < std::tie(b.data[0], b.data[1], b.data[2]), so that XYZ(-1,0,0) < XYZ(0,0,0) is true. It seems to work, but it is slower. Using constexpr operator uint32_t() const { return (uint8_t(data[0] + 128) << 16) | (uint8_t(data[1] + 128) << 8) | uint8_t(data[2] + 128); } instead also works and is slightly better, but still slower. So, considering performance, I would not "fix" this. It's a feature. :wink:
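
For clarity, the two variants tested above, reconstructed as a standalone sketch; the real XYZ class has more members, and only the comparison and the packed-key conversion are shown:

```cpp
#include <cstdint>
#include <tuple>

struct XYZ {
    int8_t data[3];

    // Variant 1: signed-aware lexicographic compare via std::tie, which makes
    // XYZ{-1,0,0} < XYZ{0,0,0} true -- but it benchmarked slower.
    bool operator<(const XYZ& b) const {
        return std::tie(data[0], data[1], data[2]) <
               std::tie(b.data[0], b.data[1], b.data[2]);
    }

    // Variant 2: bias each coordinate by 128 before packing, so the unsigned
    // key order agrees with signed coordinate order -- slightly better, but
    // per the benchmarks above still slower than the existing conversion.
    constexpr operator uint32_t() const {
        return (uint8_t(data[0] + 128) << 16) |
               (uint8_t(data[1] + 128) << 8) |
               uint8_t(data[2] + 128);
    }
};
```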

@bertie2 @nsch0e If all the patches here are okay so far, I'm good with merging them.

I need to fiddle with CMakeLists and my work branch is now too old for that.

nsch0e commented 1 year ago

@bertie2 Please merge this. The improvements are good, and the "comparison feature" doesn't matter because negative coordinates only appear in candidates.

bertie2 commented 1 year ago

Happy with this and merged. The time to N=13 is now down to 3 minutes, at least on my machine, though the threads are getting more and more spread out in completion time: the fastest thread finishes in 86 seconds, the slowest in 152.

JATothrim commented 1 year ago

Happy with this and merged. The time to N=13 is now down to 3 minutes, at least on my machine, though the threads are getting more and more spread out in completion time: the fastest thread finishes in 86 seconds, the slowest in 152.

Cool, I'm actually amazed that this trimmed nearly 40s off where I started:

cubes -n 12 -t 2  297.87s user 2.04s system 180% cpu 2:46.07 total

down to:

cubes -n 12 -t 2  213.54s user 1.90s system 180% cpu 1:59.52 total

I think the threads getting more spread out could be mitigated by just scheduling the work in smaller pieces: when one chunk is completed, the thread would grab the next one?
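
A sketch of that scheduling scheme with an atomic cursor; the names and the callback shape are hypothetical, and the real worker loop would call expand() over the given range of base cubes:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Instead of splitting the range into one fixed slice per thread, workers pull
// small chunks from a shared atomic cursor until the range is exhausted, so a
// slow chunk no longer leaves the other threads idle at the end of the run.
void runChunked(std::size_t totalWork, unsigned numThreads, std::size_t chunkSize,
                const std::function<void(std::size_t, std::size_t)>& processRange) {
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&] {
            for (;;) {
                std::size_t begin = next.fetch_add(chunkSize);
                if (begin >= totalWork) break;
                processRange(begin, std::min(begin + chunkSize, totalWork));
            }
        });
    }
    for (auto& w : workers) w.join();
}
```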

nsch0e commented 1 year ago

I think the threads getting more spread out could be mitigated by just scheduling the work in smaller pieces: when one chunk is completed, the thread would grab the next one?

@JATothrim This is exactly what I'm doing in #26. Please check it out, since memory consumption is massively reduced by only having one shape at a time in hashy.

The time to N=13 is now down to 3 minutes, at least on my machine, though the threads are getting more and more spread out in completion time.

@bertie2 I think unfortunately we have to trade speed for memory next.

bertie2 commented 1 year ago

I think the threads getting more spread out could be mitigated by just scheduling the work in smaller pieces: when one chunk is completed, the thread would grab the next one?

@JATothrim This is exactly what I'm doing in #26. Please check it out, since memory consumption is massively reduced by only having one shape at a time in hashy.

The time to N=13 is now down to 3 minutes, at least on my machine, though the threads are getting more and more spread out in completion time.

@bertie2 I think unfortunately we have to trade speed for memory next.

Fully agree. I have my own branch for doing the set matching in files; I will look at yours first though, as mine is a bit jank.