viblo / pymunk

Pymunk is an easy-to-use pythonic 2D physics library that can be used whenever you need 2D rigid body physics from Python.
http://www.pymunk.org
MIT License

Chipmunk has multithreading on windows - why does pymunk not? #125

Closed Berserker66 closed 6 years ago

Berserker66 commented 6 years ago

So, with Chipmunk 7 all references to multithreading not working on Windows were removed from the documentation; it proudly claims to multithread without an asterisk for platforms, and here we can see they patched in pthread support for Windows: https://github.com/slembcke/Chipmunk2D/blob/master/src/cpHastySpace.c

Is there a reason why pymunk does not support this?

viblo commented 6 years ago

Yes. The problem is getting the threading to work with both the Chipmunk source and the compiler used to build it on Windows at the same time. Mingwpy is what's used to compile on Windows, and it uses win32 threads.

If you want to experiment, it's easy to enable: update space.py here: https://github.com/viblo/pymunk/blob/master/pymunk/space.py#L64 and modify setup.py here: https://github.com/viblo/pymunk/blob/15e60fe558a43dff21cc77f6313f9a9cf4eaab3e/setup.py#L118
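If it worked, usage could look something like this. This is only a sketch: the threaded flag and threads attribute here are assumptions about how it might be exposed, not anything pymunk has today.

    import pymunk

    # Hypothetical opt-in to the Chipmunk cpHastySpace solver.
    space = pymunk.Space(threaded=True)  # assumed flag, not current API
    space.threads = 2                    # assumed worker-thread count
    space.step(1 / 60.0)                 # the solver would then use 2 threads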

Berserker66 commented 6 years ago

Mh. I have other tasks that need doing. So, for now, just some questions: in my case, the compiler is Visual Studio 2017 for Python 3.6.3. If I used that instead of mingwpy, would that work?

If I have a Python thread calling space.step() every now and then, can pymunk run in parallel with other CPU tasks? I realize this would still limit pymunk to a single thread, but would it have its own separate thread and release the GIL while in the step function?

viblo commented 6 years ago

It won't work directly. There are compiler options specific to gcc used in setup.py to compile Chipmunk, so those would need changing. There might also be other issues in getting Chipmunk to compile properly and work well with CFFI. When I tried previously, I gave up after a while and went back to just using GCC.

Pymunk uses CFFI to call into the C code. I believe CFFI will release the GIL while it's running the C code, so your thread idea should work (but I haven't tested it myself).
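Something like this untested sketch is what the thread idea could look like, assuming the GIL really is released inside space.step:

    import threading
    import pymunk

    def physics_loop(space, stop_event, dt=1 / 60.0):
        # space.step is a single CFFI call into Chipmunk; if CFFI drops
        # the GIL for its duration, this thread runs in parallel with
        # other CPU-bound Python threads.
        while not stop_event.is_set():
            space.step(dt)

    space = pymunk.Space()
    stop = threading.Event()
    threading.Thread(target=physics_loop, args=(space, stop), daemon=True).start()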

Berserker66 commented 6 years ago

Well, it was more disappointing than I expected. Admittedly not because of cpHastySpace, but because of cffi.

My terrain consists of lots of blocks stored in chunks. For this test I use the same seed and terrain size, and I generate Segments and attach them to one Body per chunk.

I then have a test where I ramp up the number of threads used, to see how well or badly something scales with threading.

Looks a bit like this:

        # World, args and timer.TimeItPrint are defined elsewhere in the project.
        for x in range(1, 16):
            testworld = World()  # replacing the reference lets the old world get GCed
            func = lambda pair: testworld.make_chunk_CP(pair)
            pool = futures.ThreadPoolExecutor(x)
            with timer.TimeItPrint("worldgen with collision-multithreaded {} threads".format(x)):
                pool.map(func, args)  # map submits all the work eagerly
                pool.shutdown()       # waits for the submitted work to finish

(This is for 1024x1024 tiles, with 16x16 tiles per chunk, i.e. a 64 by 64 chunk area.) The basic world generation is done in partly GIL-less Cython, which runs pretty quickly. Here are the times for reference:

worldgen with collision-multithreaded 1 threads took 0.4260915377573183 seconds.
worldgen with collision-multithreaded 2 threads took 0.2746772413268696 seconds.
worldgen with collision-multithreaded 3 threads took 0.254787896270646 seconds.
worldgen with collision-multithreaded 4 threads took 0.2829108840265461 seconds.
worldgen with collision-multithreaded 5 threads took 0.27637665407596534 seconds.
worldgen with collision-multithreaded 6 threads took 0.2864180350177996 seconds.
worldgen with collision-multithreaded 7 threads took 0.3145630499630969 seconds.
worldgen with collision-multithreaded 8 threads took 0.31878291631538813 seconds.
worldgen with collision-multithreaded 9 threads took 0.3789126172344348 seconds.
worldgen with collision-multithreaded 10 threads took 0.33105417300576967 seconds.
worldgen with collision-multithreaded 11 threads took 0.3531706781296564 seconds.
worldgen with collision-multithreaded 12 threads took 0.290472244836133 seconds.
worldgen with collision-multithreaded 13 threads took 0.3963077602693472 seconds.
worldgen with collision-multithreaded 14 threads took 0.3172220032914055 seconds.
worldgen with collision-multithreaded 15 threads took 0.36971913221322517 seconds.

This is on an 8-core, 16-thread CPU; obviously the GIL catches up with me long before that, but it is fast enough. I was expecting the addition of pymunk Segments to add a flat cost to the task, which would have been fine. Instead, this is the result:

worldgen with collision-multithreaded 1 threads took 1.8592703991387067 seconds.
worldgen with collision-multithreaded 2 threads took 2.5476279130957975 seconds.
worldgen with collision-multithreaded 3 threads took 3.868741788837104 seconds.
worldgen with collision-multithreaded 4 threads took 3.841760292279682 seconds.
worldgen with collision-multithreaded 5 threads took 3.8905456889169443 seconds.
worldgen with collision-multithreaded 6 threads took 3.8522247159548684 seconds.
worldgen with collision-multithreaded 7 threads took 3.800815368098476 seconds.
worldgen with collision-multithreaded 8 threads took 3.8018919846980452 seconds.
worldgen with collision-multithreaded 9 threads took 3.917598396662008 seconds.
worldgen with collision-multithreaded 10 threads took 4.474359144572588 seconds.
worldgen with collision-multithreaded 11 threads took 4.241469593728105 seconds.
worldgen with collision-multithreaded 12 threads took 4.083550638094479 seconds.
worldgen with collision-multithreaded 13 threads took 4.435923811271046 seconds.
worldgen with collision-multithreaded 14 threads took 4.616944158962681 seconds.
worldgen with collision-multithreaded 15 threads took 4.549076879718875 seconds.

Profiling into it a bit, I see that each pymunk.Segment causes five (!) calls to _thread._acquire(), presumably on the GIL, though my profiler can't confirm that. Digging further, it seems that ffi.gc, cpSegmentShapeNew, cpShapeSetUserData, ffi.cast and cpShapeFree each acquire the GIL and then release it, causing significant overhead as well as the occasional wait, accounting for up to 90% of CPU time spent. Another 5-10% is spent in weak sets, which seems to point at body._shapes.add(). Admittedly, most of that appears to be cleanup of the weak refs, so I am profiling the GC here too, but chunks will get GCed in the actual program, so measuring that is acceptable.
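A minimal way to see the per-Segment overhead on its own, with nothing from my project involved; this just profiles raw Segment construction:

    import cProfile
    import pymunk

    body = pymunk.Body(1, 1)  # mass 1, moment 1; never added to a space
    # Each Segment construction shows up as several lock acquisitions
    # in the profile, one per CFFI call it makes internally.
    cProfile.run(
        "for _ in range(10000): pymunk.Segment(body, (0, 0), (1, 0), 1)",
        sort="tottime",
    )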

I realize at this point that Cymunk (https://github.com/kivy/cymunk) might be a good option for me, but that project doesn't compile for me. So, yeah, I guess I'm just disappointed and may have to bake my own solution :(

viblo commented 6 years ago

Have you tried using PyPy? In my own limited tests it is much faster with CFFI than CPython. (But if you use Cython, maybe it's not possible, or too slow.)

Berserker66 commented 6 years ago

My project targets Python 3 on Windows as the main platform, for which there is no PyPy at this point.

In addition, the JIT compilation could be a problem for my project, as it may choke at an inopportune moment in a heavily multithreaded environment; one of the problems they have not fixed yet and may never be able to, due to the architecture.

As such, for now the best options are Cython, RPython and writing my own C extensions. I doubt I need to tell you how much work C extensions are, so those fall out, as I'm the only developer on this project. RPython might be a nice alternative: it is part of the PyPy translation toolchain and allows a module to be compiled in advance rather than JIT'ed, but again, no Python 3 support on Windows, and also very limited documentation at this point.

Which leaves me with Cython, which does the job rather well most of the time. It can also interact with cffi, in that it uses cffi just like regular Python would: it drops back into regular interpreted Python, with the GIL and without the speed advantage, but it works without cons compared to CPython, just also without the pros.

If you'd like, I can redo the benchmark as a smaller, self-contained, CPython-only program, and I can pretty much guarantee the same multithreading issue will crop up, as it is on cffi's side, not Cython's.

Also, Cython and PyPy are compatible, but alas, again no Windows with Python 3.

Berserker66 commented 6 years ago

Here are the results of the pure-CPython idea. This is the test script I made:

    import random
    import time
    from concurrent import futures

    import pymunk

    n = 100000
    seed = 0
    max_threads = 10

    for t in range(1, max_threads):
        random.seed(seed)
        body = pymunk.Body()
        tasks = [
            (body,
             (random.randint(0, 1000), random.randint(0, 1000)),
             (random.randint(0, 1000), random.randint(0, 1000)),
             0)
            for _ in range(n)]
        pool = futures.ThreadPoolExecutor(t)
        start = time.perf_counter()
        # Segment takes (body, a, b, radius) as separate arguments,
        # so each task tuple has to be unpacked.
        pool.map(lambda args: pymunk.Segment(*args), tasks)
        pool.shutdown()
        print(time.perf_counter() - start, "seconds for {} threads".format(t))

and running this on regular old CPython, you can see this:

5.3273119647468405 seconds for 1 threads
4.5902601803332805 seconds for 2 threads
5.302481589192313 seconds for 3 threads
6.0498526592248965 seconds for 4 threads
6.017503770270295 seconds for 5 threads
6.188009303319223 seconds for 6 threads
6.598520074614839 seconds for 7 threads
6.984501005707152 seconds for 8 threads
6.6542618387090755 seconds for 9 threads

As you can see, the thread-lock overhead is not unique to Cython. Cython just emphasizes it more because it has less overall overhead, so the cffi threading overhead is larger relative to the total time spent.

What would really be needed is some way to say "with cffi.gil" to wrap a section of code that makes multiple cffi calls, in order NOT to release and re-acquire the GIL several times in a short function call. Cython has this: both "with gil" and "with nogil". Obviously cffi could never implement "with nogil", as it needs the GIL to handle Python arguments and return values, but a "with gil" might be nice.
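Purely hypothetical, but this is the shape of the feature I mean; cffi has no hold_gil (or anything like it) today, and the names below are placeholders:

    # Hypothetical API sketch -- not real cffi. The idea is one
    # acquire/release around the whole block instead of one around
    # each individual C call.
    with ffi.hold_gil():
        seg = lib.cpSegmentShapeNew(body_ptr, a, b, radius)
        lib.cpShapeSetUserData(seg, userdata)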

I realize at this point that this is more a cffi problem and less something pymunk can reasonably fix. Should I open a ticket with them and link it here?

viblo commented 6 years ago

There is a thread about a with cffi.gil feature here: https://bitbucket.org/cffi/cffi/issues/220/some-way-to-annotate-a-call-to-ask-it-not (maybe you found it already). It seems Armin is not super keen on adding such a feature to cffi, but maybe it's worth opening a new issue, since you have some good arguments for it.

viblo commented 6 years ago

Thinking about your problem a bit more, could an object pool of Segments help? I don't know your case exactly, but if it's only the position that needs updating, maybe you can save a bit.
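Something like this sketch is what I had in mind, reusing shapes via Segment.unsafe_set_endpoints instead of constructing new ones (with the usual caveats the unsafe_ prefix implies):

    import pymunk

    class SegmentPool:
        # Sketch: assumes all pooled segments can share one body and
        # only their endpoints change between uses.
        def __init__(self, body, size, radius=1):
            self._free = [pymunk.Segment(body, (0, 0), (0, 0), radius)
                          for _ in range(size)]

        def acquire(self, a, b):
            seg = self._free.pop()  # raises IndexError when exhausted
            seg.unsafe_set_endpoints(a, b)
            return seg

        def release(self, seg):
            self._free.append(seg)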

I tried running your code myself, but I get very varying results between runs and thread counts, so I couldn't draw any conclusions from my quick test. Maybe it's because I'm on a laptop now, with only two cores plus hyperthreading, and running on battery.

Berserker66 commented 6 years ago

The problem gets more pronounced the more cores you actually have, as that creates more threads fighting for the lock and missing more often than not. So yeah, on a 2-core machine you are not likely to see it well.

As for the pool, I don't think it is going to work. I have an infinite 2D tile world; similar examples would be Terraria or Starbound. The player creates new segments all the time as they move, and old segments get GCed as their chunks leave the relevant area and get saved to disk. I could implement some kind of pool to recycle segments, but as the body they're attached to changes anyway, I don't see any performance to be gained here.

What I was considering is redirecting all the pymunk creation to a single thread through a queue, but it turns out that putting the data into a queue and then retrieving it also adds significant overhead, at least using Python's queue module. I might be able to negate the overhead by using C++ vectors through Cython, but in any event it is likely to get ugly.
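For reference, the shape of the queue idea I tried looks like this (a sketch; the real version would batch all segments of a chunk into one queue item):

    import queue
    import threading
    import pymunk

    creation_queue = queue.Queue()

    def creation_worker():
        # One thread owns all pymunk object creation, so the GIL
        # ping-pong happens in a single place instead of in every worker.
        while True:
            item = creation_queue.get()
            if item is None:  # sentinel: shut down
                break
            chunk, segments = item
            for x1, y1, x2, y2 in segments:
                chunk.shapes.append(
                    pymunk.Segment(chunk.body, (x1, y1), (x2, y2), 1))
            creation_queue.task_done()

    threading.Thread(target=creation_worker, daemon=True).start()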

viblo commented 6 years ago

Ah, I assumed that a chunk would have its own body. Then you could move the chunk off screen without removing it from the space, so that you don't need to call remove on each segment and only move the body once. Then on create you would take the body, put it in position, and adjust the positions of all the segments in the chunk. But yeah, it all depends on how you use them.
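Roughly like this sketch; note that if the chunk bodies are static, the space would also need a reindex after moving them:

    # Park a chunk far away instead of removing all its segments.
    PARK = (10_000_000, 10_000_000)  # arbitrary point far outside the play area

    def park_chunk(space, chunk):
        chunk.body.position = PARK
        space.reindex_shapes_for_body(chunk.body)

    def place_chunk(space, chunk, position):
        chunk.body.position = position
        space.reindex_shapes_for_body(chunk.body)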

Berserker66 commented 6 years ago

The chunk does have its own body. I mean, I might also be doing things suboptimally anyway, given that this is my first time with Pymunk.

This is a region of my terrain. To reiterate:

16x16 world units = 1 tile
16x16 tiles = 1 chunk
16x16 chunks = 1 region

This is from a debug view created with PIL, not what the game actually looks like; there is no Vulkan debug layer for pymunk, so I had to make my own solution. https://images-eu.curseapp.net/3bb69e17-4255-478d-86d0-fd7455a3d67e/collision_mask.png

The green stuff is solid terrain tiles; the greyscale is air, colored differently per chunk to make the chunks easier to identify. All the black lines are segments, generated like so (a big heap of Cython here, with quite a bit missing, so it's not quite as massive as it looks):

    cdef void _create_geometry(self, WrappedSegmentVector wrappedvector) nogil:
        cdef int i,x,y, cur, right, bottom, bottomright, neighbourmask
        right = bottom = bottomright = cur = 0
        for i in range(chunklength):
            if self.dataview[i,0]:
                wrappedvector.v.push_back(_make_segment(i, 0, i+1, 0))
            if self.dataview[i,15]:
                wrappedvector.v.push_back(_make_segment(i, 16, i+1, 16))
            if self.dataview[0,i]:
                wrappedvector.v.push_back(_make_segment(0, i, 0, i+1))
            if self.dataview[15,i]:
                wrappedvector.v.push_back(_make_segment(16, i, 16, i+1))

        for y in range(chunklength-1):
            for x in range(chunklength-1):
                cur = self.dataview[x,y]!=0
                right = self.dataview[x+1,y]!=0
                bottom = self.dataview[x,y+1]!=0
                bottomright = self.dataview[x+1,y+1]!=0
                neighbourmask = cur<<0 | right<<1 | bottom<<2 | bottomright<<3

                if neighbourmask == 0x0:
                    pass
                elif neighbourmask == 0xF:
                    pass
                elif neighbourmask == 0x1:#just topleft
                    wrappedvector.v.push_back(_make_segment(x+1, y  , x+1, y+1))#topmiddle to middle
                    wrappedvector.v.push_back(_make_segment(x+1, y+1, x  , y+1))#middle to leftmiddle
                elif neighbourmask == 0x2:#just topright
                    wrappedvector.v.push_back(_make_segment(x+1, y  , x+1, y+1))#topmiddle to middle
                    wrappedvector.v.push_back(_make_segment(x+1, y+1, x+2, y+1))#middle to rightmiddle
                elif neighbourmask == 0x3:#top side
                    wrappedvector.v.push_back(_make_segment(x  , y+1, x+2, y+1))#horizontal middle
                elif neighbourmask == 0x4:#just bottomleft
                    wrappedvector.v.push_back(_make_segment(x+1, y+2, x+1, y+1))#bottommiddle to middle
                    wrappedvector.v.push_back(_make_segment(x+1, y+1, x  , y+1))#middle to leftmiddle
                elif neighbourmask == 0x5:#left side
                    wrappedvector.v.push_back(_make_segment(x+1, y  , x+1, y+2))#vertical middle
                elif neighbourmask == 0x6:#topright and bottomleft
                    wrappedvector.v.push_back(_make_segment(x+1, y  , x+1, y+2))#vertical middle
                    wrappedvector.v.push_back(_make_segment(x  , y+1, x+2, y+1))#horizontal middle
                elif neighbourmask == 0x7:#not bottomright
                    wrappedvector.v.push_back(_make_segment(x+1, y+2, x+1, y+1))#bottommiddle to middle
                    wrappedvector.v.push_back(_make_segment(x+1, y+1, x+2, y+1))#middle to rightmiddle
                elif neighbourmask == 0x8:#just bottomright
                    wrappedvector.v.push_back(_make_segment(x+1, y+2, x+1, y+1))#bottommiddle to middle
                    wrappedvector.v.push_back(_make_segment(x+1, y+1, x+2, y+1))#middle to rightmiddle
                elif neighbourmask == 0x9:#topleft and bottomright
                    wrappedvector.v.push_back(_make_segment(x+1, y  , x+1, y+2))#vertical middle
                    wrappedvector.v.push_back(_make_segment(x  , y+1, x+2, y+1))#horizontal middle
                elif neighbourmask == 0xa:#right side
                    wrappedvector.v.push_back(_make_segment(x+1, y  , x+1, y+2))#vertical middle
                elif neighbourmask == 0xb:#not bottomleft
                    wrappedvector.v.push_back(_make_segment(x+1, y+2, x+1, y+1))#bottommiddle to middle
                    wrappedvector.v.push_back(_make_segment(x+1, y+1, x  , y+1))#middle to leftmiddle
                elif neighbourmask == 0xc:#bottom side
                    wrappedvector.v.push_back(_make_segment(x  , y+1, x+2, y+1))#horizontal middle
                elif neighbourmask == 0xd:#not topright
                    wrappedvector.v.push_back(_make_segment(x+1, y  , x+1, y+1))#topmiddle to middle
                    wrappedvector.v.push_back(_make_segment(x+1, y+1, x+2, y+1))#middle to rightmiddle
                elif neighbourmask == 0xe:#not topleft
                    wrappedvector.v.push_back(_make_segment(x+1, y  , x+1, y+1))#topmiddle to middle
                    wrappedvector.v.push_back(_make_segment(x+1, y+1, x  , y+1))#middle to leftmiddle

cdef inline Segment _make_segment(int x1, int y1, int x2, int y2) nogil:
    cdef Segment seg
    seg.x1 = x1*16
    seg.y1 = y1*16
    seg.x2 = x2*16
    seg.y2 = y2*16
    return seg

cdef class WorldGenerator:
    def __init__(self, world):
        self.world = world
        self.noise = SimplexContext(self.world.seed)

    def make_chunk(self, int x, int y):
        cdef unsigned int i
        cdef Segment segment
        cdef WrappedSegmentVector vec = WrappedSegmentVector()
        cdef vector[Segment] segments
        cdef Chunk new = Chunk()
        with nogil:
            generate_default(self, new, x, y)
            new._create_geometry(vec)
        for segment in vec.v:
            new.shapes.append(pymunk.Segment(new.body, (segment.x1, segment.y1), (segment.x2, segment.y2), 1))
        self.world.chunks[(x,y)] = new
        return new

So, in a nutshell, I create segments as a C++ vector of Segment structs (they're simple .x1, .y1, .x2, .y2 containers), and generate_default just fills a buffer with tile IDs based on simplex noise.
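The declarations missing from the excerpt look roughly like this (a sketch; the real code has a bit more to it):

    from libcpp.vector cimport vector

    cdef struct Segment:
        int x1
        int y1
        int x2
        int y2

    cdef class WrappedSegmentVector:
        cdef vector[Segment] v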

If I just comment out this section:

        for segment in vec.v:
            pymunk.Segment(new.body, (segment.x1, segment.y1), (segment.x2, segment.y2), 1)

the entire thing takes 0.3 seconds for a 1024x1024 tile area.

As soon as I include that section though, it goes to 1.7 seconds, and it only gets worse the more threads I throw at it.

Berserker66 commented 6 years ago

I've found a "good enough for now" solution. The chunk borders are always also segments, since the chunk itself doesn't know what might be in the neighboring chunk. However, I created a segment for each bordering tile, so a fully filled chunk would have 4x16 border segments.

I have now made an adjustment so that the border code looks for continuous runs of tiles and merges their segments. This brought the 1.7-second base time down to 0.62. It still behaves badly with more threads, but there are now far fewer calls into pymunk.Shape in the first place, so there are also fewer lock collisions.
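Conceptually the merge is just this (a simplified pure-Python version of the Cython border pass; merge_border_runs is a made-up name):

    def merge_border_runs(row):
        # Collapse continuous runs of solid border tiles into (start, end)
        # spans, so a fully solid edge becomes 1 segment instead of 16.
        spans = []
        start = None
        for i, solid in enumerate(row):
            if solid and start is None:
                start = i
            elif not solid and start is not None:
                spans.append((start, i))
                start = None
        if start is not None:
            spans.append((start, len(row)))
        return spans

For example, merge_border_runs([1, 1, 0, 1]) gives [(0, 2), (3, 4)], i.e. two segments instead of three.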

Running test for 1024 by 1024 tiles (64 by 64 chunks).
worldgen with collision-multithreaded 1 threads took 0.6368153709580959 seconds.
worldgen with collision-multithreaded 2 threads took 0.5071847690500029 seconds.
worldgen with collision-multithreaded 3 threads took 0.6551027218902059 seconds.
worldgen with collision-multithreaded 4 threads took 0.6497009439080486 seconds.
worldgen with collision-multithreaded 5 threads took 0.6800395522878953 seconds.
worldgen with collision-multithreaded 6 threads took 0.7151275733277331 seconds.
worldgen with collision-multithreaded 7 threads took 0.6900440965162722 seconds.
worldgen with collision-multithreaded 8 threads took 0.7416693672539161 seconds.
worldgen with collision-multithreaded 9 threads took 0.6880341957711513 seconds.
worldgen with collision-multithreaded 10 threads took 0.7619973965735429 seconds.
worldgen with collision-multithreaded 11 threads took 0.7593456914619798 seconds.
worldgen with collision-multithreaded 12 threads took 0.749971485419545 seconds.
worldgen with collision-multithreaded 13 threads took 0.7443998269210237 seconds.
worldgen with collision-multithreaded 14 threads took 0.7639333706285925 seconds.
worldgen with collision-multithreaded 15 threads took 0.7814419265114392 seconds.

I'm still not exactly happy with it, but I need to carry on and deal with other construction sites.

viblo commented 6 years ago

I tried a different C compiler I already had on my computer (TDM-GCC-64), and the threaded solver seemed to work fine on both 32-bit and 64-bit Pythons. Same thing with the conda GCC. So maybe it can be enabled for everyone on Windows without downsides, once the build scripts are updated to use something other than mingwpy.

Berserker66 commented 6 years ago

I'd certainly appreciate that. I already get a good 3-4 CPU threads of use when I disable my frame limiter, and if the physics can scale along with that, it would be nice. Admittedly we're talking ~1000 fps here, but I do need room to grow for my not-yet-implemented features ;) https://i.imgur.com/6aAQFiO.png

Berserker66 commented 6 years ago

Thank you!

viblo commented 6 years ago

I suspect you won't gain that much, but it's worth trying at least.

Since I was looking at the threaded space implementation anyway, I also did a couple of experiments with the NEON optimizations. They are ARM-only, but there is an automatic converter here: https://github.com/intel/ARM_NEON_2_x86_SSE that I tried. It only supports 32-bit floats, so I had to configure Chipmunk for that. However, the results were not very good. Using floats instead of doubles seems to work OK in Chipmunk in general, but with this SSE converter my simulation became very unstable. For now I think you would have to rewrite that part in SSE manually to gain anything. (I have zero experience with SSE.)

Berserker66 commented 6 years ago

Fair enough. It's a start, and if it improves upstream in the future, pymunk is now ready to benefit from it.

I need to use 64-bit floats (doubles) anyway, as stuff can happen pretty far from the origin.