I should note that I monkey-patched the unit tests now. During the tests, the fork used by multiprocessing.Process was the builtin one, which does not automatically execute reinit. I have not investigated why the patched fork in gevent works while the builtin plus a manual reinit in _child does not. Monkey patching makes the tests use gevent's fork, which does work properly.
I guess as a side effect, re-init allows copied greenlets to continue executing... so I'll see if I can add a unit test for that behavior and revert it.
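For context, a minimal illustration of the fork difference described above (an editorial sketch assuming gevent's documented monkey-patching behavior, not code from this PR): gevent's patched fork re-initializes the hub in the child, while the builtin os.fork does not.

```python
# Sketch: after patch_all(), os.fork is replaced by gevent's fork
# wrapper, which re-initializes the hub in the child process; the
# builtin os.fork does not. Not code from this PR.
from gevent import monkey
monkey.patch_all()  # replaces os.fork with gevent's fork, among other things

import multiprocessing


def child():
    # Runs with a freshly re-initialized gevent hub in the child.
    print("child alive")


if __name__ == "__main__":
    ctx = multiprocessing.get_context("fork")  # the start method gipc relies on
    p = ctx.Process(target=child)
    p.start()
    p.join()
```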
Edit: this has been fixed in the most recent commit. It seems that gevent, for whatever reason, will no longer raise LoopExit on the destroying thread. So I simply destroy the loop directly, which prevents anything from executing, then set the old hub to None, which forces the creation of a new hub and loop. This doesn't seem to have any worse side effects than the previous dirty hub destroy, as far as I can tell. It also removes the need to run the tests monkey-patched.
@TylerWilley thank you for all this energy investment here!
(I so much love to see this contribution! :heart:)
I am super positive about all this, and I see that you have already found out and understood quite a lot. I have very little bandwidth right now, so I'll just start asking some questions and raising some points; some of them might appear ignorant, but that's better than waiting N more days... so please bear with me.
> Python 3.8 changed the default multiprocessing start method from fork to spawn
Only on macOS, right?
> fork()

again (as before), but what do you think about https://bugs.python.org/issue33725 and its implications -- do you think that people using gipc on macOS are generally OK with switching this back to fork()? I have always thought of gipc's macOS support as being basically only for developer experience -- in prod, I suppose, people always run gipc on Linux. I suppose re-enabling gipc's dev experience on macOS may justify this pretty invasive change.

> I considered rewriting gipc to work with spawn instead of fork, however this is significantly more complicated.

Thanks for sharing, yeah, this is also why back then I simply created https://github.com/jgehrcke/gipc/issues/100 and left it at that.
> So I simply destroy the loop directly
:100: -- what we really want to make sure is that we ultimately call libev's `ev_loop_destroy` in the child, to guarantee (that is how I always understood it) that greenlets inherited from the parent will not run anymore.
If you have not seen it, https://metacpan.org/pod/distribution/EV/libev/ev.pod#ev_loop_fork-(loop) also has a nice piece of documentation.
Next up, I will go through code changes and ask some maybe stupid / ignorant questions from the top of my head : )
For Linux (the primary platform that gipc targets), the key change here actually seems to be using `hub.loop.destroy()` instead of `hub.destroy(destroy_loop=True)`. As I described before, there is a crash somewhere in `hub.destroy(destroy_loop=True)` when running gipc's tests on Linux.

The relevant code section in gevent explains via a code comment (thanks @jamadden) why there is value in using `hub.destroy(destroy_loop=True)` instead of `hub.loop.destroy()` directly:
```python
# Let the frame be cleaned up by causing the run() function to
# exit. This is the only way to guarantee that the hub itself
# and the main greenlet, if this was a secondary thread, get
# cleaned up. Otherwise there are likely to be reference
# cycles still around. We MUST do this before we destroy the
# loop; if we destroy the loop and then switch into the hub,
# things will go VERY, VERY wrong (because we will have destroyed
# the C datastructures in the middle of the C function that's
# using them; the best we can hope for is a segfault).
try:
    self.throw(HubDestroyed(destroy_loop))
except LoopExit:
    # Expected.
    pass
except GreenletError:
    # Must be coming from a different thread.
    # Note that python stack frames are likely to leak
    # in this case.
    pass
```
Now, if we really, really understand what is going on, we may want to deviate from this recommendation and resort to using `hub.loop.destroy()` directly again -- this seemingly unbreaks the tests on Linux.
Oh and just for your own curiosity: https://github.com/jgehrcke/gipc/issues/52
I have released gipc 1.2.0: https://github.com/jgehrcke/gipc/blob/master/CHANGELOG.rst#version-120-jun-3-2021 -- can you maybe base your work on the current state of the master branch, and address this from the perspective of re-enabling compatibility between gipc >= 1.2.0, gevent >= 21.x, and CPython >= 3.8 on macOS (platform support matrices are so much fun! : ))?
So, we have multiple development environments, as this runs from a core library that multiple applications use.
Most developers run our application on macOS, but it runs in production on Linux. I can run the application in a Linux Docker container on macOS and it works fine for me there as well (I've also deployed it to our staging env and it works there). Some parts of the application rely on the process space being copied instead of recreated clean, so the 3.8 change on macOS to use spawn breaks our dev environments.

If we want to call this an application concern and not a gipc concern, that's fine and we can just move that code into our application. It seems like it would be a concern for others, though, as it's a significant change in fork behavior. In development, we run with OBJC_DISABLE_INITIALIZE_FORK_SAFETY on macOS for that reason (to more closely mirror prod/Linux).
On loop destruction: I haven't traced through the C code to see what exactly it's doing, and I'm not as familiar with libuv. But it appears that previously _check_loop was returning 1, and now loop._ptr is not null anymore at that point in the forked process, i.e. something post-fork is now automatically handling that in the loop. This causes stuff to break as LoopExit gets thrown up the greenlet chain. Since we don't actually care about executing the loop or its greenlets, I just destroy it directly instead of trying to go through the "happy path", which tries to throw LoopExit on all the greenlets. Then I set the hub to None so the following reinit call creates a new hub and loop fresh. This does stop greenlets from executing across the fork, though.
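A minimal sketch of that cleanup sequence (not the exact PR code; `get_hub_if_exists` and `set_hub` are gevent-internal helpers from `gevent._hub_local`, assumed available in the gevent versions discussed here):

```python
# Hypothetical post-fork cleanup in the child, as described above.
import gevent
from gevent._hub_local import get_hub_if_exists, set_hub


def _post_fork_in_child():
    hub = get_hub_if_exists()
    if hub is not None:
        # Tear down the inherited C event loop directly, so greenlets
        # copied over from the parent can never be switched into again.
        hub.loop.destroy()
        # Drop the stale hub; gevent builds a fresh hub (and loop) the
        # next time one is needed.
        set_hub(None)
    # Re-initialize gevent state for the child process.
    gevent.reinit()
```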
One of the benefits of switching to spawn is that none of this matters anymore and it's always clean, but as I mentioned, that probably won't work for us :/ (well, it's possible, just with significant refactoring and testing for the behavior change).
So, new wrinkle: when testing spawn() on macOS, file descriptors aren't passed. ForkingPickler does the serialization of the process; in order to get a descriptor into that list, you need to call DupFd between these lines.
Now, short of subclassing and injecting a new Popen method, which I suppose we could do, the easiest way to do this is to register the _GIPCReader / _GIPCWriter / etc. classes with multiprocessing.reduction, as done here.

An alternative would be to switch from os.pipe() to multiprocessing.Pipe(), which is a larger change to gipc... but multiprocessing.Pipe() is already registered with reduction by default.
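A hypothetical sketch of the reduction-registration approach (the `_fd` attribute and the rebuild constructor are assumptions about gipc's internals, not its actual code; `DupFd` and `register` are real `multiprocessing.reduction` APIs):

```python
# Sketch: teach ForkingPickler how to (de)serialize a gipc reader so
# its fd survives the spawn start method.
from multiprocessing import reduction

from gipc.gipc import _GIPCReader  # private class, referenced for illustration


def _rebuild_reader(dupfd):
    # Reconstruct the reader around the duplicated descriptor in the child.
    return _GIPCReader(dupfd.detach())


def _reduce_reader(reader):
    # DupFd arranges for the raw fd to be transferred to the spawned child.
    return _rebuild_reader, (reduction.DupFd(reader._fd),)


reduction.register(_GIPCReader, _reduce_reader)
```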
Thanks @TylerWilley -- so much! Will get back to this soon; maybe still today, potentially next week. :rocket: Genuinely curious: what kind of product do you work on that uses gipc? Can also email me at jgehrcke@googlemail.com if that's better :). Learning of use cases is kind of the deepest satisfaction I can get from all this :).
@jgehrcke Did you have a preference on any of my questions on the PR thus far?
Specifically: using reduction as an import to handle passing the file descriptors for the pipes to the subprocess when spawn is used. I added a Windows check, so this should work for:
Thx :)
Sorry that it took a while to get back to you! I wanted to manually run CI on this here but GH actions is just a little bit complex and challenging. I think I changed things so that when you add a commit here next time that GH will actually run CI (maybe after I manually approve). Let's try that.
Thanks for all the hard work here. I love how the changes after all got simpler and simpler.
Can you also do a clean rebase against current master and squash commits a bit?
I clean rebased and force-pushed... seems that the workflow isn't running automatically though. No rush though :)
Thanks for the update! I approved the CI run. Next time I hope it runs automatically.
I am sure you can discover this for yourself, but it looks like you still need to make the linter happy : )
```
+ flake8 gipc/
gipc/gipc.py:1079:5: E303 too many blank lines (2)
gipc/gipc.py:1084:5: E303 too many blank lines (2)
gipc/gipc.py:1087:5: E303 too many blank lines (2)
gipc/gipc.py:1092:5: E303 too many blank lines (2)
gipc/gipc.py:1097:5: E303 too many blank lines (2)
Error: Process completed with exit code 1.
```
(https://github.com/jgehrcke/gipc/pull/110/checks?check_run_id=2846345188#step:8:19)
```
>       s = gevent.signal(signal.SIGTERM, signals_test_sigterm_handler)
E       TypeError: 'module' object is not callable
```
ah, that's probably why...
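For reference: gevent 1.5 turned `gevent.signal` into a module, which is what produces the TypeError above. A sketch of the presumable fix, using the documented `gevent.signal_handler` replacement (the handler body here is a placeholder for the test's real one):

```python
# gevent.signal_handler is the callable that gevent.signal() used to be.
import signal

import gevent


def signals_test_sigterm_handler():
    pass  # placeholder for the test's real handler


s = gevent.signal_handler(signal.SIGTERM, signals_test_sigterm_handler)
s.cancel()  # the returned watcher can still be cancelled as before
```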
Yo! Thanks, for real. Glad we got this over the finish line. Will look into making a release in the coming days!
Thanks for the nice collaboration.
No problem! It was great working with you, and thanks for all the help getting it perfected :)
This is now released as https://pypi.org/project/gipc/1.3.0/ :)
That's great! Thank you, I will update our dependencies ^_^
So for fixing unit tests:
Python 3.8 changed the default multiprocessing start method from `fork` to `spawn` (on macOS). Spawn does not copy any file descriptors or process state; it's a clean process... so most of the gipc issues came from that. This forces it back to using `fork` instead.

When using `spawn` mode, you need to explicitly set the inheritable flag on file descriptors for the subprocess to inherit them. I left this in as a reminder; it doesn't change anything when using `fork`.
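A minimal sketch of that inheritability point (illustrative, not gipc code; since PEP 446, Python 3.4+, pipe descriptors are non-inheritable by default):

```python
# Under spawn, a descriptor must be explicitly marked inheritable (and
# transferred to the child, e.g. via reduction); under fork this flag
# changes nothing, because the child receives copies of all fds anyway.
import os

r, w = os.pipe()
os.set_inheritable(r, True)
os.set_inheritable(w, True)
```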
Additionally, gevent has cleaned up its reinit, so it appears to work fine without having to do a dirty reinit on fork. When monkey-patched, reinit happens automatically, so I'm not sure whether gipc needs to keep explicitly re-init'ing, but it doesn't hurt anything to do it twice.
Other than that: some random relocations of modules and such. signal.signal() has always returned the prior signal handler, so rather than call .cancel() I just set the old handler back; the .cancel() call was tracing.
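A short sketch of that restore pattern (illustrative, not the PR's exact code):

```python
# signal.signal() returns the previously installed handler, so it can
# simply be put back instead of calling .cancel() on a gevent watcher.
import signal


def handler(signum, frame):
    print("temporary SIGTERM handler")


previous = signal.signal(signal.SIGTERM, handler)  # install, remember old
try:
    pass  # work that needs the temporary handler
finally:
    signal.signal(signal.SIGTERM, previous)  # restore the prior handler
```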
Other options: I considered rewriting gipc to work with `spawn` instead of `fork`, however this is significantly more complicated. As our application relies on globals being copied to the forked process, I did not do this, and it seems like it could be a dangerous / breaking change. A benefit of `spawn` is that there is no gevent hub or state to clean up, as it is not copied from the old process.

A possible improvement would be to update gipc to support both `fork` and `spawn` and let the user decide; a sketch of that idea follows.
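(Editorial sketch of the "let the user decide" idea -- the `start_method` parameter and helper name are illustrations, not gipc's actual API:)

```python
# Let the caller pick the multiprocessing start method explicitly,
# falling back to fork for backwards-compatible behavior.
import multiprocessing


def start_process(target, start_method="fork"):
    ctx = multiprocessing.get_context(start_method)  # "fork" or "spawn"
    p = ctx.Process(target=target)
    p.start()
    return p
```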