Closed: bitranox closed this issue 2 years ago
I don't have a Windows computer and I am a little bit hesitant about running a potential fork bomb on a computer that isn't mine. Looking into VMs.
In the meantime, can you send the output of the dumps line in
dill.detect.trace(True)
c = dill.dumps(ClassTest1.f)
dill.loads(c)
with both dill 0.3.4 and 0.3.5.1, and whether dumps or loads is causing the infinite loop. If dumps succeeds, can you share c as well?
Dear anivegesana,
Oh, it's not a fork bomb - the timeout is about 3 seconds in the original test, so after 4-5 forks (well, on Windows there is no real fork, therefore it is very slow ...) the timeout kicks in. I guess the dill dump will be difficult, because it happens in a subprocess - kind of difficult to get it out of there? But I will try to share the requested data for Windows/Linux/0.3.4/0.3.5.x ....
Thanks for your effort. Greetings from Vienna, Austria,
Robert
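Getting the payload out of a subprocess, as described above, can be done by writing it to a per-PID file. A minimal sketch, with the stdlib pickle standing in for dill (dump_for_sharing is a hypothetical helper, not part of any library discussed here):

```python
import os
import pickle  # stand-in for dill; the dumps call has the same shape


def dump_for_sharing(obj, directory="."):
    """Serialize obj and persist it to a file keyed by the current PID,
    so payloads produced inside short-lived subprocesses survive the
    process and can be attached to the issue."""
    payload = pickle.dumps(obj)
    path = os.path.join(directory, f"payload-{os.getpid()}.pkl")
    with open(path, "wb") as fh:
        fh.write(payload)
    return path


path = dump_for_sharing({"answer": 42})
```

Each process writes its own file, so even a chain of respawned children leaves one artifact per PID behind.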
Dear anivegesana,
it is as I thought - I cannot even pickle the decorated class method. And I usually don't do that - I just use multiprocess, which uses dill.
With dill 0.3.4 on Windows I can't pickle the function either, but multiprocess works as intended:
def test_dill():
    """
    >>> test_dill()
    """
    import dill
    dill.detect.trace(True)
    c = dill.dumps(ClassTest1.f)
    dill.loads(c)
test_dill()
Exception raised:
    Traceback (most recent call last):
      File "C:\Program Files\JetBrains\PyCharm 2022.1.1\plugins\python\helpers\pycharm\docrunner.py", line 138, in __run
        exec(compile(example.source, filename, "single",
      File "<doctest test_dill[0]>", line 1, in <module>
        test_dill()
      File "C:\opt/rotek-apps/lib/bitranox/wrapt_timeout_decorator/tests\wrapt_timeout_test_helper.py", line 35, in test_dill
        c = dill.dumps(ClassTest1.f)
      File "C:\opt\rotek-apps\lib\bitranox\wrapt_timeout_decorator\venv\lib\site-packages\dill\_dill.py", line 304, in dumps
        dump(obj, file, protocol, byref, fmode, recurse, **kwds)#, strictio)
      File "C:\opt\rotek-apps\lib\bitranox\wrapt_timeout_decorator\venv\lib\site-packages\dill\_dill.py", line 276, in dump
        Pickler(file, protocol, **_kwds).dump(obj)
      File "C:\opt\rotek-apps\lib\bitranox\wrapt_timeout_decorator\venv\lib\site-packages\dill\_dill.py", line 498, in dump
        StockPickler.dump(self, obj)
      File "C:\Program Files\Python310\lib\pickle.py", line 487, in dump
        self.save(obj)
      File "C:\Program Files\Python310\lib\pickle.py", line 578, in save
        rv = reduce(self.proto)
    NotImplementedError: object proxy must define __reduce_ex__()
Thanks for your effort. Greetings from Vienna, Austria,
Robert
@anivegesana: my build and test boxes are VMs (with Docker instances inside). FYI.
Dear anivegesana, here the next attempt to shed some light on the issue:
def test_timeout_class_method(use_signals: bool) -> None:
    """
    >>> test_timeout_class_method(False)
    """
    import dill.detect
    dill.detect.trace(True)
    assert ClassTest1().f(dec_timeout="instance.x", dec_allow_eval=True, use_signals=use_signals) is None
Under dill 0.3.4 it works. Under dill 0.3.5.1 there is the same output, but looping until timeout (of course with different object addresses in each loop):
D2: <dict object at 0x000002BE74FBF680>
T4: <class 'multiprocess.process.AuthenticationString'>
# T4
# D2
T4: <class 'multiprocess.context.Process'>
# T4
D2: <dict object at 0x000002BE74FD7480>
D2: <dict object at 0x000002BE72CE6140>
T4: <class 'multiprocess.process.AuthenticationString'>
# T4
# D2
F2: <function _target at 0x000002BE74FB84C0>
# F2
T4: <class 'wrapt_timeout_decorator.wrap_helper.WrapHelper'>
# T4
D2: <dict object at 0x000002BE74FBD600>
T1: <class 'TimeoutError'>
F2: <function _load_type at 0x000002BE74A243A0>
# F2
# T1
Me: <bound method ClassTest1.f of <wrapt_timeout_test_helper.ClassTest1 object at 0x000002BE74FC1F00>>
T1: <class 'method'>
# T1
F1: <function ClassTest1.f at 0x000002BE74FB8D30>
F2: <function _create_function at 0x000002BE74A244C0>
# F2
Co: <code object f at 0x000002BE74FAF050, file "C:\opt/rotek-apps/lib/bitranox/wrapt_timeout_decorator/tests\wrapt_timeout_test_helper.py", line 13>
F2: <function _create_code at 0x000002BE74A24550>
# F2
# Co
D4: <dict object at 0x000002BE74A2AC40>
# D4
D2: <dict object at 0x000002BE74FF0C40>
# D2
# F1
T4: <class 'wrapt_timeout_test_helper.ClassTest1'>
# T4
D2: <dict object at 0x000002BE72C9AC40>
# D2
# Me
D2: <dict object at 0x000002BE74FBE580>
# D2
F2: <function rebuild_pipe_connection at 0x000002BE74FEAA70>
# F2
T4: <class 'multiprocess.reduction.DupHandle'>
# T4
D2: <dict object at 0x000002BE74FF1E80>
# D2
# D2
D2: <dict object at 0x000002BE72CA00C0>
# D2
# D2
Thanks for your effort. Greetings from Vienna, Austria,
Robert
That output is very strange. The trace is balanced (every object entered is also exited), which means that dill is not going into infinite regress; multiprocess is just executing the same code over and over. It is even more bizarre that it only affects dill.
Dear anivegesana,
I guess it is the combination of some black magic from the wrapt decorator, together with multiprocess and dill. Just guessing.
The thing is, it worked fine up to (and including) dill version 0.3.4.
It's a pity, because my users are using that timeout decorator exactly because it worked well with Windows. I pinned the dill version to 0.3.4 in the requirements.txt on Windows for the time being, but I hope you can find the bug ...
Yours sincerely, Robert
I am still having trouble reproducing this bug. I am getting the error you described when you pickled the function. I'll try other Windows computers, but it might be dependent on the environment and I might need to try it on a CI which will be incredibly painful to debug. ~First, I think I should make a small patch to the issue with pickling wrapt classes which hopefully will resolve the issue without resorting to that.~
Based on GrahamDumpleton/wrapt#158, I believe this is likely mostly a conscious choice by the wrapt author as opposed to a problem in dill or multiprocess. multiprocess should not spin into an infinite loop and silently fail in very strange corner cases, so I believe that this issue should still be left up or moved over to multiprocess in case we can find out what is causing that strange behavior.
They created the change to wrapt that disabled pickling of ObjectProxy objects in this commit: GrahamDumpleton/wrapt/13c55faf7d3e83e184b8a150f0df3de294698b25
So it is still a mystery how the new version of dill changed behavior for this error to bubble up now and why it was ignored earlier. Also, what was the monkey patch you referenced in https://github.com/bitranox/wrapt_timeout_decorator/issues/13#issuecomment-483225115?
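The wrapt commit referenced above makes ObjectProxy raise from __reduce_ex__, which is exactly the failing frame (rv = reduce(self.proto)) in the traceback earlier. A minimal stand-in class (not wrapt itself) reproduces the shape of the error:

```python
import pickle


class ObjectProxyLike:
    """Stand-in that mimics wrapt.ObjectProxy's refusal to be pickled."""

    def __init__(self, wrapped):
        self.wrapped = wrapped

    def __reduce_ex__(self, protocol):
        # wrapt raises here on purpose, so the stock Pickler (which dill
        # builds on) fails with this exact message at rv = reduce(self.proto)
        raise NotImplementedError("object proxy must define __reduce_ex__()")


try:
    pickle.dumps(ObjectProxyLike(len))
except NotImplementedError as exc:
    print(exc)  # object proxy must define __reduce_ex__()
```

This is why any pickler - pickle, dill, or multiprocess's ForkingPickler - hits the same wall on a wrapt-decorated function unless the proxy class opts in.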
Dear anivegesana,
I am still having trouble reproducing this bug
Which trouble? It is kind of hard to debug in subprocesses, but I think you should concentrate on debugging multiprocess & pickle together, instead of only pickling the function itself. That also did not work in the past.
I am getting the error you described when you pickled the function
That error also occurs with dill 0.3.4 - but multiprocess is working with my wrapt_timeout_decorator and dill 0.3.4. So it seems it does NOT have anything to do with that NotImplementedError: object proxy must define __reduce_ex__() - indeed, that error does NOT pop up in the trace when using multiprocess. The changes between 0.3.4 and 0.3.5 are causing the trouble - maybe you can git bisect those changes, concentrating on Windows-specific code paths (if sys.platform == 'win32' or similar? ...).
Also, what was the monkey patch you referenced in https://github.com/bitranox/wrapt_timeout_decorator/issues/13#issuecomment-483225115?
I don't remember - I did not tinker around with dill or multiprocess.
I really don't like mysteries ;-) - it means there is a hidden bug in the code, which should be found, even if it is "hard to debug". You have all the tools at hand to find that bug - just replay the changes between 0.3.4 and 0.3.5. Your argument "I don't have a Windows machine" seems a little bit odd in times of virtual machines ...
Please be so kind as to consider helping here, because a number of astronomers rely on that package.
yours sincerely
Robert
Well, the mystery is indeed what is causing the strange behavior. The root cause is unknown, and it could be a bug in multiprocess, dill, or wrapt. It being a Windows-only bug makes it more difficult, even in the day of VMs, as one needs a copy of Windows and potentially some time investment to set up the VM. Let's keep it civil, and figure it out together.
So step (1) is make sure whatever issue is being reported can be reproduced. Then step (2) is try to use diagnostic tools to find the root cause. dill has dill.detect.trace and some other tools in the dill.detect module. You can see if the issue shows up with the serialization variants in dill.settings. You can check if it shows up in multiprocess when the settings are changed as well -- such as trying the spawn multiprocess.context on a mac/linux box, or similarly trying to pickle with ForkingPickler from multiprocess. multiprocess also has a number of loggers, including a debugger and a subprocess debugger. pathos has subprocess profilers in pathos.profile that can help see what functions are called that might not show up in the traceback or in the debugger.
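One of the checks listed above - pickling with ForkingPickler directly - can be sketched with the stdlib multiprocessing (multiprocess mirrors this API, with dill swapped in for pickle). It exercises the exact pickler a child-process launch would use, without spawning anything:

```python
import io
import os
import pickle
from multiprocessing.reduction import ForkingPickler

# Dump a would-be process target with the same pickler multiprocessing uses
# when it launches a child; if this dump raises, Process(...).start() would
# fail in the same way, so it isolates serialization from process spawning.
# (os.getpid is just a stand-in for the real target function.)
buf = io.BytesIO()
ForkingPickler(buf, pickle.HIGHEST_PROTOCOL).dump((os.getpid, (), {}))
payload = buf.getvalue()

# For plain objects the payload round-trips with ordinary pickle:
func, args, kwargs = pickle.loads(payload)
print(func is os.getpid)  # True
```

Running this with the decorated method as the target, under both dill versions, would show whether the loop starts at serialization time or at process-start time.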
I did not intend to make it seem that your problem is not a problem. In fact, I wanted to imply that there are likely multiple problems in multiple libraries, which makes it very hard to diagnose, and we should try to break it up into smaller pieces.
All I meant by not being able to reproduce it was that on some Windows computers it has one behavior and on other Windows computers it has another, which is very bad and is the first problem that should be tackled. It is impossible to get anywhere else without consistency.
I do have access to VMs and am trying it on different VM platforms, hoping to find a difference that is causing the behavior.
pathos.profile will probably help out a lot and I'll use it once I find a VM that causes the issue.
The most likely reason was that a change in dill broke a monkey patch that multiprocess made for Windows, but it breaks it differently based on other properties of the environment depending on what that patch does. I think it is in multiprocess because the other libraries are platform agnostic and there shouldn't be a reason that it performs differently on different machines with the same OS.
One thing I noticed was that you ran your test that passed on Windows with Python 3.9 and the one that failed on Windows with Python 3.10. I tried it out on Python 3.7. That might be a confounding factor here. When I find time, I'll try it out on different Python versions and see if that is the cause of the varying results.
Dear mmckerns, dear anivegesana,
Well, the mystery is indeed what is causing the strange behavior. The root cause is unknown, and it could be a bug in multiprocess, dill, or wrapt.
I did not intend to make it seem that your problem is not a problem. In fact, I wanted to imply that there are likely multiple problems in multiple libraries that makes it very hard to diagnose and we should try to break it up into smaller pieces.
Everything absolutely true, but dill is the only component that changed - and we know the bug was introduced between 0.3.4 and 0.3.5 - and it only occurs on Windows - so this should help to find the issue.
It being a Windows-only bug makes it more difficult, even in the day of VMs, as one needs a copy of Windows and potentially some time investment to set up the VM.
I disagree here - tests on Windows are absolutely necessary, because many things work differently on Windows - especially forking for multiprocess. Therefore a Windows test VM should always be in your toolbox ...
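The platform difference behind this point is visible directly in the process start method; a hedged illustration with the stdlib multiprocessing (multiprocess exposes the same API):

```python
import multiprocessing

# 'fork' has traditionally been the default on Linux, while Windows only
# supports 'spawn', which pickles the target and re-imports the parent
# module in every child -- the root of most Windows-only differences in
# multiprocessing-based code.
method = multiprocessing.get_start_method()
print(method)
print(multiprocessing.get_all_start_methods())
```

Forcing `multiprocessing.get_context("spawn")` on a Linux or macOS box is therefore the cheapest way to approximate the Windows behavior without a Windows VM.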
One thing I noticed was that you ran your test that passed on Windows with Python 3.9 and the one that failed on Windows with Python 3.10. I tried it out on Python 3.7. That might be a confounding factor here.
I just updated the tests from 3.9 to 3.10 AFTER working on that issue. It makes no difference whether it is 3.9 or 3.10.
Let's keep it civil, and figure it out together
Sure - if I came across as rude, please accept my sincere apologies - please consider the German/Austrian mentality, which is absolutely straightforward (well, Austrians a bit less than Germans ...).
The most likely reason was that a change in dill broke a monkey patch that multiprocess made for Windows, but it breaks it differently based on other properties of the environment depending on what that patch does. I think it is in multiprocess because the other libraries are platform agnostic and there shouldn't be a reason that it performs differently on different machines with the same OS.
Sounds good. You can let all the components stay the same and just change dill from 0.3.4 to 0.3.5.
I totally agree with your opinion; in fact I mentioned it before:
https://github.com/uqfoundation/dill/issues/2 - it works under Linux, also with dill '0.3.5' - so you might concentrate on the differences between the Linux and win32 implementations? There should be some if sys.platform == 'win32' or similar? ...
Greetings from Vienna, Austria, and thanks for your help!
yours sincerely
Robert
If the only difference is the dill version and it works on some Windows computers and not others, I would assume that the wait has something to do with it. I tried it out on a big VM with 10 CPU cores and GitHub Actions probably has just 2. If I were to throw out a wild guess, I would assume that it is related to #482, where a bug fix to dill's handling of recursive functions (#443) caused deeper searches for objects to pickle. Does adding dill.settings['recurse'] = True to the beginning of the test help? I am at a loss though, and the only way that I could hope to help is by recreating the entire setup on GitHub Actions (as opposed to Azure/others like I am trying now) and running your tests there.
The most likely reason was that a change in dill broke a monkey patch that multiprocess made for Windows, but it breaks it differently based on other properties of the environment depending on what that patch does. I think it is in multiprocess because the other libraries are platform agnostic and there shouldn't be a reason that it performs differently on different machines with the same OS.
What specifically are you referring to? I know the multiprocess code pretty well... and I should be aware of any patches I've made for Windows. One thing to note about multiprocess is that the source code is different for each version of Python, and there are portions that are different for each platform.
I haven't looked into the multiprocess source code yet. That was more of a list of places that I plan to look next rather than a particular fix that I was suggesting.
Dear anivegesana,
adding dill.settings['recurse'] = True does not make any difference.
The Windows version seems not to matter - I have exactly the same behaviour on GitHub Actions and on my local machine with 8 cores and Windows 11 ...
yours sincerely
Robert
The mods to multiprocess relative to multiprocessing are pretty small, so generally if there's an issue in multiprocess, it's also in the standard library's multiprocessing as well.
@bitranox: Can you specify a matrix of the versions of Python, dill, multiprocess, and wrapt that you see the issue in, and where you don't?
Dear mmckerns, that matrix seems easy - it worked all the way back to 2017, until now, on Windows 7 / Windows 10 / Windows 11, with Python 3.7/3.8/3.9/3.10, so the difference is really only the new dill version ...
I guess rolling back the changes gradually from 0.3.5 to 0.3.4 should be the easiest way!
Yours sincerely, Robert
Thank you for your patience with this bug.
@bitranox: can you generate a matrix, even if it's easy? I haven't seen a quote on the versions of multiprocess or wrapt that show the issue. I think we are all assuming that a change to dill is what triggered the issue. However, if you can explore version combinations a bit, it might help diagnose what's going on. @anivegesana: stepping through the commits to dill (starting with the two quoted above) is probably the right thing to do.
Dear anivegesana, https://github.com/uqfoundation/dill/commit/04aba03432955f64c05441ffe0553af987c087ff is working
Dear mmckerns, creating the matrix would be a lot of work for no benefit. I have found no combination so far that works with 0.3.5, and I haven't found any combination NOT working with 0.3.4.
I really try to make it easy for you folks, but let's concentrate on something that looks more beneficial:
stepping through the commits to dill (starting with the two quoted above) is probably the right thing to do.
Yours sincerely Robert
Ok. Not expected. Most other issues from this week were concerning that PR. That narrows it down significantly. How about e2831d04ae64d6dc8686a3c5ed1eb703dc748d43?
d2f916c539967150ae21240b7c2922023cd215ac?
Ok. It is somewhere between. 23c47455da62d4cb8582d8f98f1de9fc6e0971ad?
5bd56a8e0096c3f45f4c37ff0b67ac6aa341f22b?
Last one: 914d47f7f2c2d7fb0d91ff7c3e48a9c6a22a2b0a? Then we know for sure.
https://github.com/uqfoundation/dill/commit/914d47f7f2c2d7fb0d91ff7c3e48a9c6a22a2b0a NOT working. Let's double check - which one was the commit before that?
The parent of that is e2831d04ae64d6dc8686a3c5ed1eb703dc748d43. So I know what to look for now. Will get back to you with a fix. Thank you for your patience and help with debugging this!
Yes, double checked, https://github.com/uqfoundation/dill/commit/e2831d04ae64d6dc8686a3c5ed1eb703dc748d43 IS working.
Ok. I found out that running the test suite with PyTest instead of just running the test manually lets me reproduce your error. This is not an issue with your library or multiprocess; it is an issue between dill and PyTest like #482. The error message I was getting back up here was because the console was implicitly __main__, which would break multiprocess anyway.
Now that I understand multiprocess and this issue, I can be more useful. ~It is probably not exactly the same as #482, so I'll do some more poking to see if it is solved by #486 or not.~ As far as I can tell, it is exactly the same as #482 and the tests pass. Rest assured, a fix will be available in dill 0.3.5.2 (if that is still on @mmckerns' radar in June, else dill 0.3.6).
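The __main__ point is the classic Windows/spawn pitfall: without a guard, each child re-imports the parent script and re-executes the spawn, producing exactly the kind of recursive process chain seen in the trace above. A minimal, hypothetical sketch with the stdlib multiprocessing (multiprocess behaves the same way):

```python
import multiprocessing


def work(conn):
    # runs in the child process; report back over a pipe
    conn.send("done")
    conn.close()


def run_once():
    parent_end, child_end = multiprocessing.Pipe()
    p = multiprocessing.Process(target=work, args=(child_end,))
    p.start()
    result = parent_end.recv()
    p.join()
    return result


# Under spawn (the only start method on Windows), the child re-imports this
# file; without this guard, that import would call run_once() again and
# again, recursively spawning children until a timeout kills the chain.
if __name__ == "__main__":
    print(run_once())
```

Code that runs at import time - a doctest runner, a PyTest collection hook, or a console where __main__ is implicit - sits outside this guard and is therefore re-executed in every spawned child.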
Dear anivegesana, that sounds great, thank you for your hard work. Yours sincerely, Robert
Yes, we should expect dill to emit a new release in June or soon thereafter. The milestone is tagged as dill-0.3.6. If you want to apply it to this and other PRs, go ahead. @anivegesana: if you are sure it's the case, then we should tag and close this issue as a duplicate of #482.
And thank you both for being patient. This was a very frustrating bug that kept slipping out from underneath my nose due to all of the moving parts and because I kept searching in the wrong places.
I don't think I have permissions to do that. If it would be more convenient for me to have them because of my regular contributions, I believe the appropriate role is triage:
It is hopefully broad enough to allow contributors to help with managing the repo, while not being permissive enough to cause a dominictarr/event-stream situation, because there is no way to change the code on master without going through you.
Dear @anivegesana,
I am the author of the wrapt_timeout_decorator.
Since version 0.3.5 of dill, two of our tests are failing under Windows. In those tests we use our decorator to decorate some class methods, which are then called in a process via multiprocess (which implicitly uses dill) (see this test). Those tests are just fine with version 0.3.4 on Windows/Linux/macOS, or 0.3.5.x on Linux/macOS, but they fail with dill >0.3.4 on Windows.
What seems to happen is that the decorator recursively calls itself, creating a chain of subprocesses. In order to debug that behaviour you might set the timeout to 3600 seconds in those tests and print some debug information to stderr.
I really struggle to make a minimal version that shows the regression, therefore I would ask you for help, in order to be able to use dill 0.3.5 upwards for my community.
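The stderr breadcrumb suggested above can be as small as a PID marker; a recursively respawning chain then shows up as an ever-growing list of distinct PIDs (debug_marker is a hypothetical helper, not part of wrapt_timeout_decorator):

```python
import os
import sys


def debug_marker(label):
    # stderr survives most test-runner output capture settings; each
    # (re)spawned process announces its own PID and its parent's PID,
    # so a recursive spawn chain is visible as a ladder of ppid -> pid
    print(f"[{label}] pid={os.getpid()} ppid={os.getppid()}",
          file=sys.stderr, flush=True)


debug_marker("target-start")
```

Placing one marker at the top of the decorated function and one in the spawning wrapper would show immediately which side is re-entering.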
You can find the failed jobs with dill 0.3.5.1 here and the same tests passing with dill 0.3.4.0 here.
yours sincerely
Robert, Vienna