joblib / loky

Robust and reusable Executor for joblib
http://loky.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
528 stars 45 forks source link

FIX support for frozen executable on all platforms #375

Open tomMoral opened 1 year ago

tomMoral commented 1 year ago

This is an attempt to fix loky for frozen support for all plateforms. (cc #236 , joblib/joblib#1002). (I got nerd snipped :sweat_smile: )

This basically adds a freeze_support function that allows starting the worker with the right executable if it is called right after if __name__ == "__main__":. This PR also cleanups and refactor the command line generation, to avoid some duplicated code.

I used the following scripts to test and reproduce the error in pure loky:

import argparse
from loky import get_reusable_executor, freeze_support

def main():
    arguments_parser = argparse.ArgumentParser()
    arguments_parser.add_argument("-n", default=None, type=int)
    flags, _ = arguments_parser.parse_known_args()

    # safe guard otherwise new processes are spawned forever
    if flags.n is None:
        raise RuntimeError('main() was called again.')

    exc = get_reusable_executor(2)
    f = exc.submit(print, 1)
    f.result()
    print("Correct end!!!!")

if __name__ == '__main__':
    freeze_support()
    main()
pyinstaller --noconfirm --log-level=WARN --onedir --nowindow test.py
dist/test/test -n 1

this gives me the output:

Traceback (most recent call last):
  File "test.py", line 22, in <module>
    main()
  File "test.py", line 12, in main
    raise RuntimeError('main() was called again.')
RuntimeError: main() was called again.
[551219] Failed to execute script 'test' due to unhandled exception!
1
Correct end!!!!

Note that the error is due to the fact that actually, the resource tracker from cpython is also broken. I am not totally sure why it is actually started but this is not a problem for running the code. If the code from this PR actually fixes the issue, it could be ported to python to make the frozen support working with all platforms/start methods and with the resource tracker.

As a disclaimer, I did not test this on windows and I would be very glad to have some feedbacks from the original reporters. Maybe from @samuelstjean or @gdoras if you are still around. Sorry for the slow response....

TODO before merge:

israeldi commented 1 year ago

This sounds awesome! Hoping that freeze_support can finally be added to loky and work with pyinstaller. Thank you so much @tomMoral ☺️

samuelstjean commented 1 year ago

Are we still needed here? I can probably try to get it running, although I scrapped my original windows vm since then, but we now have github actions instead, so probably something to do there. Can't remember if it was also a problem on other platforms too, but that seems likely.

ogrisel commented 1 year ago

I think we need an integration test that runs a minimal loky-based program through pyinstaller and check that it runs correctly.

And we also need a docstring for this: https://github.com/joblib/loky/pull/375/files#r1061777509.

ogrisel commented 1 year ago

I tried this PR with pyinstaller on the following test program named frozen_loky.py):

import loky

if __name__ == "__main__":
    loky.freeze_support()
    e = loky.get_reusable_executor()
    print(sum(e.map(lambda x: x ** 2, range(1000))))
$ pyinstaller --version
5.8.0
$ pyinstaller frozen_loky.py
$ ./sdist/frozen_loky/frozen_loky
332833500
$ 332833500
332833500
332833500
332833500
332833500
332833500
[...]

and it creates many subprocesses that continuously output the result on stdout even after the parent process has completed (I had to pkill frozen_loky to terminate them all).

I tried both at 95ae81c and after the merge of the master branch at 00c8ded and got the same behavior in both cases so I think the problem does not comes.

This is running:

$ python -V
Python 3.10.2

on Linux. Note: I cannot get pyinstaller to work on my local macOS machine because of the problem with the codesign command.

ogrisel commented 1 year ago

Hum, I re-read the original description of the issue and maybe parsing the command line args to raise the RuntimeError is actually needed? This seems not very user friendly.

ogrisel commented 1 year ago

Hum, I re-read the original description of the issue and maybe parsing the command line args to raise the RuntimeError is actually needed? This seems not very user friendly.

I tried just that and could not get it to work either. Or maybe it's working as expected but the error messages on stderr are not expected.

ogrisel commented 1 year ago

Thanks @tomMoral! I confirm that the latest commit make my minimal snippet from https://github.com/joblib/loky/pull/375#issuecomment-1438287445 work fine!

ogrisel commented 1 year ago

Summary of the failures of my new test:

I will try to see if I can link some of those problems to known issues in the pyinstaller issue tracker.

samuelstjean commented 1 year ago

A few things from experience:

ogrisel commented 1 year ago

Thanks for the feedback. Note that the warnings I mentioned above when running pyinstaller under tox under conda env do not show up on the CI, yet we still get the segfaults under some macos and linux jobs.

So the warnings are probably harmless and unrelated. I will try to craft a few minimal CI envs to specifically run this test and remove the pyinstaller dependency from the tox.ini file so that it's not run from the other envs.

ogrisel commented 1 year ago

The Windows test seems to have uncovered a real problem in the concurrent future queues:

loky.process_executor._RemoteTraceback: 

"""
Traceback (most recent call last):
  File "loky\process_executor.py", line 426, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "multiprocessing\queues.py", line 108, in get
PermissionError: [WinError 5] Access is denied
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "frozen_loky.py", line 7, in <module>
  File "loky\process_executor.py", line 962, in _chain_from_iterable_of_lists
    for element in iterable:
  File "concurrent\futures\_base.py", line 621, in result_iterator
  File "concurrent\futures\_base.py", line 319, in _result_or_cancel
  File "concurrent\futures\_base.py", line 403, in __get_result
loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
ogrisel commented 1 year ago

In 27e0ad7, I changed the frozen program to use concurrent.futures.ProcessPoolExecutor and multiprocessing.freeze_support and it works without error on all 3 supported platforms.

ogrisel commented 1 year ago

The error observed under windows look similar to what was fixed by @pierreglaser in #216 but comparing with the way the executable is introspected in the current version of those files (spawn.py, popen_loky_win32.py and resource_tracker.py) I cannot spot the discrepancy (but I did not attempt to perform a full code diff either).

ogrisel commented 1 year ago

I reverted my last two commits, I don't have a windows machine at hand so it's really painful to debug via the CI. I should try to setup a windows vm on another local machine but I won't have time today unfortunately.

ogrisel commented 1 year ago

The macos failure (TestResourceTracker.test_resource_tracker[folder]) is random an unrelated to this PR.

ogrisel commented 1 year ago

For information, I tried to reproduce locally on a Windows VM and I can reproduce both with Python 3.11, 3.10 and 3.9 so it's not specific to a given Python version (as I suspected after seeing the seemingly related failure in #389) but rather a bug in loky itself.

ogrisel commented 1 year ago

Some of the latest changes cause the resource tracker start-up to crash on all windows jobs:

Traceback (most recent call last):
  File "X:\loky\loky\backend\spawn.py", line 283, in main
    exitcode = _main(fd, parent_sentinel=parent_sentinel)
  File "X:\loky\loky\backend\spawn.py", line 301, in _main
    prepare(preparation_data)
  File "X:\loky\loky\backend\spawn.py", line 195, in prepare
    _resource_tracker._fd = msvcrt.open_osfhandle(handle, 0)
OSError: [Errno 9] Bad file descriptor
tomMoral commented 1 year ago

Yes this is expected. The strategy put in place by multiprocessing is to not use inheritance anymore but to use duplication. I changed it for the main process but I also need to adapt the ressource_tracker strategy now that I understood the inheritance. hopefully I will get the time to investigate this by tomorrow.

tomMoral commented 1 year ago

After (too much) investigation, I think this is actually related to pyinstaller and the way the frozen programs are created that mess around with the permission of the worker process (see pyinstaller/pyinstaller#7183 and https://github.com/pyinstaller/pyinstaller/blob/develop/PyInstaller/hooks/rthooks/pyi_rth_multiprocessing.py)

Not sure how to fix this yet...

ogrisel commented 1 year ago

It's weird, I tried to use concurrent.futures.ProcessPoolExecutor / multiprocessing.freeze_support in the pyinstaller test of this PR and it passes. There must be something specific to the spawn variant implemented in loky that causes this problem.

ogrisel commented 1 year ago

For information, I tried to see if the refactoring in this PR would potentially fix the SIGSEGV(-11) crash observed on macOS in #388 but unfortunately this does not solve it.

tomMoral commented 1 year ago

It's weird, I tried to use concurrent.futures.ProcessPoolExecutor / multiprocessing.freeze_support in the pyinstaller test of this PR and it passes. There must be something specific to the spawn variant implemented in loky that causes this problem.

multiprocessing is monkeypatched but pyinstaller but loky is not so it is not weird that it fails on loky. The complex thing is, I don't get what part of the monkey patching I need ^^. I tried a bit but I had to move on. If I confirm this; then, we can xfail this test and add a hook for loky in pyinstaller to fix this and move on. But we have to make sure this is the case first...

ogrisel commented 1 year ago

+1 for opening an issue upstream on the pyinstaller issue tracker and xfailing the windows test case in loky in the mean time.

Please update the changelog accordingly to document that freeze support for windows is still a work in progress.

ogrisel commented 1 year ago

@tomMoral CI trigger

I believe you meant to merge or rebase with current master no?

ogrisel commented 1 year ago

Actually no need, because the latest CI config of master is used in the PR even if there are not up to date with master.

ogrisel commented 1 year ago

Unfortunately, this PR does not fix the TestsProcessPoolSpawnExecutor.test_max_depth[True] failure observed for Windows and Python 3.11.

ogrisel commented 1 year ago

There is another, maybe related failure in test_kill_process_tree. I opened #392.

ogrisel commented 1 year ago

Actually, the test_kill_process_tree failure in this PR is different from the one on master (as reported in #392). Here is the traceback:

================================== FAILURES ===================================
________________________ test_kill_process_tree[True] _________________________

use_psutil = True

    @pytest.mark.parametrize("use_psutil", [True, False])
    def test_kill_process_tree(use_psutil):
        psutil = pytest.importorskip("psutil")
        event = ctx_loky.Event()
        p = ctx_loky.Process(target=_run_nested_delayed, args=(4, 1000, event))
        p.start()

        # Wait for all the processes to be launched
        if not event.wait(30):
            kill_process_tree(p, use_psutil=use_psutil)
>           raise RuntimeError(
                "test_kill_process_tree was not able to launch "
                "all nested processes."
            )
E           RuntimeError: test_kill_process_tree was not able to launch all nested processes.

event      = <Event at 0x20e3e77e9d0 unset>
p          = <LokyProcess name='LokyProcess-20' pid=1792 parent=7720 stopped exitcode=15>
psutil     = <module 'psutil' from 'D:\\a\\1\\s\\.tox\\py311\\Lib\\site-packages\\psutil\\__init__.py'>
use_psutil = True

tests\test_loky_backend.py:721: RuntimeError
---------------------------- Captured stdout call -----------------------------

--------------------------------------------------------------------------------
None failed with traceback: 
--------------------------------------------------------------------------------

Traceback (most recent call last):
  File "D:\a\1\s\loky\backend\spawn.py", line 298, in main
    exitcode = _main(fd, parent_sentinel=parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\a\1\s\loky\backend\spawn.py", line 317, in _main
    self = pickle.load(from_parent)
           ^^^^^^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input