pleiszenburg / zugbruecke

Calling routines in Windows DLLs from Python scripts running under Linux, MacOS or BSD
https://zugbruecke.readthedocs.io/en/latest/
GNU Lesser General Public License v2.1
108 stars 11 forks source link

How to find reason for session-server time out #78

Open stefnet00 opened 2 years ago

stefnet00 commented 2 years ago

Hello all, I have a Windows DLL file running on a CentOS host using zurbruecke. Unfortunately I very often get a time out when starting a session. Is there a good way to find the reason for this? The same setup on OpenSUSE Leap is much more stable.

The logging of zugbrucke is already active:

cat ~/.zugbruecke/.zugbruecke.json 
{"log_write": true, "log_level": 10, "wineprefix": "warn+all"}
cat ~/work/UNIX_out.txt 
{"level": 1, "platform": "UNIX", "id": "1cef5750", "time": 1635776808.0, "pipe": "out", "cnt": "[session-client] STARTING (STAGE 1) ..."}
{"level": 1, "platform": "UNIX", "id": "1cef5750", "time": 1635776808.02, "pipe": "out", "cnt": "[session-client] Configured Wine-Python version is 3.5.3 for win32."}
{"level": 1, "platform": "UNIX", "id": "1cef5750", "time": 1635776808.02, "pipe": "out", "cnt": "[session-client] Log socket port: 51634."}
{"level": 1, "platform": "UNIX", "id": "1cef5750", "time": 1635776808.03, "pipe": "out", "cnt": "[session-client] STARTED (STAGE 1)."}
{"level": 1, "platform": "UNIX", "id": "1cef5750", "time": 1635776808.03, "pipe": "out", "cnt": "[session-client] STARTING (STAGE 2) ..."}
{"level": 1, "platform": "UNIX", "id": "1cef5750", "time": 1635776808.04, "pipe": "out", "cnt": "[interpreter] STARTING ..."}
{"level": 1, "platform": "UNIX", "id": "1cef5750", "time": 1635776808.04, "pipe": "out", "cnt": "[interpreter] Command: wine ~/.local/lib/python3.6/site-packages/zugbruecke/_server_.py --id 1cef5750 --port_socket_wine 54478 --port_socket_unix 51634 --log_level 10 --log_write 1"}
{"level": 1, "platform": "UNIX", "id": "1cef5750", "time": 1635776808.07, "pipe": "out", "cnt": "[interpreter] Started with PID 1653."}
{"level": 1, "platform": "UNIX", "id": "1cef5750", "time": 1635776808.08, "pipe": "out", "cnt": "[interpreter] Logging threads started."}
{"level": 1, "platform": "UNIX", "id": "1cef5750", "time": 1635776808.08, "pipe": "out", "cnt": "[interpreter] STARTED."}
{"level": 1, "platform": "UNIX", "id": "1cef5750", "time": 1635776808.09, "pipe": "out", "cnt": "[session-client] Waiting for session-server to be up ..."}
{"level": 1, "platform": "UNIX", "id": "1cef5750", "time": 1635776838.1, "pipe": "out", "cnt": "[session-client] ... wait timed out (after 30.01 seconds)."}
{"level": 1, "platform": "UNIX", "id": "1cef5750", "time": 1635776838.14, "pipe": "out", "cnt": "[session-client] TERMINATING ..."}
{"level": 1, "platform": "UNIX", "id": "1cef5750", "time": 1635776838.14, "pipe": "out", "cnt": "[session-client] TERMINATED."}

A WINE_out.txt is not written when timed out. Only if start up was successful, see below.

$ cat ~/work/UNIX_out.txt 
{"level": 1, "platform": "UNIX", "id": "f49d1ee4", "time": 1635777549.28, "pipe": "out", "cnt": "[session-client] STARTING (STAGE 1) ..."}
{"level": 1, "platform": "UNIX", "id": "f49d1ee4", "time": 1635777549.28, "pipe": "out", "cnt": "[session-client] Configured Wine-Python version is 3.5.3 for win32."}
{"level": 1, "platform": "UNIX", "id": "f49d1ee4", "time": 1635777549.28, "pipe": "out", "cnt": "[session-client] Log socket port: 33342."}
{"level": 1, "platform": "UNIX", "id": "f49d1ee4", "time": 1635777549.29, "pipe": "out", "cnt": "[session-client] STARTED (STAGE 1)."}
{"level": 1, "platform": "UNIX", "id": "f49d1ee4", "time": 1635777549.29, "pipe": "out", "cnt": "[session-client] STARTING (STAGE 2) ..."}
{"level": 1, "platform": "UNIX", "id": "f49d1ee4", "time": 1635777549.29, "pipe": "out", "cnt": "[interpreter] STARTING ..."}
{"level": 1, "platform": "UNIX", "id": "f49d1ee4", "time": 1635777549.3, "pipe": "out", "cnt": "[interpreter] Command: wine ~/.zugbruecke/win32-python3.5.3/python.exe ~/.local/lib/python3.6/site-packages/zugbruecke/_server_.py --id f49d1ee4 --port_socket_wine 59266 --port_socket_unix 33342 --log_level 10 --log_write 1"}
{"level": 1, "platform": "UNIX", "id": "f49d1ee4", "time": 1635777549.31, "pipe": "out", "cnt": "[interpreter] Started with PID 25984."}
{"level": 1, "platform": "UNIX", "id": "f49d1ee4", "time": 1635777549.31, "pipe": "out", "cnt": "[interpreter] Logging threads started."}
{"level": 1, "platform": "UNIX", "id": "f49d1ee4", "time": 1635777549.31, "pipe": "out", "cnt": "[interpreter] STARTED."}
{"level": 1, "platform": "UNIX", "id": "f49d1ee4", "time": 1635777549.31, "pipe": "out", "cnt": "[session-client] Waiting for session-server to be up ..."}
{"cnt": "[session-server] STARTING ...", "level": 1, "time": 1635777555.08, "platform": "WINE", "pipe": "out", "id": "f49d1ee4"}
{"cnt": "[mp-server] STARTING ...", "level": 1, "time": 1635777555.1, "platform": "WINE", "pipe": "out", "id": "f49d1ee4"}
{"cnt": "[mp-server] Log attached.", "level": 1, "time": 1635777555.11, "platform": "WINE", "pipe": "out", "id": "f49d1ee4"}
{"cnt": "[mp-server] STARTED.", "level": 1, "time": 1635777555.12, "platform": "WINE", "pipe": "out", "id": "f49d1ee4"}
{"cnt": "[session-server] ctypes server is listening on port 59266.", "level": 1, "time": 1635777555.13, "platform": "WINE", "pipe": "out", "id": "f49d1ee4"}
{"cnt": "[session-server] STARTED.", "level": 1, "time": 1635777555.14, "platform": "WINE", "pipe": "out", "id": "f49d1ee4"}
{"cnt": "[session-server] Serve forever ...", "level": 1, "time": 1635777555.15, "platform": "WINE", "pipe": "out", "id": "f49d1ee4"}
{"level": 1, "platform": "UNIX", "id": "f49d1ee4", "time": 1635777555.17, "pipe": "out", "cnt": "[session-client] ... session server is up (after 5.85 seconds)."}
{"level": 1, "platform": "UNIX", "id": "f49d1ee4", "time": 1635777555.22, "pipe": "out", "cnt": "[session-client] STARTED (STAGE 2)."}
{"level": 1, "platform": "UNIX", "id": "f49d1ee4", "time": 1635777555.22, "pipe": "out", "cnt": "[session-client] Attaching to DLL file \"~/work/WIN.dll\" with calling convention \"cdll\" ..."}
{"cnt": "[session-server] Attaching to DLL file \"~/work/WIN.dll\" with calling convention \"cdll\" ...", "level": 1, "time": 1635777555.22, "platform": "WINE", "pipe": "out", "id": "f49d1ee4"}
{"cnt": "[session-server] ... attached.", "level": 1, "time": 1635777555.31, "platform": "WINE", "pipe": "out", "id": "f49d1ee4"}
{"level": 1, "platform": "UNIX", "id": "f49d1ee4", "time": 1635777555.33, "pipe": "out", "cnt": "[session-client] ... attached."}
s-m-e commented 2 years ago

Hi @stefnet00. To be honest, the original mechanism is extremely fragile and hard to debug (aka crap). I completely redesigned it, but have not yet found time to release it. Can you test the develop branch and see if you get better / more stable results? Do not forget to have a look at the change logs for v0.1.0 - there are some breaking API changes as well.

s-m-e commented 2 years ago

To answer your actual question, the "old" (current release) implementation does not capture all of the Wine process' stdout and stderr output. What you are seeing on CentOS is likely some crash on the Wine side. Python on top of Wine or Wine itself for some reason or a segmentation fault due to your DLL or some 32bit vs 64bit problem. The Unix side, i.e. your application on Linux, waits for the crashed process to appear and times out - and because of its inability to capture output properly, you can not see this in the logs as you would expect. There is no proper way to debug this.

The development branch has a proper thread- & queue-based implementation for capturing the output of the Wine process. You can expect to get all of it. The launch of the Wine process itself does not rely on some shaky Bash-script mechanism anymore but instead uses os.execv, which is a lot simpler and much more reliable.

stefnet00 commented 2 years ago

Thanks @s-m-e. I switched to the develop branch and made some changes in my script. However, I also had to modify some lines in zugbruecke:

core/wenv.py: On both, OpenSUSE and centOS I had to correct the name of the unix_egg_path and wine_egg_path. On both distributions they are named zugbruecke-0.1.0.dist-info instead of zugbruecke.egg-info. (Installed them via pip3 --user git+https://github.com/pleiszenburg/zugbruecke.git@develop)

When activating the log after ìmport zugbruecke.ctypes using ctypes._zb_set_parameter('log_write', True) a write error in core/log.py:136 occured due to the fact that self._f was still None. I changed the __init__ to define self._f in any case.

However, I can not say that my setup is running more stable now. Somehow importing zugbruecke fails on some nodes of the cluster.

    import zugbruecke.ctypes as ctypes
  File "~/.local/lib/python3.6/site-packages/zugbruecke/ctypes/__init__.py", line 44, in <module>
    _session = _CtypesSession()
  File "~/.local/lib/python3.6/site-packages/zugbruecke/core/session.py", line 88, in __init__
    self._zb_current_session = SessionClient(config=config)
  File "~/.local/lib/python3.6/site-packages/zugbruecke/core/session_client.py", line 124, in __init__
    self._wait_for_server_status_change(target_status=True)
  File "~/.local/lib/python3.6/site-packages/zugbruecke/core/session_client.py", line 338, in _wait_for_server_status_change
    raise TimeoutError("session server did not appear")
TimeoutError: session server did not appear
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "~/.local/lib/python3.6/site-packages/zugbruecke/core/session_client.py", line 261, in terminate
    self._terminate_on_server()
AttributeError: 'SessionClient' object has no attribute '_terminate_on_server'

And the zugbruecke log

(1635948257.87/91855baf) UNIX o: [session-client] STARTING ...
(1635948258.51/91855baf) UNIX o: [session-client] Configured Wine-Python version is 3.7.4 for win32.
(1635948259.25/91855baf) UNIX o: [session-client] Log socket port: 59931.
(1635948259.30/91855baf) UNIX o: [interpreter] STARTING ...
(1635948259.56/91855baf) UNIX o: [interpreter] Command: -m zugbruecke._server_ --id 91855baf --port_socket_wine 59275 --port_socket_unix 59931 --log_level 10 --log_write 1 --timeout_start 30
(1635948259.57/91855baf) UNIX o: [interpreter] Environment: {[...], 'WENV_ARCH': 'win32', 'WENV_PYTHONVERSION': '3.7.4'}
(1635948259.58/91855baf) UNIX o: [interpreter] Started with PID 25823.
(1635948259.59/91855baf) UNIX o: [interpreter] Starting stream reader threads ...
(1635948259.59/91855baf) UNIX o: [interpreter] Stream reader threads started.
(1635948259.59/91855baf) UNIX o: [interpreter] Starting log processing thread ...
(1635948259.60/91855baf) UNIX o: [interpreter] Log processing thread started.
(1635948259.60/91855baf) UNIX o: [interpreter] STARTED.
(1635948269.60/91855baf) UNIX o: [session-client] Waiting for session-server to be up ...
(1635948299.61/91855baf) UNIX o: [session-client] ... wait timed out (after 30.00 seconds).
(1635948299.63/91855baf) UNIX o: [session-client] TERMINATING ...

I still don't know why the session server does not start up correctly. Do you have any hint how to get that out?

s-m-e commented 2 years ago

Hey thanks for trying. You ran into a fair number of distinct bugs ... Most importantly, the code responsible for catching a failed Wine process does not reliably catch anything, apparently. This needs fixing, but I am lacking sufficient time at the moment. Hence two quick and dirty ideas:

1) You can get some proper output as follows:

2) Can you share a reproducible example of some sort? A VM or docker image perhaps?

stefnet00 commented 2 years ago

Well, I tried to make the changes you suggested but could not get more outputs from wine. _read_stream did not change anything and I could see any call of this method either. (I inserting a simple print() at the beginning of it.) Unfortunately unbuffer is not available on the cluster nodes. Hence I only set the parameter bufsize = 0. But it did not give any further outputs.

Second I was checking a minimal setup which would be able to run in a VM. When trying this I could not reproduce the error... Don't know exactly why. For now the only way is getting the output from wine when it's running into a time out.

I think it's less effort at the moment to start the job a couple of times until one is running. Thanks for your quick answers and ideas on solving it. I'll stay in touch.

stefnet00 commented 2 years ago

@s-m-e: I did some more investigations on the reason of the time out. After many different tests and tries to catch the output of subprocess.Popen there was surprising effect.

Because I could catch the output of the original subprocess.Popen I inserted a dummy process in line 110:

# Fire up test
self._log.out("[interpreter] Fire up dummy process")
subprocess.call(['wenv', 'python', '-c', 'print("foo")'])

After that, all jobs start properly. Before only around one out of ten jobs started successfully.

I really don't know why..

s-m-e commented 2 years ago

Thanks for the feedback. I might have a thin idea, but I'd really like to reproduce your environment first. How much of its configuration can you share? I remember CentOS - which version? x86_64, I assume? Patch-level? Any customization deviating from CentOS' original packages? Which Wine version/build/repo? Python build/version? Are there containers (Docker and friends) involved? What file system(s) is this running on (locations of Python, Python standard library, Python site-packages, /tmp, /home)? Are there read-only file systems and/or overlay filesystems involved? What kind of user account with what kind of rights is running this?

s-m-e commented 2 years ago

I did a fair bit of cleanup in both zugbruecke and wenv recently. There were some oddities with respect to handling environment variables that I got rid of. Could you please check what happens if you run on the current develop branches (as of now, just pushed changes) of both packages?

s-m-e commented 2 years ago

I added a few configuration options to wenv which should be useful to you. I also cleaned up configuration in zugbruecke so there is a somewhat more clean division of labor between both packages.

stefnet00 commented 2 years ago

Thanks for the feedback. I might have a thin idea, but I'd really like to reproduce your environment first. How much of its configuration can you share? I remember CentOS - which version? x86_64, I assume? Patch-level? Any customization deviating from CentOS' original packages? Which Wine version/build/repo? Python build/version? Are there containers (Docker and friends) involved? What file system(s) is this running on (locations of Python, Python standard library, Python site-packages, /tmp, /home)? Are there read-only file systems and/or overlay filesystems involved? What kind of user account with what kind of rights is running this?

OK, tanks for responding so quickly.

Some information about the system:

$ rpm -q centos-release
centos-release-7-9.2009.1.el7.centos.x86_64

It's a bit difficult for me to give some more information on all the packages involved. A major part ist slurm (version 19.05.5) which is used to schedule the jobs. Starting the jobs directly without slurm was much more reliable than using slurm. Moreover it was more reliable in a minimal working example than within the quite complex simulation environment based on numerous of Python packages.

s-m-e commented 2 years ago

Just briefly: Configuration at time of creation, i.e.

from zugbruecke import CtypesSession

ctypes_368 = CtypesSession(pythonversion = "3.6.8")
stefnet00 commented 2 years ago

I have upgraded to the latest develop version. Nothing changes regarding the reliability of the job. Using the dummy call still increases the reliability drastically. But I got additional output when a job fails. It's not send using the default logging of zugbruecke, it's just plain, at the beginning:

[...]
Fatal Python error: Py_Initialize: can't initialize sys standard streams
OSError: [WinError 6] Invalid handle

Current thread 0x00000009 (most recent call first):
[...]
Traceback (most recent call last):
[...]
    import zugbruecke.ctypes as ctypes
  File "~/.local/lib/python3.6/site-packages/zugbruecke/ctypes/__init__.py", line 44, in <module>
    _session = _CtypesSession()
  File "~/.local/lib/python3.6/site-packages/zugbruecke/core/session.py", line 129, in __init__
    self._current_session = SessionClient(config=Config(**kwargs))
  File "~/.local/lib/python3.6/site-packages/zugbruecke/core/session_client.py", line 124, in __init__
    self._wait_for_server_status_change(target_status=True)
  File "~/.local/lib/python3.6/site-packages/zugbruecke/core/session_client.py", line 339, in _wait_for_server_status_change
    raise TimeoutError("session server did not appear")
TimeoutError: session server did not appear
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "~/.local/lib/python3.6/site-packages/zugbruecke/core/session_client.py", line 257, in terminate
    self._terminate_on_server()
AttributeError: 'SessionClient' object has no attribute '_terminate_on_server'
[...]
(1638369973.77/224c5f24) UNIX o: [session-client] STARTING ...
(1638369973.78/224c5f24) UNIX o: [session-client] Configured Wine-Python version is 3.7.4.stable for win32.
(1638369973.78/224c5f24) UNIX o: [session-client] Log socket port: 46602.
(1638369973.84/224c5f24) UNIX o: [interpreter] STARTING ...
(1638369973.84/224c5f24) UNIX o: [interpreter] Command: -m zugbruecke._server_ --id 224c5f24 --port_socket_wine 42637 --port_socket_unix 46602 --log_level 10 --log_write 1 --timeout_start 30
(1638369973.85/224c5f24) UNIX o: [interpreter] Environment: {[...], 'SLURM_MEM_PER_NODE': '25000', '_': '/usr/bin/python3', 'ZUGBRUECKE_LOG_WRITE': 'True', 'ZUGBRUECKE_LOG_LEVEL': '10', 'WENV_ARCH': 'win32', 'WENV_WINE_BIN_WIN32': 'wine', 'WENV_WINE_BIN_WIN64': 'wine64', 'WENV_WINE_BIN_ARM64': 'wine', 'WENV_WINEDEBUG': '-all', 'WENV_WINEINSTALLPREFIX': 'None', 'WENV_PREFIX': '~/.local', 'WENV_WINEPREFIX': '~/.local/share/wenv/win32', 'WENV_PYTHONPREFIX': '~/.local/share/wenv/win32/drive_c/python-3.6.8.stable', 'WENV_OFFLINE': 'False', 'WENV_CACHE': '~/.local/share/wenv/cache', 'WENV_PACKAGES': '~/.local/share/wenv/cache/packages', 'WENV__ISSUES_50_WORKAROUND': 'False'}
(1638369973.85/224c5f24) UNIX o: [interpreter] Fire up dummy process
(1638369975.13/224c5f24) UNIX o: [interpreter] Started with PID 18935.
(1638369975.15/224c5f24) UNIX o: [interpreter] Starting stream reader threads ...
(1638369975.15/224c5f24) UNIX o: [interpreter] Stream reader threads started.
(1638369975.15/224c5f24) UNIX o: [interpreter] Starting log processing thread ...
(1638369975.15/224c5f24) UNIX o: [interpreter] Log processing thread started.
(1638369975.16/224c5f24) UNIX o: [interpreter] STARTED.
(1638369975.16/224c5f24) UNIX o: [session-client] Waiting for session-server to be up ...
(1638369977.02/224c5f24) WINE o: [session-server] STARTING ...
(1638370005.18/224c5f24) UNIX o: [session-client] ... wait timed out (after 30.01 seconds).
(1638370010.21/224c5f24) UNIX o: [session-client] TERMINATING ...

PS: After upgrading I still hat to do the changes mention here.

stefnet00 commented 2 years ago

Or with an additional error output from zugbruecke

Fatal Python error: Py_Initialize: can't initialize sys standard streams
OSError: [WinError 6] Invalid handle

Current thread 0x00000009 (most recent call first):
(1638374394.69/4ff44657) UNIX e: [P] Fatal Python error: failed to get random numbers to initialize Python
(1638374394.69/4ff44657) UNIX e: [P] 
s-m-e commented 2 years ago

Ok, we're getting closer. This is something that I have only seen on Ubuntu 14.04 / Trusty Tahr. It's related to an "outdated" library, if I am not mistaken, and some issues related to symlinks, but I never fully identified the root cause.

s-m-e commented 2 years ago

See #50. For a while, I had a work-around option named _issues_50_workaround in wenv. I thought I did not need it anymore, so I removed it ... I guess I should add it again. Is this Wine 4 or even earlier?

stefnet00 commented 2 years ago

Yes, it's wine 4.0.4

s-m-e commented 2 years ago

Ok, even closer. Can you try to upgrade to at least Wine 4.16 (or 5 or even 6)? Alternatively (or in addition), can you try to install zugbruecke into a path where no single element (folder name etc) starts with a dot (.)? In your case, it is sitting underneath .local, which is a Unix-hidden folder, i.e. starting with a dot.

s-m-e commented 2 years ago

Sorry, at least Wine 4.17.

stefnet00 commented 2 years ago

Ok, updating certifi helped a lot. Initially it was version 2019.11.28. After updating I have 2021.10.08 and much more jobs are starting (without using the dummy process).

stefnet00 commented 2 years ago

Updating wine is not that easy to me. I would have to ask and it has already been much work to install this version wine.

s-m-e commented 2 years ago

Just looked up your CentOS version (7.9). Its tool-chain dates to 2014, just like in Trusty Tahr, so getting a younger version of Wine wont do the trick (it will even be prohibitively complicated to make it run in the first place). This also explains why certifi helps - your systems' SSL certificate stores are outdated, mildly put.

Test the thing with the dots in the path. If this does not work, I have one more idea. zugbruecke uses symlinks. One can remove them and use hard links or even copies of certain folders ... It's not ideal ...

stefnet00 commented 2 years ago

Just moving the package directory of zugbruecke did not work directly. Seems I need to modify some paths but do not have any more time now. I'll do it later.

Thanks a lot for all the help.

s-m-e commented 2 years ago

I just replicated part of your setup in a VM, CentOS 7.9 x86_64, python 3.6.8, Wine 4.0.4. It's an old system, lots of stuff to work around, and distinct issues in zugbruecke if I see correctly. I will report back to you when I have worked my way through them.

stefnet00 commented 2 years ago

I did some test without any dot in the package path of zugbruecke and wenv and no dots in the WENV_WINEPREFIX. No major difference regarding the reliability. However, I found a discrepancy in the python version:

>>> ctypes.zb_get_parameter('pythonversion')
<Python 3.7.4.stable (win32)>
>>> subprocess.check_output(['wenv', 'python', '-c', 'import sys; print(sys.version)'])
b'3.6.8 (tags/v3.6.8:3c6b436a57, Dec 23 2018, 23:31:17) [MSC v.1916 32 bit (Intel)]\r\n'