Open stefnet00 opened 2 years ago
Hi @stefnet00. To be honest, the original mechanism is extremely fragile and hard to debug (aka crap). I completely redesigned it, but have not yet found time to release it. Can you test the develop
branch and see if you get better / more stable results? Do not forget to have a look at the change logs for v0.1.0 - there are some breaking API changes as well.
To answer your actual question, the "old" (current release) implementation does not capture all of the Wine process' stdout
and stderr
output. What you are seeing on CentOS is likely some crash on the Wine side. Python on top of Wine or Wine itself for some reason or a segmentation fault due to your DLL or some 32bit vs 64bit problem. The Unix side, i.e. your application on Linux, waits for the crashed process to appear and times out - and because of its inability to capture output properly, you can not see this in the logs as you would expect. There is no proper way to debug this.
The development branch has a proper thread- & queue-based implementation for capturing the output of the Wine process. You can expect to get all of it. The launch of the Wine process itself does not rely on some shaky Bash-script mechanism anymore but instead uses os.execv, which is a lot simpler and much more reliable.
Thanks @s-m-e. I switched to the develop
branch and made some changes in my script. However, I also had to modify some lines in zugbruecke:
core/wenv.py
: On both, OpenSUSE and centOS I had to correct the name of the unix_egg_path
and wine_egg_path
. On both distributions they are named zugbruecke-0.1.0.dist-info
instead of zugbruecke.egg-info
. (Installed them via pip3 --user git+https://github.com/pleiszenburg/zugbruecke.git@develop
)
When activating the log after ìmport zugbruecke.ctypes
using ctypes._zb_set_parameter('log_write', True)
a write error in core/log.py:136
occured due to the fact that self._f
was still None
. I changed the __init__
to define self._f
in any case.
However, I can not say that my setup is running more stable now. Somehow importing zugbruecke fails on some nodes of the cluster.
import zugbruecke.ctypes as ctypes
File "~/.local/lib/python3.6/site-packages/zugbruecke/ctypes/__init__.py", line 44, in <module>
_session = _CtypesSession()
File "~/.local/lib/python3.6/site-packages/zugbruecke/core/session.py", line 88, in __init__
self._zb_current_session = SessionClient(config=config)
File "~/.local/lib/python3.6/site-packages/zugbruecke/core/session_client.py", line 124, in __init__
self._wait_for_server_status_change(target_status=True)
File "~/.local/lib/python3.6/site-packages/zugbruecke/core/session_client.py", line 338, in _wait_for_server_status_change
raise TimeoutError("session server did not appear")
TimeoutError: session server did not appear
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "~/.local/lib/python3.6/site-packages/zugbruecke/core/session_client.py", line 261, in terminate
self._terminate_on_server()
AttributeError: 'SessionClient' object has no attribute '_terminate_on_server'
And the zugbruecke log
(1635948257.87/91855baf) UNIX o: [session-client] STARTING ...
(1635948258.51/91855baf) UNIX o: [session-client] Configured Wine-Python version is 3.7.4 for win32.
(1635948259.25/91855baf) UNIX o: [session-client] Log socket port: 59931.
(1635948259.30/91855baf) UNIX o: [interpreter] STARTING ...
(1635948259.56/91855baf) UNIX o: [interpreter] Command: -m zugbruecke._server_ --id 91855baf --port_socket_wine 59275 --port_socket_unix 59931 --log_level 10 --log_write 1 --timeout_start 30
(1635948259.57/91855baf) UNIX o: [interpreter] Environment: {[...], 'WENV_ARCH': 'win32', 'WENV_PYTHONVERSION': '3.7.4'}
(1635948259.58/91855baf) UNIX o: [interpreter] Started with PID 25823.
(1635948259.59/91855baf) UNIX o: [interpreter] Starting stream reader threads ...
(1635948259.59/91855baf) UNIX o: [interpreter] Stream reader threads started.
(1635948259.59/91855baf) UNIX o: [interpreter] Starting log processing thread ...
(1635948259.60/91855baf) UNIX o: [interpreter] Log processing thread started.
(1635948259.60/91855baf) UNIX o: [interpreter] STARTED.
(1635948269.60/91855baf) UNIX o: [session-client] Waiting for session-server to be up ...
(1635948299.61/91855baf) UNIX o: [session-client] ... wait timed out (after 30.00 seconds).
(1635948299.63/91855baf) UNIX o: [session-client] TERMINATING ...
I still don't know why the session server does not start up correctly. Do you have any hint how to get that out?
Hey thanks for trying. You ran into a fair number of distinct bugs ... Most importantly, the code responsible for catching a failed Wine process does not reliably catch anything, apparently. This needs fixing, but I am lacking sufficient time at the moment. Hence two quick and dirty ideas:
1) You can get some proper output as follows:
line
variable is of type bytes
. Extract it somehow, perhaps by writing to a file.["wenv", "python", "-u"] + self._p["server_cli_params"]
to ["unbuffer", "wenv", "python", "-u"] + self._p["server_cli_params"]
, also see here. You can also, in addition, set the bufsize
parameter of Popen
to 0
, see here.2) Can you share a reproducible example of some sort? A VM or docker image perhaps?
Well, I tried to make the changes you suggested but could not get more outputs from wine. _read_stream
did not change anything and I could see any call of this method either. (I inserting a simple print()
at the beginning of it.) Unfortunately unbuffer
is not available on the cluster nodes. Hence I only set the parameter bufsize = 0
. But it did not give any further outputs.
Second I was checking a minimal setup which would be able to run in a VM. When trying this I could not reproduce the error... Don't know exactly why. For now the only way is getting the output from wine when it's running into a time out.
I think it's less effort at the moment to start the job a couple of times until one is running. Thanks for your quick answers and ideas on solving it. I'll stay in touch.
@s-m-e: I did some more investigations on the reason of the time out. After many different tests and tries to catch the output of subprocess.Popen
there was surprising effect.
Because I could catch the output of the original subprocess.Popen
I inserted a dummy process in line 110:
# Fire up test
self._log.out("[interpreter] Fire up dummy process")
subprocess.call(['wenv', 'python', '-c', 'print("foo")'])
After that, all jobs start properly. Before only around one out of ten jobs started successfully.
I really don't know why..
Thanks for the feedback. I might have a thin idea, but I'd really like to reproduce your environment first. How much of its configuration can you share? I remember CentOS - which version? x86_64, I assume? Patch-level? Any customization deviating from CentOS' original packages? Which Wine version/build/repo? Python build/version? Are there containers (Docker and friends) involved? What file system(s) is this running on (locations of Python, Python standard library, Python site-packages, /tmp, /home)? Are there read-only file systems and/or overlay filesystems involved? What kind of user account with what kind of rights is running this?
I did a fair bit of cleanup in both zugbruecke
and wenv
recently. There were some oddities with respect to handling environment variables that I got rid of. Could you please check what happens if you run on the current develop
branches (as of now, just pushed changes) of both packages?
I added a few configuration options to wenv which should be useful to you. I also cleaned up configuration in zugbruecke so there is a somewhat more clean division of labor between both packages.
Thanks for the feedback. I might have a thin idea, but I'd really like to reproduce your environment first. How much of its configuration can you share? I remember CentOS - which version? x86_64, I assume? Patch-level? Any customization deviating from CentOS' original packages? Which Wine version/build/repo? Python build/version? Are there containers (Docker and friends) involved? What file system(s) is this running on (locations of Python, Python standard library, Python site-packages, /tmp, /home)? Are there read-only file systems and/or overlay filesystems involved? What kind of user account with what kind of rights is running this?
OK, tanks for responding so quickly.
Some information about the system:
$ rpm -q centos-release
centos-release-7-9.2009.1.el7.centos.x86_64
It's a bit difficult for me to give some more information on all the packages involved. A major part ist slurm
(version 19.05.5) which is used to schedule the jobs. Starting the jobs directly without slurm was much more reliable than using slurm
. Moreover it was more reliable in a minimal working example than within the quite complex simulation environment based on numerous of Python packages.
Just briefly: Configuration at time of creation, i.e.
from zugbruecke import CtypesSession
ctypes_368 = CtypesSession(pythonversion = "3.6.8")
I have upgraded to the latest develop
version. Nothing changes regarding the reliability of the job. Using the dummy call still increases the reliability drastically.
But I got additional output when a job fails. It's not send using the default logging of zugbruecke, it's just plain, at the beginning:
[...]
Fatal Python error: Py_Initialize: can't initialize sys standard streams
OSError: [WinError 6] Invalid handle
Current thread 0x00000009 (most recent call first):
[...]
Traceback (most recent call last):
[...]
import zugbruecke.ctypes as ctypes
File "~/.local/lib/python3.6/site-packages/zugbruecke/ctypes/__init__.py", line 44, in <module>
_session = _CtypesSession()
File "~/.local/lib/python3.6/site-packages/zugbruecke/core/session.py", line 129, in __init__
self._current_session = SessionClient(config=Config(**kwargs))
File "~/.local/lib/python3.6/site-packages/zugbruecke/core/session_client.py", line 124, in __init__
self._wait_for_server_status_change(target_status=True)
File "~/.local/lib/python3.6/site-packages/zugbruecke/core/session_client.py", line 339, in _wait_for_server_status_change
raise TimeoutError("session server did not appear")
TimeoutError: session server did not appear
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "~/.local/lib/python3.6/site-packages/zugbruecke/core/session_client.py", line 257, in terminate
self._terminate_on_server()
AttributeError: 'SessionClient' object has no attribute '_terminate_on_server'
[...]
(1638369973.77/224c5f24) UNIX o: [session-client] STARTING ...
(1638369973.78/224c5f24) UNIX o: [session-client] Configured Wine-Python version is 3.7.4.stable for win32.
(1638369973.78/224c5f24) UNIX o: [session-client] Log socket port: 46602.
(1638369973.84/224c5f24) UNIX o: [interpreter] STARTING ...
(1638369973.84/224c5f24) UNIX o: [interpreter] Command: -m zugbruecke._server_ --id 224c5f24 --port_socket_wine 42637 --port_socket_unix 46602 --log_level 10 --log_write 1 --timeout_start 30
(1638369973.85/224c5f24) UNIX o: [interpreter] Environment: {[...], 'SLURM_MEM_PER_NODE': '25000', '_': '/usr/bin/python3', 'ZUGBRUECKE_LOG_WRITE': 'True', 'ZUGBRUECKE_LOG_LEVEL': '10', 'WENV_ARCH': 'win32', 'WENV_WINE_BIN_WIN32': 'wine', 'WENV_WINE_BIN_WIN64': 'wine64', 'WENV_WINE_BIN_ARM64': 'wine', 'WENV_WINEDEBUG': '-all', 'WENV_WINEINSTALLPREFIX': 'None', 'WENV_PREFIX': '~/.local', 'WENV_WINEPREFIX': '~/.local/share/wenv/win32', 'WENV_PYTHONPREFIX': '~/.local/share/wenv/win32/drive_c/python-3.6.8.stable', 'WENV_OFFLINE': 'False', 'WENV_CACHE': '~/.local/share/wenv/cache', 'WENV_PACKAGES': '~/.local/share/wenv/cache/packages', 'WENV__ISSUES_50_WORKAROUND': 'False'}
(1638369973.85/224c5f24) UNIX o: [interpreter] Fire up dummy process
(1638369975.13/224c5f24) UNIX o: [interpreter] Started with PID 18935.
(1638369975.15/224c5f24) UNIX o: [interpreter] Starting stream reader threads ...
(1638369975.15/224c5f24) UNIX o: [interpreter] Stream reader threads started.
(1638369975.15/224c5f24) UNIX o: [interpreter] Starting log processing thread ...
(1638369975.15/224c5f24) UNIX o: [interpreter] Log processing thread started.
(1638369975.16/224c5f24) UNIX o: [interpreter] STARTED.
(1638369975.16/224c5f24) UNIX o: [session-client] Waiting for session-server to be up ...
(1638369977.02/224c5f24) WINE o: [session-server] STARTING ...
(1638370005.18/224c5f24) UNIX o: [session-client] ... wait timed out (after 30.01 seconds).
(1638370010.21/224c5f24) UNIX o: [session-client] TERMINATING ...
PS: After upgrading I still hat to do the changes mention here.
Or with an additional error output from zugbruecke
Fatal Python error: Py_Initialize: can't initialize sys standard streams
OSError: [WinError 6] Invalid handle
Current thread 0x00000009 (most recent call first):
(1638374394.69/4ff44657) UNIX e: [P] Fatal Python error: failed to get random numbers to initialize Python
(1638374394.69/4ff44657) UNIX e: [P]
Ok, we're getting closer. This is something that I have only seen on Ubuntu 14.04 / Trusty Tahr. It's related to an "outdated" library, if I am not mistaken, and some issues related to symlinks, but I never fully identified the root cause.
See #50. For a while, I had a work-around option named _issues_50_workaround
in wenv
. I thought I did not need it anymore, so I removed it ... I guess I should add it again. Is this Wine 4 or even earlier?
Yes, it's wine 4.0.4
Ok, even closer. Can you try to upgrade to at least Wine 4.16 (or 5 or even 6)? Alternatively (or in addition), can you try to install zugbruecke
into a path where no single element (folder name etc) starts with a dot (.
)? In your case, it is sitting underneath .local
, which is a Unix-hidden folder, i.e. starting with a dot.
Sorry, at least Wine 4.17.
Ok, updating certifi
helped a lot. Initially it was version 2019.11.28
. After updating I have 2021.10.08
and much more jobs are starting (without using the dummy process).
Updating wine is not that easy to me. I would have to ask and it has already been much work to install this version wine.
Just looked up your CentOS version (7.9). Its tool-chain dates to 2014, just like in Trusty Tahr, so getting a younger version of Wine wont do the trick (it will even be prohibitively complicated to make it run in the first place). This also explains why certifi
helps - your systems' SSL certificate stores are outdated, mildly put.
Test the thing with the dots in the path. If this does not work, I have one more idea. zugbruecke
uses symlinks. One can remove them and use hard links or even copies of certain folders ... It's not ideal ...
Just moving the package directory of zugbruecke
did not work directly. Seems I need to modify some paths but do not have any more time now. I'll do it later.
Thanks a lot for all the help.
I just replicated part of your setup in a VM, CentOS 7.9 x86_64, python 3.6.8, Wine 4.0.4. It's an old system, lots of stuff to work around, and distinct issues in zugbruecke if I see correctly. I will report back to you when I have worked my way through them.
I did some test without any dot in the package path of zugbruecke
and wenv
and no dots in the WENV_WINEPREFIX
. No major difference regarding the reliability.
However, I found a discrepancy in the python version:
>>> ctypes.zb_get_parameter('pythonversion')
<Python 3.7.4.stable (win32)>
>>> subprocess.check_output(['wenv', 'python', '-c', 'import sys; print(sys.version)'])
b'3.6.8 (tags/v3.6.8:3c6b436a57, Dec 23 2018, 23:31:17) [MSC v.1916 32 bit (Intel)]\r\n'
Hello all, I have a Windows DLL file running on a CentOS host using zurbruecke. Unfortunately I very often get a time out when starting a session. Is there a good way to find the reason for this? The same setup on OpenSUSE Leap is much more stable.
The logging of zugbrucke is already active:
A WINE_out.txt is not written when timed out. Only if start up was successful, see below.