Open GregoryMorse opened 4 years ago
Yep, that's a problem. It used to work properly a few months ago but now this is what happens. I need to investigate what causes it.
OK, I think this concrete issue seems to be caused by the signal handling, which works fine on other systems, e.g. on Linux or MacOS. Since I don't have access to Windows, it is kind of problematic for me to investigate and so to fix it. :)
I have access to Windows, will try to reproduce it.
Thanks, @dulanov!
So, pip install python-sat
directly doesn't work on Windows by me, I'll analyze it deeper and maybe create a ticket:
...
C:\Users\ABC\AppData\Local\Temp\pip-install-i8zkex_7\python-sat_d5feacf086464b6a84d8a0a15faaa9e5\solvers\cadical103\internal.hpp(21): fatal error C1083: Datei (Include) kann nicht geöffnet werden: "unistd.h": No such file or directory
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.38.33130\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
...
Here unistd.h
is POSIX related file existed only on UNIX systems, not on Windows.
But python setup.py install
works, so I download some benchmarks from SAT Competition 2023, exactly this one and after some minutes of execution try to break it with Ctrl+C
:
(.venv) PS C:\Workspace\References\pysathq-pysat\examples> python.exe lbx.py -d -e all -s glucose3 -vv 004b0f451f7d96f6a572e9e76360f51a-spg_420_280.cnf.xz
Traceback (most recent call last):
File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 584, in <module>
for i, mcs in enumerate(mcsls.enumerate()):
File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 301, in enumerate
mcs = self.compute()
^^^^^^^^^^^^^^
File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 284, in compute
self._filter_satisfied(update_setd=True)
File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 374, in _filter_satisfied
setd = setd.union(set(cl))
^^^^
KeyboardInterrupt
@GregoryMorse It seems to me that this problem no longer exists.
Installation from pip should work on the platforms which I compiled it for (there are a few binary wheels provided).
As for the ctrl+c issue, it got interrupted while in Python (not the "solvers" C extension). By any chance, can you try it on hard CNF instances where a SaT call takes a lot of time so that you interrupt it during a SAT call?
there are a few binary wheels provided
This is strange because pip install
imidiatelly start to recompile all dependencies from source.
ERROR: Could not build wheels for python-sat, which is required to install pyproject.toml-based projects
Building wheels for collected packages: python-sat Building wheel for python-sat (pyproject.toml): started Building wheel for python-sat (pyproject.toml): finished with status 'error'
May be instead of build should be install from, I also can fix it, no problem.
By any chance, can you try it on hard CNF instances where a SaT call takes a lot of time so that you interrupt it during a SAT call?
It was exactly hard CNF formula (89Mb) and I tried several times, but I will try to more deeply analyze the moment of interrupt, ok.
there are a few binary wheels provided
This is strange because
pip install
imidiatelly start to recompile all dependencies from source.
ERROR: Could not build wheels for python-sat, which is required to install pyproject.toml-based projects
Well, I guess it means you are using a different platform. 🙂
By any chance, can you try it on hard CNF instances where a SaT call takes a lot of time so that you interrupt it during a SAT call?
It was exactly hard CNF formula (89Mb) and I tried several times, but I will try to more deeply analyze the moment of interrupt, ok.
Sure, but the traceback log you attached clearly shows the execution was interrupted in Python. It would help to catch it when it is inside one of the solver's code.
I see, 3.12 isn't supported yet.
Yes, it's not yet supported. It can be supported if @rjungbeck can compile pypblib for Python 3.12. 🙂
be supported if @rjungbeck can compile pypblib for Python 3.12
May be a good task to automate it using GitHub Actions ;)
By any chance, can you try it on hard CNF instances where a SaT call takes a lot of time so that you interrupt it during a SAT call?
It was exactly hard CNF formula (89Mb) and I tried several times, but I will try to more deeply analyze the moment of interrupt, ok.
Sure, but the traceback log you attached clearly shows the execution was interrupted in Python. It would help to catch it when it is inside one of the solver's code.
I found pretty simple but not trivial hard CNF borrowed from https://github.com/arminbiere/satch repository and this error now is reproducable:
(.venv) PS C:\Workspace\References\pysathq-pysat\examples> python.exe lbx.py -d -e all -s glucose3 -vv .\cnfs\prime65537.cnf
(.venv) PS C:\Workspace\References\pysathq-pysat\examples> $LASTEXITCODE
-1073741784
On NT family (Windows) it means:
from ntstatus.h:
//
// MessageId: STATUS_BAD_STACK
//
// MessageText:
//
// An invalid or unaligned stack was encountered during an unwind operation.
//
#define STATUS_BAD_STACK ((NTSTATUS)0xC0000028L)
in WSL2 (Ubuntu 22.04 LTS) similar situation took:
du@ABC:/mnt/c/Workspace/References/pysathq-pysat/examples$ python3 lbx.py -d -e all -s glucose3 -vv cnfs/prime65537.cnf
^CTraceback (most recent call last):
File "/mnt/c/Workspace/References/pysathq-pysat/examples/lbx.py", line 584, in <module>
for i, mcs in enumerate(mcsls.enumerate()):
File "/mnt/c/Workspace/References/pysathq-pysat/examples/lbx.py", line 301, in enumerate
mcs = self.compute()
File "/mnt/c/Workspace/References/pysathq-pysat/examples/lbx.py", line 285, in compute
self._compute()
File "/mnt/c/Workspace/References/pysathq-pysat/examples/lbx.py", line 408, in _compute
if self.oracle.solve(assumptions=self.ss_assumps + self.bb_assumps + [self.setd[i]]):
File "/home/du/.local/lib/python3.10/site-packages/pysat/solvers.py", line 564, in solve
return self.solver.solve(assumptions)
File "/home/du/.local/lib/python3.10/site-packages/pysat/solvers.py", line 2827, in solve
self.status = pysolvers.glucose3_solve(self.glucose, assumptions,
pysolvers.error: Caught keyboard interrupt
du@ABC:/mnt/c/Workspace/References/pysathq-pysat/examples$ echo $?
1
Thanks, @dulanov. It would be nice to understand what causes this on Windows. I understand it may be difficult.
I'd guess the signal handler code is buggy on Windows and crashes. The easiest way to track the issue down if being able to reproduce it is to attach a debugger and see where it crashes. It could potentially be a concurrent programming issue and exist on Linux as well just be highly unlikely there by how signal handler is scheduled.
Also this looks like a bug in glucose from the Python stack trace.
Also this looks like a bug in glucose from the Python stack trace.
Interesting, will try with more fresh Glucose (4+) and also if you right, KeyboardInterrupt
should work with other solvers.
be supported if @rjungbeck can compile pypblib for Python 3.12
May be a good task to automate it using GitHub Actions ;)
The build process itself is automated via AppVeyor (similar to Github Axctions) . But it requires that the new target environment (Python 3.12) is configured in appveyor.yml, which must be done manually.
Yes, it's not yet supported. It can be supported if @rjungbeck can compile pypblib for Python 3.12. 🙂
I have made a Python 3.12 version of pypblib (https://github.com/rjungbeck/pypblib/releases/download/pypblib-v1.0.439/pypblib-0.0.4-cp312-cp312-win_amd64.wh)
Thanks so much, @rjungbeck! I've now included this configuration in appveyor/instdep.py
.
Also this looks like a bug in glucose from the Python stack trace.
Interesting, will try with more fresh Glucose (4+) and also if you right,
KeyboardInterrupt
should work with other solvers.
Error is reproducable with all solvers including minisat (by default).
I found a place in solvers/pysolvers.cc
and after local change in method py_glucose3_solve
start to receive KeyboardInterrupt
also on Windows:
(.venv) PS C:\Workspace\References\pysathq-pysat> python.exe .\examples\lbx.py -d -e all -s glucose3 -vv .\examples\cnfs\prime65537.cnf
Traceback (most recent call last):
File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 584, in <module>
for i, mcs in enumerate(mcsls.enumerate()):
File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 301, in enumerate
mcs = self.compute()
^^^^^^^^^^^^^^
File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 285, in compute
self._compute()
File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 408, in _compute
if self.oracle.solve(assumptions=self.ss_assumps + self.bb_assumps + [self.setd[i]]):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Workspace\References\pysathq-pysat\.venv\Lib\site-packages\python_sat-0.1.8.dev12-py3.12-win-amd64.egg\pysat\solvers.py", line 564, in solve
return self.solver.solve(assumptions)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Workspace\References\pysathq-pysat\.venv\Lib\site-packages\python_sat-0.1.8.dev12-py3.12-win-amd64.egg\pysat\solvers.py", line 2827, in solve
self.status = pysolvers.glucose3_solve(self.glucose, assumptions,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
(.venv) PS C:\Workspace\References\pysathq-pysat> $LASTEXITCODE
-1073741510
where exit Code 0xC000013A
exactly means that the application terminated as a result of a CTRL+C
.
Without this change glucose4
still exit silently:
(.venv) PS C:\Workspace\References\pysathq-pysat> python.exe .\examples\lbx.py -d -e all -s glucose4 -vv .\examples\cnfs\prime65537.cnf
(.venv) PS C:\Workspace\References\pysathq-pysat> $LASTEXITCODE
-1073741784
P.S. File pysolvers.cc
contains a large amount of duplicate code for each solver and is an excellent candidate for a code generator, I will not change it yet, so as not to break the logic on other platforms.
Also this looks like a bug in glucose from the Python stack trace.
Interesting, will try with more fresh Glucose (4+) and also if you right,
KeyboardInterrupt
should work with other solvers.Error is reproducable with all solvers including minisat (by default).
I found a place in
solvers/pysolvers.cc
and after local change in methodpy_glucose3_solve
start to receiveKeyboardInterrupt
also on Windows:(.venv) PS C:\Workspace\References\pysathq-pysat> python.exe .\examples\lbx.py -d -e all -s glucose3 -vv .\examples\cnfs\prime65537.cnf Traceback (most recent call last): File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 584, in <module> for i, mcs in enumerate(mcsls.enumerate()): File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 301, in enumerate mcs = self.compute() ^^^^^^^^^^^^^^ File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 285, in compute self._compute() File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 408, in _compute if self.oracle.solve(assumptions=self.ss_assumps + self.bb_assumps + [self.setd[i]]): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "C:\Workspace\References\pysathq-pysat\.venv\Lib\site-packages\python_sat-0.1.8.dev12-py3.12-win-amd64.egg\pysat\solvers.py", line 564, in solve return self.solver.solve(assumptions) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "C:\Workspace\References\pysathq-pysat\.venv\Lib\site-packages\python_sat-0.1.8.dev12-py3.12-win-amd64.egg\pysat\solvers.py", line 2827, in solve self.status = pysolvers.glucose3_solve(self.glucose, assumptions, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ KeyboardInterrupt (.venv) PS C:\Workspace\References\pysathq-pysat> $LASTEXITCODE -1073741510
where exit Code
0xC000013A
exactly means that the application terminated as a result of aCTRL+C
.Without this change
glucose4
still exit silently:(.venv) PS C:\Workspace\References\pysathq-pysat> python.exe .\examples\lbx.py -d -e all -s glucose4 -vv .\examples\cnfs\prime65537.cnf (.venv) PS C:\Workspace\References\pysathq-pysat> $LASTEXITCODE -1073741784
P.S. File
pysolvers.cc
contains a large amount of duplicate code for each solver and is an excellent candidate for a code generator, I will not change it yet, so as not to break the logic on other platforms.
Disabling the signal handler causes a pass through but it's not a good solution at all. Obviously something with the use of PyOS_setsig is the issue.
One known issue is use of import_array() for numpy. Apparently it sets up signal handlers and can throw off the expected behavior.
With a simple debug build and debugging session this is an easy solve. Just pinpoint the root cause of the crash and it won't be hard to back track abd figure out the code issue. So this discussion seems a bit silly. Disabling the signal handler is not a fix. This should be a portable Windows supported API. Some small detail is misimplemented. It's likely buggy on Linux too, it's just there for whatever reason it doesn't crash. This is with high likelihood a universal bug from a clean coding practice standpoint. Microsoft tends to be more standards compliant these days than anyone else. So stability on Windows or MSVC means stability on Linux or GCC. But the reverse is never true. In fact, despite all the hate towards Microsoft and praise towards Linux over the years, the best programmers are invariably always the ones who can write portable code. So again to conclude, this project has poor signal handling code which requires manual debugging to solve.
@GregoryMorse I ended up in a rabbit hole ;) Apparently, we are dealing with the standard problem of processing kernel signals using setjmp / longjmp
. The behavior in this case is non-deterministic and varies from platform to platform because Invoking longjmp from a nested signal handler is undefined
(from Wikipedia).
[1] https://stackoverflow.com/questions/7334595/longjmp-out-of-signal-handler [2] https://mingw-users.narkive.com/mRVHRtGT/longjmp-in-signal-handler
@GregoryMorse I ended up in a rabbit hole ;) Apparently, we are dealing with the standard problem of processing kernel signals using
setjmp / longjmp
. The behavior in this case is non-deterministic and varies from platform to platform becauseInvoking longjmp from a nested signal handler is undefined
(from Wikipedia).[1] https://stackoverflow.com/questions/7334595/longjmp-out-of-signal-handler [2] https://mingw-users.narkive.com/mRVHRtGT/longjmp-in-signal-handler
Very nice find. So undefined behavior seems to work on Linux not on Windows. That is why I often consider source quality as higher when a shared source works on both which is the case for this repo.
So it should be redesigned to not use long jump it sounds like. I generally think setting a flag in the signal handler and periodically checking the flag in all potentially long running operations is the safest way to go.
@alexeyignatiev looks like wrong usage of longjump
from sighandler on sigaltstack, but it was introduced some years ago with https://github.com/pysathq/pysat/commit/cf182bd8dc558d044868f8a7849147155ac0e1d6; POSIX.1 does not specify whether setjmp and longjmp save and restore the current set of blocked signals; if a program employs signal handling it should use POSIX's sigsetjmp/siglongjmp.
- https://en.wikipedia.org/wiki/Setjmp.h
Thanks for the finding, @dulanov. The question is why I decided to replace POSIX functions that I first used with C standard library alternative ones. 🙂 May 2018 was so long ago (even before a working Windows version) that it looks hard to understand now but there must have been a justification.
Any time Ctrl+C is used to interrupt running code, it will throw a
KeyboardInterrupt
. But something is going on after doing SAT solving which causes it to not work all of the time and instead just exit the interpreter. Either this indicates Ctrl+C is being explicitly caught and inappropriately dealt with or some type of cleanup code is probably cause a crash.Obviously important is that this is being run on Windows as Linux can have different behavior in this area as it uses a signal handling system which is a bit different. My event viewer clearly shows something is causing crashes and it never happens if I do not load pysat:
Since 0xc0000028 is a stack corruption issue, and everything else I am using is pure python, it implies that the compiled C code is responsible for this... Python 3.8 almost never crashes even under absurd memory load, etc.
I can attach a debugger and get a stack trace shortly as it will help narrow down which C code is the culprit.
This is not particularly helpful beyond showing it definitely is some type of memory mismanagement in the object deletion chain for LBX. Obviously something is not cleaning itself up properly and causing crashes.