pysathq / pysat

A toolkit for SAT-based prototyping in Python
https://pysathq.github.io/
MIT License
391 stars 71 forks source link

Ctrl+C causes Python interpreter with pysat loaded to crash not gracefully yield an exception #53

Open GregoryMorse opened 4 years ago

GregoryMorse commented 4 years ago

Any time Ctrl+C is used to interrupt running code, it will throw a KeyboardInterrupt. But something is going on after doing SAT solving which causes it to not work all of the time and instead just exit the interpreter. Either this indicates Ctrl+C is being explicitly caught and inappropriately dealt with or some type of cleanup code is probably cause a crash.

Obviously important is that this is being run on Windows as Linux can have different behavior in this area as it uses a signal handling system which is a bit different. My event viewer clearly shows something is causing crashes and it never happens if I do not load pysat:

Faulting application name: python.exe, version: 3.8.2150.1013, time stamp: 0x5e55a7f0
Faulting module name: ntdll.dll, version: 10.0.18362.815, time stamp: 0xb29ecf52
Exception code: 0xc0000028
Fault offset: 0x00000000000fbf18
Faulting process id: 0x12448
Faulting application start time: 0x01d65234e08c1176
Faulting application path: C:\Program Files\Python38\python.exe
Faulting module path: C:\WINDOWS\SYSTEM32\ntdll.dll
Report Id: 75efd8f9-c92c-410e-9acc-a00aedd6f2b5
Faulting package full name: 
Faulting package-relative application ID: 

Since 0xc0000028 is a stack corruption issue, and everything else I am using is pure python, it implies that the compiled C code is responsible for this... Python 3.8 almost never crashes even under absurd memory load, etc.

I can attach a debugger and get a stack trace shortly as it will help narrow down which C code is the culprit.

  File "C:\Program Files\Python38\lib\site-packages\pysat\examples\lbx.py", line 287, in enumerate
    mcs = self.compute()
  File "C:\Program Files\Python38\lib\site-packages\pysat\examples\lbx.py", line 271, in compute
    self._compute()
  File "C:\Program Files\Python38\lib\site-packages\pysat\examples\lbx.py", line 390, in _compute
    self.do_cld_check(self.setd[i:])
  File "C:\Program Files\Python38\lib\site-packages\pysat\examples\lbx.py", line 441, in do_cld_check
    self._filter_satisfied(update_setd=True)
  File "C:\Program Files\Python38\lib\site-packages\pysat\examples\lbx.py", line 355, in _filter_satisfied
    if not self.satc[i]:
  File "C:\Program Files\Python38\lib\site-packages\pysat\examples\lbx.py", line 355, in _filter_satisfied
    if not self.satc[i]:
KeyboardInterrupt
The program '[76260] python.exe' has exited with code -1073741784 (0xc0000028).

This is not particularly helpful beyond showing it definitely is some type of memory mismanagement in the object deletion chain for LBX. Obviously something is not cleaning itself up properly and causing crashes.

alexeyignatiev commented 4 years ago

Yep, that's a problem. It used to work properly a few months ago but now this is what happens. I need to investigate what causes it.

alexeyignatiev commented 4 years ago

OK, I think this concrete issue seems to be caused by the signal handling, which works fine on other systems, e.g. on Linux or MacOS. Since I don't have access to Windows, it is kind of problematic for me to investigate and so to fix it. :)

ghost commented 10 months ago

I have access to Windows, will try to reproduce it.

alexeyignatiev commented 10 months ago

Thanks, @dulanov!

ghost commented 10 months ago

So, pip install python-sat directly doesn't work on Windows by me, I'll analyze it deeper and maybe create a ticket:

...
      C:\Users\ABC\AppData\Local\Temp\pip-install-i8zkex_7\python-sat_d5feacf086464b6a84d8a0a15faaa9e5\solvers\cadical103\internal.hpp(21): fatal error C1083: Datei (Include) kann nicht geöffnet werden: "unistd.h": No such file or directory
      error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\VC\\Tools\\MSVC\\14.38.33130\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
...

Here unistd.h is POSIX related file existed only on UNIX systems, not on Windows.

But python setup.py install works, so I download some benchmarks from SAT Competition 2023, exactly this one and after some minutes of execution try to break it with Ctrl+C:

(.venv) PS C:\Workspace\References\pysathq-pysat\examples> python.exe lbx.py -d -e all -s glucose3 -vv 004b0f451f7d96f6a572e9e76360f51a-spg_420_280.cnf.xz
Traceback (most recent call last):
  File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 584, in <module>
    for i, mcs in enumerate(mcsls.enumerate()):
  File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 301, in enumerate
    mcs = self.compute()
          ^^^^^^^^^^^^^^
  File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 284, in compute
    self._filter_satisfied(update_setd=True)
  File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 374, in _filter_satisfied
    setd = setd.union(set(cl))
    ^^^^
KeyboardInterrupt

@GregoryMorse It seems to me that this problem no longer exists.

alexeyignatiev commented 10 months ago

Installation from pip should work on the platforms which I compiled it for (there are a few binary wheels provided).

As for the ctrl+c issue, it got interrupted while in Python (not the "solvers" C extension). By any chance, can you try it on hard CNF instances where a SaT call takes a lot of time so that you interrupt it during a SAT call?

ghost commented 10 months ago

there are a few binary wheels provided

This is strange because pip install imidiatelly start to recompile all dependencies from source.

ERROR: Could not build wheels for python-sat, which is required to install pyproject.toml-based projects

Building wheels for collected packages: python-sat Building wheel for python-sat (pyproject.toml): started Building wheel for python-sat (pyproject.toml): finished with status 'error'

May be instead of build should be install from, I also can fix it, no problem.

log.txt

ghost commented 10 months ago

By any chance, can you try it on hard CNF instances where a SaT call takes a lot of time so that you interrupt it during a SAT call?

It was exactly hard CNF formula (89Mb) and I tried several times, but I will try to more deeply analyze the moment of interrupt, ok.

alexeyignatiev commented 10 months ago

there are a few binary wheels provided

This is strange because pip install imidiatelly start to recompile all dependencies from source.

ERROR: Could not build wheels for python-sat, which is required to install pyproject.toml-based projects

log.txt

Well, I guess it means you are using a different platform. 🙂

alexeyignatiev commented 10 months ago

By any chance, can you try it on hard CNF instances where a SaT call takes a lot of time so that you interrupt it during a SAT call?

It was exactly hard CNF formula (89Mb) and I tried several times, but I will try to more deeply analyze the moment of interrupt, ok.

Sure, but the traceback log you attached clearly shows the execution was interrupted in Python. It would help to catch it when it is inside one of the solver's code.

ghost commented 10 months ago

I see, 3.12 isn't supported yet.

image

image

alexeyignatiev commented 10 months ago

Yes, it's not yet supported. It can be supported if @rjungbeck can compile pypblib for Python 3.12. 🙂

ghost commented 10 months ago

be supported if @rjungbeck can compile pypblib for Python 3.12

May be a good task to automate it using GitHub Actions ;)

ghost commented 10 months ago

By any chance, can you try it on hard CNF instances where a SaT call takes a lot of time so that you interrupt it during a SAT call?

It was exactly hard CNF formula (89Mb) and I tried several times, but I will try to more deeply analyze the moment of interrupt, ok.

Sure, but the traceback log you attached clearly shows the execution was interrupted in Python. It would help to catch it when it is inside one of the solver's code.

I found pretty simple but not trivial hard CNF borrowed from https://github.com/arminbiere/satch repository and this error now is reproducable:

(.venv) PS C:\Workspace\References\pysathq-pysat\examples> python.exe lbx.py -d -e all -s glucose3 -vv .\cnfs\prime65537.cnf
(.venv) PS C:\Workspace\References\pysathq-pysat\examples> $LASTEXITCODE
-1073741784

On NT family (Windows) it means:

from ntstatus.h:

//
// MessageId: STATUS_BAD_STACK
//
// MessageText:
//
// An invalid or unaligned stack was encountered during an unwind operation.
//
#define STATUS_BAD_STACK                 ((NTSTATUS)0xC0000028L)

in WSL2 (Ubuntu 22.04 LTS) similar situation took:

du@ABC:/mnt/c/Workspace/References/pysathq-pysat/examples$ python3 lbx.py -d -e all -s glucose3 -vv cnfs/prime65537.cnf
^CTraceback (most recent call last):
  File "/mnt/c/Workspace/References/pysathq-pysat/examples/lbx.py", line 584, in <module>
    for i, mcs in enumerate(mcsls.enumerate()):
  File "/mnt/c/Workspace/References/pysathq-pysat/examples/lbx.py", line 301, in enumerate
    mcs = self.compute()
  File "/mnt/c/Workspace/References/pysathq-pysat/examples/lbx.py", line 285, in compute
    self._compute()
  File "/mnt/c/Workspace/References/pysathq-pysat/examples/lbx.py", line 408, in _compute
    if self.oracle.solve(assumptions=self.ss_assumps + self.bb_assumps + [self.setd[i]]):
  File "/home/du/.local/lib/python3.10/site-packages/pysat/solvers.py", line 564, in solve
    return self.solver.solve(assumptions)
  File "/home/du/.local/lib/python3.10/site-packages/pysat/solvers.py", line 2827, in solve
    self.status = pysolvers.glucose3_solve(self.glucose, assumptions,
pysolvers.error: Caught keyboard interrupt
du@ABC:/mnt/c/Workspace/References/pysathq-pysat/examples$ echo $?
1
alexeyignatiev commented 10 months ago

Thanks, @dulanov. It would be nice to understand what causes this on Windows. I understand it may be difficult.

GregoryMorse commented 10 months ago

I'd guess the signal handler code is buggy on Windows and crashes. The easiest way to track the issue down if being able to reproduce it is to attach a debugger and see where it crashes. It could potentially be a concurrent programming issue and exist on Linux as well just be highly unlikely there by how signal handler is scheduled.

GregoryMorse commented 10 months ago

Also this looks like a bug in glucose from the Python stack trace.

ghost commented 10 months ago

Also this looks like a bug in glucose from the Python stack trace.

Interesting, will try with more fresh Glucose (4+) and also if you right, KeyboardInterrupt should work with other solvers.

rjungbeck commented 10 months ago

be supported if @rjungbeck can compile pypblib for Python 3.12

May be a good task to automate it using GitHub Actions ;)

The build process itself is automated via AppVeyor (similar to Github Axctions) . But it requires that the new target environment (Python 3.12) is configured in appveyor.yml, which must be done manually.

rjungbeck commented 10 months ago

Yes, it's not yet supported. It can be supported if @rjungbeck can compile pypblib for Python 3.12. 🙂

I have made a Python 3.12 version of pypblib (https://github.com/rjungbeck/pypblib/releases/download/pypblib-v1.0.439/pypblib-0.0.4-cp312-cp312-win_amd64.wh)

alexeyignatiev commented 10 months ago

Thanks so much, @rjungbeck! I've now included this configuration in appveyor/instdep.py.

ghost commented 10 months ago

Also this looks like a bug in glucose from the Python stack trace.

Interesting, will try with more fresh Glucose (4+) and also if you right, KeyboardInterrupt should work with other solvers.

Error is reproducable with all solvers including minisat (by default).

I found a place in solvers/pysolvers.cc and after local change in method py_glucose3_solve start to receive KeyboardInterrupt also on Windows:

image

(.venv) PS C:\Workspace\References\pysathq-pysat> python.exe .\examples\lbx.py -d -e all -s glucose3 -vv .\examples\cnfs\prime65537.cnf
Traceback (most recent call last):
  File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 584, in <module>
    for i, mcs in enumerate(mcsls.enumerate()):
  File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 301, in enumerate
    mcs = self.compute()
          ^^^^^^^^^^^^^^
  File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 285, in compute
    self._compute()
  File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 408, in _compute
    if self.oracle.solve(assumptions=self.ss_assumps + self.bb_assumps + [self.setd[i]]):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Workspace\References\pysathq-pysat\.venv\Lib\site-packages\python_sat-0.1.8.dev12-py3.12-win-amd64.egg\pysat\solvers.py", line 564, in solve 
    return self.solver.solve(assumptions)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Workspace\References\pysathq-pysat\.venv\Lib\site-packages\python_sat-0.1.8.dev12-py3.12-win-amd64.egg\pysat\solvers.py", line 2827, in solve
    self.status = pysolvers.glucose3_solve(self.glucose, assumptions,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
(.venv) PS C:\Workspace\References\pysathq-pysat> $LASTEXITCODE
-1073741510

where exit Code 0xC000013A exactly means that the application terminated as a result of a CTRL+C.

Without this change glucose4 still exit silently:

(.venv) PS C:\Workspace\References\pysathq-pysat> python.exe .\examples\lbx.py -d -e all -s glucose4 -vv .\examples\cnfs\prime65537.cnf
(.venv) PS C:\Workspace\References\pysathq-pysat> $LASTEXITCODE
-1073741784

P.S. File pysolvers.cc contains a large amount of duplicate code for each solver and is an excellent candidate for a code generator, I will not change it yet, so as not to break the logic on other platforms.

GregoryMorse commented 10 months ago

Also this looks like a bug in glucose from the Python stack trace.

Interesting, will try with more fresh Glucose (4+) and also if you right, KeyboardInterrupt should work with other solvers.

Error is reproducable with all solvers including minisat (by default).

I found a place in solvers/pysolvers.cc and after local change in method py_glucose3_solve start to receive KeyboardInterrupt also on Windows:

image

(.venv) PS C:\Workspace\References\pysathq-pysat> python.exe .\examples\lbx.py -d -e all -s glucose3 -vv .\examples\cnfs\prime65537.cnf
Traceback (most recent call last):
  File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 584, in <module>
    for i, mcs in enumerate(mcsls.enumerate()):
  File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 301, in enumerate
    mcs = self.compute()
          ^^^^^^^^^^^^^^
  File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 285, in compute
    self._compute()
  File "C:\Workspace\References\pysathq-pysat\examples\lbx.py", line 408, in _compute
    if self.oracle.solve(assumptions=self.ss_assumps + self.bb_assumps + [self.setd[i]]):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Workspace\References\pysathq-pysat\.venv\Lib\site-packages\python_sat-0.1.8.dev12-py3.12-win-amd64.egg\pysat\solvers.py", line 564, in solve 
    return self.solver.solve(assumptions)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Workspace\References\pysathq-pysat\.venv\Lib\site-packages\python_sat-0.1.8.dev12-py3.12-win-amd64.egg\pysat\solvers.py", line 2827, in solve
    self.status = pysolvers.glucose3_solve(self.glucose, assumptions,
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
(.venv) PS C:\Workspace\References\pysathq-pysat> $LASTEXITCODE
-1073741510

where exit Code 0xC000013A exactly means that the application terminated as a result of a CTRL+C.

Without this change glucose4 still exit silently:

(.venv) PS C:\Workspace\References\pysathq-pysat> python.exe .\examples\lbx.py -d -e all -s glucose4 -vv .\examples\cnfs\prime65537.cnf
(.venv) PS C:\Workspace\References\pysathq-pysat> $LASTEXITCODE
-1073741784

P.S. File pysolvers.cc contains a large amount of duplicate code for each solver and is an excellent candidate for a code generator, I will not change it yet, so as not to break the logic on other platforms.

Disabling the signal handler causes a pass through but it's not a good solution at all. Obviously something with the use of PyOS_setsig is the issue.

One known issue is use of import_array() for numpy. Apparently it sets up signal handlers and can throw off the expected behavior.

With a simple debug build and debugging session this is an easy solve. Just pinpoint the root cause of the crash and it won't be hard to back track abd figure out the code issue. So this discussion seems a bit silly. Disabling the signal handler is not a fix. This should be a portable Windows supported API. Some small detail is misimplemented. It's likely buggy on Linux too, it's just there for whatever reason it doesn't crash. This is with high likelihood a universal bug from a clean coding practice standpoint. Microsoft tends to be more standards compliant these days than anyone else. So stability on Windows or MSVC means stability on Linux or GCC. But the reverse is never true. In fact, despite all the hate towards Microsoft and praise towards Linux over the years, the best programmers are invariably always the ones who can write portable code. So again to conclude, this project has poor signal handling code which requires manual debugging to solve.

ghost commented 10 months ago

@GregoryMorse I ended up in a rabbit hole ;) Apparently, we are dealing with the standard problem of processing kernel signals using setjmp / longjmp. The behavior in this case is non-deterministic and varies from platform to platform because Invoking longjmp from a nested signal handler is undefined (from Wikipedia).

[1] https://stackoverflow.com/questions/7334595/longjmp-out-of-signal-handler [2] https://mingw-users.narkive.com/mRVHRtGT/longjmp-in-signal-handler

GregoryMorse commented 10 months ago

@GregoryMorse I ended up in a rabbit hole ;) Apparently, we are dealing with the standard problem of processing kernel signals using setjmp / longjmp. The behavior in this case is non-deterministic and varies from platform to platform because Invoking longjmp from a nested signal handler is undefined (from Wikipedia).

[1] https://stackoverflow.com/questions/7334595/longjmp-out-of-signal-handler [2] https://mingw-users.narkive.com/mRVHRtGT/longjmp-in-signal-handler

Very nice find. So undefined behavior seems to work on Linux not on Windows. That is why I often consider source quality as higher when a shared source works on both which is the case for this repo.

So it should be redesigned to not use long jump it sounds like. I generally think setting a flag in the signal handler and periodically checking the flag in all potentially long running operations is the safest way to go.

ghost commented 10 months ago

@alexeyignatiev looks like wrong usage of longjump from sighandler on sigaltstack, but it was introduced some years ago with https://github.com/pysathq/pysat/commit/cf182bd8dc558d044868f8a7849147155ac0e1d6; POSIX.1 does not specify whether setjmp and longjmp save and restore the current set of blocked signals; if a program employs signal handling it should use POSIX's sigsetjmp/siglongjmp. - https://en.wikipedia.org/wiki/Setjmp.h

alexeyignatiev commented 10 months ago

Thanks for the finding, @dulanov. The question is why I decided to replace POSIX functions that I first used with C standard library alternative ones. 🙂 May 2018 was so long ago (even before a working Windows version) that it looks hard to understand now but there must have been a justification.