rigetti / rpcq

The RPC framework and message specification for @rigetti Quantum Cloud Services.
Apache License 2.0
75 stars 29 forks source link

Concurrent logging w/cl-syslog results in memory corruption #92

Open appleby opened 5 years ago

appleby commented 5 years ago

While attempting to run the pyquil test suite in parallel via the pytest-xdist plugin, I noticed occasional "Unhandled memory fault" errors like the following.

Based on the error message, this looked similar to https://github.com/rigetti/qvm/issues/110, so I tried disabling logging in quilc and, sure enough, the errors disappeared.

pyquil/tests/test_operator_estimation.py:1303:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyquil/operator_estimation.py:938: in measure_observables
    calibr_results, d_calibr_qub_idx = _exhaustive_symmetrization(qc, qubs_calibr, calibr_shots, calibr_prog)
pyquil/operator_estimation.py:1083: in _exhaustive_symmetrization
    total_prog_symm_native = qc.compiler.quil_to_native_quil(total_prog_symm)
pyquil/api/_error_reporting.py:238: in wrapper
    val = func(*args, **kwargs)
pyquil/api/_compiler.py:340: in quil_to_native_quil
    response = self.client.call('quil_to_native_quil', request, protoquil=protoquil).asdict()  # type: Dict
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <rpcq._client.Client object at 0x1235b57f0>, method_name = 'quil_to_native_quil', rpc_timeout = 10
args = (NativeQuilRequest(_type='NativeQuilRequest', quil='PRAGMA READOUT-POVM 0 "(0.95 0.18000000000000005 0.050000000000000...]\nMEASURE 0 ro[0]\n', target_device=TargetDevice(_type='TargetDevice', isa={'1Q': {'0': {}}, '2Q': {}}, specs=None)),)
kwargs = {'protoquil': None}
request = RPCRequest(_type='RPCRequest', id='bc69a5f1-57e2-47f5-96af-0eab678bebc4', jsonrpc='2.0', method='quil_to_native_quil',...nMEASURE 0 ro[0]\n', target_device=TargetDevice(_type='TargetDevice', isa={'1Q': {'0': {}}, '2Q': {}}, specs=None)),)})
start_time = 1567443910.04178, timeout = 9999.999046325684
raw_reply = b'\x85\xa7jsonrpc\xa32.0\xa5error\xbeUnhandled memory fault at #x0.\xa2id\xda\x00$bc69a5f1-57e2-47f5-96af-0eab678bebc4\xa8warnings\x90\xa5_type\xa8RPCError'
reply = RPCError(error='Unhandled memory fault at #x0.', id='bc69a5f1-57e2-47f5-96af-0eab678bebc4', jsonrpc='2.0', warnings=[])

The work-around in qvm-app was to add a WITH-LOCKED-LOG macro and use it to acquire a global lock around any locking calls.

In the case of RPCQ, it's not so simple since the logger instance is passed in by the caller of RPCQ:START-SEVER.

Ideally, this would be resolved in CL-SYSLOG, if possible, but we might want to implement a workaround in QUILC/RPCQ in case that turns out to be impossible / impractical / slow to get merged.

Here is a minimal-ish testcase that reproduces the issue.

(ql:quickload :rpcq)

(defun test-method ()
  "hey")

(let* ((number-of-workers 4)
       (addr (format nil "inproc://~a" (uuid:make-v4-uuid)))
       (server-function
         (lambda ()
           (let ((dt (rpcq:make-dispatch-table)))
             (rpcq:dispatch-table-add-handler dt 'test-method)
             (rpcq:start-server :dispatch-table dt
                                :listen-addresses (list addr)
                                :logger (make-instance 'cl-syslog:rfc5424-logger
                                                       :app-name "logtest"
                                                       :facility ':local0
                                                       :maximum-priority ':debug
                                                       :log-writer
                                                       #-windows (cl-syslog:tee-to-stream
                                                                  (cl-syslog:syslog-log-writer "logtest" :local0)
                                                                  *error-output*))))))
       (server-thread (bt:make-thread server-function)))
  (sleep 1)
  (let ((threads '()))
    (unwind-protect
         (loop :repeat number-of-workers :do
           (push (bt:make-thread (lambda ()
                                   (loop :repeat 20 :do
                                     (rpcq:with-rpc-client (client addr)
                                       (rpcq:rpc-call client "test-method")))))
                 threads))
      (progn
        (dolist (thread threads)
          (bt:join-thread thread))
        (bt:destroy-thread server-thread)))))
appleby commented 5 years ago

Simpler reproduction that doesn't involve RPCQ or CL-SYSLOG:RFC5424-LOGGER, just CL-SYSLOG:LOG.

(let ((threads '()))
  (unwind-protect
       (loop :repeat 4 :do
         (push (bt:make-thread (lambda ()
                                 (loop :repeat 20 :do
                                   (cl-syslog:log "foo" ':user ':info "hello"))))
               threads))
    (dolist (thread threads)
      (bt:join-thread thread))))
appleby commented 5 years ago

The plot thickens! That last test does not reproduce on linux laptop (with older SBCL). My linux laptop is only dual-core (4x logical) though, so maybe that's the reason.

I'll try installing a linux VM on my macbook and see if it reproduces there.

appleby commented 5 years ago

Neither of the above tests reproduce on the latest rigetti/rpcq docker image running on my macbook, which reports:

* (qvm::count-logical-cores)
8
* (software-type)
"Linux"
* (software-version)
"4.9.184-linuxkit"
appleby commented 5 years ago

man 3 asl on my macbook says

The state information in a client handle is not protected by locking or thread synchronization mechanisms, except for one special case where NULL is used as a client handle. That special case is described below.

It is not safe for two or more threads to use a single client handle simultaneously. Multi-threaded applications should generally create one client handle for each thread or serial dispatch queue that logs messages. A client handle may only be safely shared amongst multiple threads if the application uses locks or some synchronization strategy to ensure single-threaded access.

As a special case, the ASL library allows the use of NULL in place of a client handle. In this case, the library uses an internal structure which contains its own lock. Multiple threads may safely use NULL in place of an ASL client handle, although there may be contention for the lock.

Not sure I am looking in the correct place (or correct version), but Apple's syslog wrapper here appears to maintain a global asl client and doesn't appear to add any locking around it for syslog calls.

appleby commented 5 years ago

Backtrace of failure on Mac OS from the bordeaux-threads-only testcase:

Unhandled memory fault at #x14.
   [Condition of type SB-SYS:MEMORY-FAULT-ERROR]

Restarts:
 0: [ABORT] abort thread (#<THREAD "Anonymous thread" RUNNING {10029A8093}>)

Backtrace:
  0: ("bogus stack frame")
  1: ("foreign function: asl_msg_merge")
  2: ("foreign function: _asl_send_message")
  3: ("foreign function: _asl_lib_vlog")
  4: ("foreign function: _vsyslog")
  5: ("foreign function: syslog")
  6: (CL-SYSLOG:LOG "ZONK" :USER :INFO "ZONK" 8)
  7: ((LAMBDA NIL :IN #:DROP-THRU-TAG-1))
  8: ((LAMBDA NIL :IN BORDEAUX-THREADS::BINDING-DEFAULT-SPECIALS))
  9: ((FLET SB-UNIX::BODY :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE))
 10: ((FLET "WITHOUT-INTERRUPTS-BODY-4" :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE))
 11: ((FLET SB-THREAD::WITH-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE))
 12: ((FLET "WITHOUT-INTERRUPTS-BODY-1" :IN SB-THREAD::CALL-WITH-MUTEX))
 13: (SB-THREAD::CALL-WITH-MUTEX #<CLOSURE (FLET SB-THREAD::WITH-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE) {760DD9B}> #<SB-THREAD:MUTEX "thread result lock" owner: #<SB-THREAD:THREAD "..
 14: (SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE #<SB-THREAD:THREAD "Anonymous thread" RUNNING {10029A8093}> NIL #<CLOSURE (LAMBDA NIL :IN BORDEAUX-THREADS::BINDING-DEFAULT-SPECIALS) {10029A803B}> NIL)
 15: ("foreign function: call_into_lisp")
 16: ("foreign function: new_thread_trampoline")
 17: ("foreign function: _pthread_body")
 18: ("foreign function: _pthread_start")
 19: ("foreign function: thread_start")
appleby commented 5 years ago

Dumping state before context-switch. Need some time to ponder the correct fix.

Long story short: it looks like if you want thread-safe syslogging on macOS, you have to do the locking yourself. You thought POSIX would protect you? Muuuhahahaa. Fools!

Strangely openlog, closelog, and setlogmask do manage the locking for you...not sure why syslog was left out of the pthread_mutex_lock party.

Silver lining: you can get a thread-safe (I think) syslog(3) compatible interface by calling _syslog$DARWIN_EXTSN rather than plain old _syslog. If you write a C program on macOS that calls syslog, the compiler does some define + __asm magic and arranges to tell the assembler that even though you spelled it syslog you really wanted _syslog$DARWIN_EXTSN.

Observe:

$ cat syslog.c
#include <syslog.h>

int main(void) {
    openlog("syslogexample", 0, LOG_LOCAL1);
    syslog(LOG_NOTICE, "FURBLE");
    closelog();
    return 0;
}
$ cc syslog.c
$ otool -tV a.out | grep callq
0000000100000f3d        callq   0x100000f68 ## symbol stub for: _openlog
0000000100000f50        callq   0x100000f6e ## symbol stub for: _syslog$DARWIN_EXTSN
0000000100000f55        callq   0x100000f62 ## symbol stub for: _closelog

Indeed the syslog.h header is renaming the symbol for us.

$ cc -E syslog.c | grep 'void syslog'
void syslog(int, const char *, ...) __asm("_" "syslog" "$DARWIN_EXTSN") __attribute__((__format__ (__printf__, 2, 3))) __attribute__((__not_tail_called__));

And in the Apple syslog sources in syslog.c we find

extern typeof(syslog) syslog_legacy asm("_syslog");
extern typeof(syslog) syslog_os_log asm("_syslog" __DARWIN_SUF_EXTSN);

void
syslog_legacy(int pri, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    _vsyslog(pri, fmt, ap, __builtin_return_address(0), true);
    va_end(ap);
}

void
syslog_os_log(int pri, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    _vsyslog(pri, fmt, ap, __builtin_return_address(0), false);
    va_end(ap);
}

In other words, calling (the mach-o-object symbol) _syslog gets you syslog_legacy, which winds up calling legacy asl(3) APIs in a non-thread-safe way. Whereas calling (the mach-o-object symbol) _syslog$DARWIN_EXTSN gets you syslog_os_log, which instead calls os_log_with_args, which appears to be thread safe (I couldn't make it crash) but I'm not certain because I couldn't find sources for Apple's os_log(3) APIs anywhere.

Tying this back in to cl-syslog: the core of cl-syslog is a pretty thin wrapper over good ol' syslog, and is defined like so:

(cffi:defcfun "syslog" :void
  (priority :int)
  (format :string)
  &rest)

If I apply the following patch, then memory corruption woes disappear when cl-sysloging on SBCL/macOS:

diff --git a/cl-syslog.asd b/cl-syslog.asd
index 760be33..1f5ae1f 100644
--- a/cl-syslog.asd
+++ b/cl-syslog.asd
@@ -3,7 +3,7 @@
   :version (:read-file-form "VERSION.txt")
   :licence "MIT (See LICENSE)"
   :description "Common Lisp syslog interface."
-  :depends-on (#:alexandria #:cffi #:global-vars #:usocket #:split-sequence #:babel #:local-time
+  :depends-on (#:alexandria #:cffi #:global-vars #:usocket #:split-sequence #:babel #:local-time #:trivial-features
                #+sbcl #:sb-posix)
   :in-order-to ((asdf:test-op (asdf:test-op #:cl-syslog-tests)))
   :components ((:file "package")
diff --git a/cl-syslog.lisp b/cl-syslog.lisp
index 2b33bdf..95eb197 100644
--- a/cl-syslog.lisp
+++ b/cl-syslog.lisp
@@ -31,7 +31,9 @@

 (cffi:defcfun "closelog" :void)

-(cffi:defcfun "syslog" :void
+(cffi:defcfun (syslog #-os-macosx "syslog"
+                      #+os-macosx "syslog$DARWIN_EXTSN")
+    :void
   (priority :int)
   (format :string)
   &rest)

That syslog$DARWIN_EXTSN is gross, but would be more palatable if there is a way to grovel the desired symbol name for syslog out of the compiler/linker.

Weird fact no. 47: calling syslog$DARWIN_EXTSN doesn't appear to actually send messages to syslogd? I'm still not 100% sure about that because macOS makes it surprisingly difficult to kill syslogd. launchctl keeps restarting it, but doesn't give you the option to shut the service down. In my desperation, I even tried mving the binary aside to thwart launchctl, but got an "operation not permitted", even though I was root! I never thought I'd say it, but I miss systemd. But I digress. The point is that messages logged with syslog$DARWIN_EXTSN do appear in Console.app, but do not appear in /var/log/system.log, which I think means they aren't actually sent to syslogd, but maybe that is neither here nor there. My point is that it would be slightly awkward for a library called cl-syslog if wasn't actually sending messages to syslogd... On the other hand, this is exactly what you get when you call syslog from a C source file on macOS, so maybe it's not so bad.

appleby commented 4 years ago

Did some more digging on this today. The good news, I was able to use cffi-grovel's DEFWRAPPER to hack around the issue of syslog vs syslog$DARWIN_EXTSN on recent macOS in a way that is slightly less disgusting than hard coding syslog_$DARWIN_EXTSN as the foreign name.

The bad news: it appears that syslog$DARWIN_EXTSN is not thread safe either, although it is harder to trigger the bug via syslog$DARWIN_EXTSN than via syslog, possibly because the $DARWIN_EXTSN version replaces the unsafe _asl_lib_vlog with the (maybe safe? source not available) os_log_with_args, but both versions still call _asl_evaluate_send, which is apparently not safe.

Unhandled memory fault at #x6.
   [Condition of type SB-SYS:MEMORY-FAULT-ERROR]

Restarts:
 0: [ABORT] abort thread (#<THREAD "Anonymous thread" RUNNING {100670A983}>)

Backtrace:
  0: ("bogus stack frame")
  1: ("foreign function: asl_msg_lookup")
  2: ("foreign function: _asl_evaluate_send")
  3: ("foreign function: _vsyslog")
  4: ("foreign function: syslog$DARWIN_EXTSN")
  5: ("foreign function: syslog_cffi_darwin_wrapper")
  6: (CL-SYSLOG:LOG "foo" :USER :INFO "hello" 0)
  7: ((LAMBDA NIL :IN #:DROP-THRU-TAG-1))
  8: ((LAMBDA NIL :IN BORDEAUX-THREADS::BINDING-DEFAULT-SPECIALS))
  9: ((FLET SB-UNIX::BODY :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE))
 10: ((FLET "WITHOUT-INTERRUPTS-BODY-4" :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE))
 11: ((FLET SB-THREAD::WITH-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE))
 12: ((FLET "WITHOUT-INTERRUPTS-BODY-1" :IN SB-THREAD::CALL-WITH-MUTEX))
 13: (SB-THREAD::CALL-WITH-MUTEX #<CLOSURE (FLET SB-THREAD::WITH-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE) {7E0DD9B}> #<SB-THREAD:MUTEX "thread result lock" owner: #<SB-THREAD:THREAD "..
 14: (SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE #<SB-THREAD:THREAD "Anonymous thread" RUNNING {100670A983}> NIL #<CLOSURE (LAMBDA NIL :IN BORDEAUX-THREADS::BINDING-DEFAULT-SPECIALS) {100670A92B}> NIL)
 15: ("foreign function: call_into_lisp")
 16: ("foreign function: new_thread_trampoline")
 17: ("foreign function: _pthread_body")
 18: ("foreign function: _pthread_start")
 19: ("foreign function: thread_start")

For the record, here is the proof-of-concept CL-SYSLOG patch that installs a syslog wrapper and arranges to call it instead of _syslog on #+DARWIN, which, as mentioned above, mitigates but does not fix the problem :(.

diff --git a/cl-syslog.asd b/cl-syslog.asd
index 760be33..e61ead8 100644
--- a/cl-syslog.asd
+++ b/cl-syslog.asd
@@ -3,11 +3,15 @@
   :version (:read-file-form "VERSION.txt")
   :licence "MIT (See LICENSE)"
   :description "Common Lisp syslog interface."
+  :defsystem-depends-on ("cffi-grovel")
   :depends-on (#:alexandria #:cffi #:global-vars #:usocket #:split-sequence #:babel #:local-time
-               #+sbcl #:sb-posix)
+               #:trivial-features #+sbcl #:sb-posix)
   :in-order-to ((asdf:test-op (asdf:test-op #:cl-syslog-tests)))
+  :serial t
   :components ((:file "package")
                (:file "variable")
+               #+darwin
+               (:cffi-wrapper-file "macos-syslog-wrapper")
                (:file "cl-syslog")
                (:file "rfc5424")
                (:file "rfc5424-reserved")
diff --git a/cl-syslog.lisp b/cl-syslog.lisp
index 2b33bdf..989a2b4 100644
--- a/cl-syslog.lisp
+++ b/cl-syslog.lisp
@@ -31,7 +31,9 @@

 (cffi:defcfun "closelog" :void)

-(cffi:defcfun "syslog" :void
+(cffi:defcfun (syslog #-darwin "syslog"
+                      #+darwin "syslog_cffi_darwin_wrapper")
+    :void
   (priority :int)
   (format :string)
   &rest)
diff --git a/macos-syslog-wrapper.lisp b/macos-syslog-wrapper.lisp
new file mode 100644
index 0000000..cf21a27
--- /dev/null
+++ b/macos-syslog-wrapper.lisp
@@ -0,0 +1,35 @@
+(in-package #:syslog)
+
+(include "syslog.h")
+(include "stdarg.h")
+
+;; This flag is needed because CFFI-TOOLCHAIN winds up calling
+;; CFFI-TOOLCHAIN:SBCL-TOOLCHAIN-PARAMETERS on SBCL, which reads
+;; src/runtime/sbcl.mk and sets CFFI-TOOLCHAIN:*CC-FLAGS* to whatever
+;; sbcl.mk defines for CFLAGS. On my macOS 10.14 machine, sbcl.mk's
+;; CFLAGS includes the flag -mmacosx-version-min=10.6. According to
+;; syslog.h on my mac, __ENVIRONMENT_MAC_OS_X_VERSION_MIN_REQUIRED__
+;; must be at least __MAC_10_13 in order to get the thread-safe
+;; _syslog$DARWIN_EXTSN, rather than the unsafe _syslog.
+(flag "-mmacosx-version-min=10.13")
+
+
+;; DEFWRAPPER doesn't support the use of &REST to indicate varargs
+;; like DEFCFUN, neither does it support :VOID return type, so we
+;; resort to a small C snippet to get 'er done.
+;;
+;; There is an open PR to add support for VOID return type to
+;; DEFWRAPPER: https://github.com/cffi/cffi/pull/144
+;;
+;; (defwrapper "syslog" :void
+;;   (priority :int)
+;;   (format :string)
+;;   &rest)
+(c
+ "void syslog_cffi_darwin_wrapper(int priority, char* format, ...)"
+ "{"
+ "    va_list ap;"
+ "    va_start(ap, format);"
+ "    syslog(priority, format, ap);"
+ "    va_end(ap);"
+ "}")