Windows: Error zmq-ENOTSOCK

fleimgruber commented 5 years ago

During the follow up in https://github.com/dzop/emacs-zmq/issues/3#issuecomment-480388380, I tried debugging on another Windows machine and ran into another error. Since it surfaces in the subprocessing, it might be related, but I thought opening a new issue would be good in any case. Using the precompiled DLLs, the tests work from the command line. But when actually used from within a GUI emacs with an org-mode source block, I get the error error in process filter: progn: Error in ZMQ subprocess: zmq-ENOTSOCK.

With (setq zmq--subprocess-debug t) here is the full traceback: https://gist.github.com/fleimgruber/7b0702b6f9beda90f125885de8be27d5

nnicandro commented 5 years ago

Can you try changing utf-8-unix to utf-8-auto in zmq-subprocess-read and try again.

I think the ("Unknown error") you are getting from the subprocess is actually from

https://github.com/emacs-mirror/emacs/blob/acec59c9f9c6d552c54ad448f4be898d8ee36f31/src/coding.c#L803

don't know why its reported as zmq-ENOTSOCK.

fleimgruber commented 5 years ago

I applied the suggested change and tried again. I get basically the same error (with some obvious differences wrt hashes and ports), see updated gist.

nnicandro commented 5 years ago

Hmmm..., I still feel like it has to do with encoding since the link I mentioned is the only place where "Unknown error" seems to occur, both in the Emacs sources and the libzmq sources. Could you comment out the line

https://github.com/dzop/emacs-zmq/blob/b35b0b5fcd152a959fb9fcabb8161033ca073828/zmq.el#L506

and see if that changes anything. Also can you show the value of default-process-coding-system on the Windows machine that doesn't work and also on the one that does.

fleimgruber commented 5 years ago

default-process-coding-system has the value (utf-8-dos . utf-8-unix) on the machine that does not work. On the same machine, commenting out the line you suggested led to the same error zmq-ENOTSOCK. The machine I used in #3 is not accessible at the moment, because of a switch of company hardware. The current tests work though, so maybe we could try to come up with a failing test case for this issue and work on from there?

nnicandro commented 5 years ago

From the backtrace you posted, the error happens during the jupyter-ioloop-wait-until call when waiting for the channel subprocess to send a start event. The function below should reproduce the error.

(cl-defmethod jupyter-ioloop-handler ((_ioloop jupyter-channel-ioloop)
                                      (_ (eql :emacs-zmq-issue10))
                                      event)
  (message "EVENT: %s" event))

(defun emacs-zmq-issue10 ()
  (let ((ioloop (jupyter-channel-ioloop))
        (session (jupyter-session :key "foo")))
    (jupyter-ioloop-start ioloop session :emacs-zmq-issue10)
    (unwind-protect
        (jupyter-ioloop-wait-until ioloop 'start #'identity)
      (delete-process (oref ioloop process)))))

nnicandro commented 5 years ago

OK, I've stripped down my previous code so that Jupyter isn't required. You should also be able to reproduce the error with the following code

(let ((proc
       (zmq-start-process
        `(lambda (ctx)
           (let ((poller (zmq-poller)))
             (zmq-poller-add poller 0 zmq-POLLIN)
             (let (events)
               (condition-case nil
                   (progn
                     (zmq-prin1 '(start))
                     (while t
                       (setq events (condition-case nil
                                        (zmq-poller-wait-all poller 1 200)
                                      ((zmq-EAGAIN zmq-EINTR zmq-ETIMEDOUT) nil)))
                       (let ((stdin-event (assq 0 events)))
                         (when stdin-event
                           (setq events (delq stdin-event events))
                           ;; Whenever the parent process sends something using
                           ;; `zmq-subprocess-send', just print it back to the parent
                           ;; Emacs process.
                           (zmq-prin1 (zmq-subprocess-read))))))
                 (quit
                  (zmq-prin1 '(quit)))))))
        :filter (lambda (event)
                  ;; Show a message for every list the subprocess prints using `zmq-prin1'
                  (message "EVENT: %s" event)))))
  (unwind-protect
      (progn
        (sleep-for 0.2)
        (zmq-subprocess-send proc (list 'foo "bar"))
        (sleep-for 0.2))
    (delete-process proc)))

fleimgruber commented 5 years ago

Thanks, I can confirm this reproduces, see backtrace here. Now I am wondering how we could pick it apart to get at the root cause.

nnicandro commented 5 years ago

Can you replace

(zmq-poller-wait-all poller 1 200)

with

(progn
  (sleep-for 0.2)
  nil)

and see if you still get the error.

fleimgruber commented 5 years ago

That change does not cause any error and prints this to *Messages*

EVENT: (start)
nil

nnicandro commented 5 years ago

OK, I was thinking that it may have something to do with polling STDIN since you got the zmq-ENOTSOCK as the error symbol. Still not sure why you get that error.

Can you then put back the zmq-poller-wait-all call and replace

(zmq-subprocess-read)

with

(read (base64-decode-string (read-minibuffer "")))

nnicandro commented 5 years ago

If that still gives you the error, then remove the whole (let ((stdin-event (assq 0 events))) ... block and try again. If you still get an error in this case, it probably does have to do with polling STDIN.

fleimgruber commented 5 years ago

The first one, zmq-poller-wait-all and (read (base64-decode-string (read-minibuffer ""))) gives the same error as does the variant without the (let ((stdin-event (assq 0 events))) ... part.

nnicandro commented 5 years ago

OK, it does look like an issue with trying to poll STDIN using zmq. From the zmq documentation http://api.zeromq.org/master:zmq-poller

zmq_poller_add_fd, zmq_poller_modify_fd and zmq_poller_remove_fd are analogous to the previous functions but manage regular file descriptiors registered with a poller. On Windows, these functions can only be used with WinSock sockets.

zmq_poller_add_fd is what gets called by (zmq-poller-add poller 0 zmq-POLLIN). So it looks like the zmq-ENOTSOCK error is happening because we are trying to poll 0 (STDIN) which is not a WinSock, but a file handle in Windows.

The question is why did this previously work? What version of ZMQ are you using, i.e. what is the output of (zmq-version)? And do you remember if this was different whenever you had it working on your old machine?

Maybe we can use zmq-poll instead, although I believe the zmq-poller interface is meant as a replacement for zmq-poll. From the documentation of zmq-poll there is no mention of any restriction about WinSock sockets. Can you try again with the following

(let ((proc
       (zmq-start-process
        `(lambda (ctx)
           (let ((poll-items (list (cons 0 (list zmq-POLLIN)))))
             (let (events)
               (condition-case nil
                   (progn
                     (zmq-prin1 '(start))
                     (while t
                       (setq events (condition-case nil
                                        (zmq-poll poll-items 200)
                                      ((zmq-EAGAIN zmq-EINTR zmq-ETIMEDOUT) nil)))
                       (let ((stdin-event (assq 0 events)))
                         (when stdin-event
                           (setq events (delq stdin-event events))
                           ;; Whenever the parent process sends something using
                           ;; `zmq-subprocess-send', just print it back to the parent
                           ;; Emacs process.
                           (zmq-prin1 (zmq-subprocess-read))))))
                 (quit
                  (zmq-prin1 '(quit)))))))
        :filter (lambda (event)
                  ;; Show a message for every list the subprocess prints using `zmq-prin1'
                  (message "EVENT: %s" event)))))
  (unwind-protect
      (progn
        (sleep-for 0.2)
        (zmq-subprocess-send proc (list 'foo "bar"))
        (sleep-for 0.2))
    (delete-process proc)))

fleimgruber commented 5 years ago

Same error, see traceback.

About your question: Thinking about the history of this again I think there was a misunderstanding caused by me with regards to what worked on the previous machine. On the old machine, I just ran the emacs-zmq test suite successfully, but postponed the tests for emacs-jupyter. When https://github.com/dzop/emacs-zmq/issues/3#issuecomment-478243462 came up, I wanted to reproduce it and thought: Well, I will try this right away with the new machine. I first tested with emacs-jupyter since it was still on my TODO list and hit the zmq-ENOTSOCK issue. To be clear, the test suite make test (including the zmq-subprocess test) still succeeds.

nnicandro commented 5 years ago

The test suite passes because there are no tests that poll STDIN and doing that seems to be an issue on Windows, both with zmq-poll and zmq-poller. Polling in Windows seems to only work on WinSock objects, but STDIN is a file handle in Windows. I found some info about this on StackOverflow (https://stackoverflow.com/questions/396118/mixing-file-handles-and-sockets-in-winsock).

There is actually a way we can get rid of this problem in emacs-jupyter by not using a separate subprocess to handle messages when the OS is a Windows system. This would come at a slow down since all the message parsing would happen in the current Emacs process, but I've never tested if the slow down is bearable or not. Your Emacs will have to have been built with threading support (I believe this is the default) and you will have to replace jupyter-channel-ioloop-comm with jupyter-channel-comm at the following locations

It would be great if you could test this to see if it works for you.

I would like to find a solution that also works with jupyter-channel-ioloop-comm, the issue is that reading from STDIN using zmq-subprocess-read is blocking which is why we only want to read from STDIN only when necessary so that the subprocess can spend most of its time parsing messages instead of waiting on STDIN.

nnicandro commented 5 years ago

Can you let me know if the following works without error. The idea here is to, instead of reading from STDIN, create a new network connection to a port on the localhost and use that to have the parent Emacs process communicate with its subprocess. This avoids having the subprocess wait for input from stdin and allows it to continue to parse messages from the kernel as well as read input from its parent process.

(defvar jupyter-ioloop-pending-input nil)

(let* ((procs (make-network-process
               :name "sin"
               :host 'local
               :service t
               :type 'datagram
               :server 1
               :filter (lambda (proc data)
                         (cl-callf nconc jupyter-ioloop-pending-input
                           (list (read (decode-coding-string
                                        (base64-decode-string data)
                                        'utf-8-auto)))))))
       (procc (make-network-process
               :name "cin"
               :host 'local
               :type 'datagram
               :service (cadr (process-contact procs)))))
  (unwind-protect
      (with-temp-buffer
        (prin1 (list 'foo "bar") (current-buffer))
        (encode-coding-region (point-min) (point-max) 'utf-8-auto)
        (base64-encode-region (point-min) (point-max) t)
        (process-send-region procc (point-min) (point-max)))
    (sleep-for 1)
    (delete-process procc)
    (delete-process procs)
    (message "INPUT: %s" (pop jupyter-ioloop-pending-input))))

fleimgruber commented 5 years ago

With the code from https://github.com/dzop/emacs-zmq/issues/10#issuecomment-487246846 I get this. If still relevant, I prepared the changes in https://github.com/dzop/emacs-zmq/issues/10#issuecomment-487219493, but I am not sure which snippet to test it with.

nnicandro commented 5 years ago

You should be able to test it by tryingjupyter-run-repl. Here is a snippet

(let* ((client (jupyter-run-repl "python"))
       (jupyter-inhibit-handlers t)
       (req (jupyter-send-execute-request client :code "1 + 1")))
  (unwind-protect
      (progn
        (jupyter-add-callback req
          :execute-result
          (jupyter-message-lambda ((res text/plain))
            (message "RESULT: %s" res)))
        (jupyter-wait-until-idle req)
        nil)
    (jupyter-stop-channels client)
    (jupyter-shutdown-kernel (oref client manager))))

fleimgruber commented 5 years ago

With that snippet I get this traceback.

nnicandro commented 5 years ago

Whoops, its supposed to be jupyter-sync-channel-comm.

fleimgruber commented 5 years ago

This seems to get somewhere, but I get a different error, see here.

nnicandro commented 5 years ago

OK at least we have something happening. Can you go to the function jupyter--split-identities and put

(cl-loop for part in parts
         do (message "PART: %s" part))

as the first form to evaluate in the function body, re-evaluate the function, and show the messages printed when you try again. The Jupyter messaging protocol specifies that there should be a message delimiter for each message sent, see jupyter-message-delimiter, and it looks like we are not getting it.

Also note, I have run the Jupyter test suite on my own OS X system using jupyter-sync-channel-comm and all the tests passed so I know that it at least works on Unix systems. This must be another Windows issue.

fleimgruber commented 5 years ago

These are the messages that get printed

PART: kernel.6ad25fa5-e2c0-40c0-b2d4-aaa03c56ecaf.status
PART: <IDS|MSG>
PART: 5a700c075fba8a9d30a0214f700ec7da538960fc6e67dc9244c95ead784dfd8c
PART: {"msg_id":"cec60788-9f7a010cc16f9a54ededc1af","msg_type":"status","username":"username","session":"bd935508-d27606159bb4d343ead4e09b","date":"2019-04-28T19:09:24.098976Z","version":"5.3"}
PART: {} [2 times]
PART: {"execution_state":"starting"}
while: peculiar error: "Message delimiter not in message list"

nnicandro commented 5 years ago

Hmmm... it looks like it is there. Can you do the same thing but with (message "PART: |%s|" part) just to see if there is any extra whitespace issues. The (equal part jupyter-message-delimiter) in jupyter--split-identities should catch the <IDS|MSG> part.

Another thing to try is to place a call to (setq test-parts parts) in jupyter--split-identities to get the list of parts before the error happens, remove the setq call after the error, and step through the function by calling M-x edebug-defun when point is somewhere inside jupyter--split-identities and then evaluating (jupyter--split-identities test-parts). This should drop you into edebug you can step through each call using SPC and you can exit edebug with q. After you are done you can put the function back into its normal state without the edebug instrumentation using M-x eval-defun. Using edebug should give you a clearer picture of where things are going wrong.

fleimgruber commented 5 years ago

Huh, it seems that now parts is always the empty string "" and I am not able to reproduce the messages printed above... also I sometimes get the error Kernel did not read connection file within timeout and then Emacs just crashes with "A fatal error has occurred!".

fleimgruber commented 5 years ago

It seems that there is something going on which I can't really reproduce. If I redo https://github.com/dzop/emacs-zmq/issues/10#issuecomment-487287086 without the change suggested in https://github.com/dzop/emacs-zmq/issues/10#issuecomment-487405390 I see one of two errors

Kernel did not read connection file within timeout
Message delimiter not in message list

My guess is that the first one occurs if all is well with the parts (as has been in https://github.com/dzop/emacs-zmq/issues/10#issuecomment-487407118) and the second one occurs when parts is empty.

nnicandro commented 5 years ago

OK, could you remove the signal call in jupyter-sync-channel-comm--check. That function should not signal an error since the thread that checks for messages will exit and this may be the cause of the crash.

With regards to the timeout error, this may be due to the time resolution of a file's last access time on Windows. For synchronization purposes, we wait for a period of time to allow the kernel process to startup after it is launched before proceeding with the rest of the initialization. Its not strictly necessary, but useful for kernels that take a long time to startup. To determine when to stop waiting we check the last access time of the connection file that the kernel uses to determine which ports to use for the Jupyter channels.

To get the access time, the function file-attributes is used and it has some caveats about Windows systems toward the bottom of its documentation string. Also the Windows documentation (https://docs.microsoft.com/en-us/windows/desktop/sysinfo/file-times) doesn't look very promising either.

To work around this you can add (memq system-type '(windows-nt cygwin ms-dos)) to the or form here.

With regards to the message delimiter error, I'm not sure how that could happen. Are you saying that parts is sometimes '("") or ""? It should never be the latter and if its the former then it indicates that there is something wrong with sending/receiving messages so more on the ZMQ side than Jupyter. Maybe this is related to the issue with jupyter-sync-channel-comm--check mentioned above? You can run the following snippet a few times to see if we are reliably receiving messages on ZMQ sockets

(let* ((send (zmq-socket (zmq-current-context) zmq-PUSH))
       (recv (zmq-socket (zmq-current-context) zmq-PULL))
       (msg '("kernel.6ad25fa5-e2c0-40c0-b2d4-aaa03c56ecaf.status"
              "<IDS|MSG>"
              "5a700c075fba8a9d30a0214f700ec7da538960fc6e67dc9244c95ead784dfd8c"
              "{\"msg_id\":\"cec60788-9f7a010cc16f9a54ededc1af\",\"msg_type\":\"status\",\"username\":\"username\",\"session\":\"bd935508-d27606159bb4d343ead4e09b\",\"date\":\"2019-04-28T19:09:24.098976Z\",\"version\":\"5.3\"}"
              "{}"
              "{}"
              "{\"execution_state\":\"starting\"}"))
       port)
  (unwind-protect
      (progn
        (setq port (zmq-bind-to-random-port send "tcp://127.0.0.1"))
        (zmq-connect recv (format "tcp://127.0.0.1:%s" port))
        (zmq-send-multipart send msg)
        (message "EQUAL: %s" (equal msg (zmq-recv-multipart recv))))
    (zmq-close send)
    (zmq-close recv)))

fleimgruber commented 5 years ago

I removed the signal call and added the system-type check. With these changes your snippet messages "EQUAL: t" and does so consistently.

nnicandro commented 5 years ago

OK, I've been able to setup the Jupyter tests on Appveyor (https://github.com/dzop/emacs-jupyter/tree/appveyor) so hopefully I can get these issues ironed out. Are you still getting the Message delimiter not in message list error?

nnicandro commented 5 years ago

Success! If you pull down the appveyor branch emacs-jupyter should work. I was able to get the majority of tests to pass. See https://ci.appveyor.com/project/dzop/emacs-jupyter. I'll merge these changes into the master branch one I figure out how to get all of them to pass.

Thanks for helping me track down these bugs, its greatly appreciated.

fleimgruber commented 5 years ago

I can confirm that the original use case (org-mode python src block) works with the appveyor branch.

Good job! Keep up the excellent work. Thanks for committing to support Windows, it is much appreciated. Also, always happy to help and learn.

nnicandro commented 5 years ago

Closing since the issues discussed have been fixed as of dzop/emacs-jupyter@0357961777b147c1907349530947708da96d451b.

nnicandro / emacs-zmq

Windows: Error zmq-ENOTSOCK #10