Closed fleimgruber closed 5 years ago
Can you try changing utf-8-unix to utf-8-auto in zmq-subprocess-read and try again?
I think the ("Unknown error") you are getting from the subprocess is actually from
Don't know why it's reported as zmq-ENOTSOCK.
I applied the suggested change and tried again. I get basically the same error (with some obvious differences wrt hashes and ports), see updated gist.
Hmmm..., I still feel like it has to do with encoding, since the link I mentioned is the only place where "Unknown error" seems to occur, both in the Emacs sources and the libzmq sources. Could you comment out the line
https://github.com/dzop/emacs-zmq/blob/b35b0b5fcd152a959fb9fcabb8161033ca073828/zmq.el#L506
and see if that changes anything? Also, can you show the value of default-process-coding-system on the Windows machine that doesn't work and also on the one that does?
default-process-coding-system has the value (utf-8-dos . utf-8-unix) on the machine that does not work. On the same machine, commenting out the line you suggested led to the same error, zmq-ENOTSOCK. The machine I used in #3 is not accessible at the moment because of a switch of company hardware. The current tests work though, so maybe we could try to come up with a failing test case for this issue and work on from there?
From the backtrace you posted, the error happens during the jupyter-ioloop-wait-until call when waiting for the channel subprocess to send a start event. The function below should reproduce the error.
(cl-defmethod jupyter-ioloop-handler ((_ioloop jupyter-channel-ioloop)
                                      (_ (eql :emacs-zmq-issue10))
                                      event)
  (message "EVENT: %s" event))
(defun emacs-zmq-issue10 ()
  (let ((ioloop (jupyter-channel-ioloop))
        (session (jupyter-session :key "foo")))
    (jupyter-ioloop-start ioloop session :emacs-zmq-issue10)
    (unwind-protect
        (jupyter-ioloop-wait-until ioloop 'start #'identity)
      (delete-process (oref ioloop process)))))
OK, I've stripped down my previous code so that Jupyter isn't required. You should also be able to reproduce the error with the following code
(let ((proc
       (zmq-start-process
        `(lambda (ctx)
           (let ((poller (zmq-poller)))
             (zmq-poller-add poller 0 zmq-POLLIN)
             (let (events)
               (condition-case nil
                   (progn
                     (zmq-prin1 '(start))
                     (while t
                       (setq events (condition-case nil
                                        (zmq-poller-wait-all poller 1 200)
                                      ((zmq-EAGAIN zmq-EINTR zmq-ETIMEDOUT) nil)))
                       (let ((stdin-event (assq 0 events)))
                         (when stdin-event
                           (setq events (delq stdin-event events))
                           ;; Whenever the parent process sends something using
                           ;; `zmq-subprocess-send', just print it back to the parent
                           ;; Emacs process.
                           (zmq-prin1 (zmq-subprocess-read))))))
                 (quit
                  (zmq-prin1 '(quit)))))))
        :filter (lambda (event)
                  ;; Show a message for every list the subprocess prints using `zmq-prin1'
                  (message "EVENT: %s" event)))))
  (unwind-protect
      (progn
        (sleep-for 0.2)
        (zmq-subprocess-send proc (list 'foo "bar"))
        (sleep-for 0.2))
    (delete-process proc)))
Thanks, I can confirm this reproduces, see backtrace here. Now I am wondering how we could pick it apart to get at the root cause.
Can you replace
(zmq-poller-wait-all poller 1 200)
with
(progn
  (sleep-for 0.2)
  nil)
and see if you still get the error?
That change does not cause any error and prints this to *Messages*
EVENT: (start)
nil
OK, I was thinking that it may have something to do with polling STDIN since you got zmq-ENOTSOCK as the error symbol. Still not sure why you get that error.
Can you then put back the zmq-poller-wait-all call and replace
(zmq-subprocess-read)
with
(read (base64-decode-string (read-minibuffer "")))
If that still gives you the error, then remove the whole (let ((stdin-event (assq 0 events))) ... block and try again. If you still get an error in this case, it probably does have to do with polling STDIN.
The first variant, with zmq-poller-wait-all and (read (base64-decode-string (read-minibuffer ""))), gives the same error, as does the variant without the (let ((stdin-event (assq 0 events))) ... part.
OK, it does look like an issue with trying to poll STDIN using zmq. From the zmq documentation (http://api.zeromq.org/master:zmq-poller):
zmq_poller_add_fd, zmq_poller_modify_fd and zmq_poller_remove_fd are analogous to the previous functions but manage regular file descriptors registered with a poller. On Windows, these functions can only be used with WinSock sockets.
zmq_poller_add_fd is what gets called by (zmq-poller-add poller 0 zmq-POLLIN). So it looks like the zmq-ENOTSOCK error is happening because we are trying to poll 0 (STDIN), which is not a WinSock socket but a file handle on Windows.
The question is why did this previously work? What version of ZMQ are you using, i.e. what is the output of (zmq-version)? And do you remember if this was different when you had it working on your old machine?
Maybe we can use zmq-poll instead, although I believe the zmq-poller interface is meant as a replacement for zmq-poll. The documentation of zmq-poll does not mention any restriction about WinSock sockets. Can you try again with the following
(let ((proc
       (zmq-start-process
        `(lambda (ctx)
           (let ((poll-items (list (cons 0 (list zmq-POLLIN)))))
             (let (events)
               (condition-case nil
                   (progn
                     (zmq-prin1 '(start))
                     (while t
                       (setq events (condition-case nil
                                        (zmq-poll poll-items 200)
                                      ((zmq-EAGAIN zmq-EINTR zmq-ETIMEDOUT) nil)))
                       (let ((stdin-event (assq 0 events)))
                         (when stdin-event
                           (setq events (delq stdin-event events))
                           ;; Whenever the parent process sends something using
                           ;; `zmq-subprocess-send', just print it back to the parent
                           ;; Emacs process.
                           (zmq-prin1 (zmq-subprocess-read))))))
                 (quit
                  (zmq-prin1 '(quit)))))))
        :filter (lambda (event)
                  ;; Show a message for every list the subprocess prints using `zmq-prin1'
                  (message "EVENT: %s" event)))))
  (unwind-protect
      (progn
        (sleep-for 0.2)
        (zmq-subprocess-send proc (list 'foo "bar"))
        (sleep-for 0.2))
    (delete-process proc)))
Same error, see traceback.
About your question: Thinking about the history of this again, I think I caused a misunderstanding about what worked on the previous machine. On the old machine, I just ran the emacs-zmq test suite successfully, but postponed the tests for emacs-jupyter. When https://github.com/dzop/emacs-zmq/issues/3#issuecomment-478243462 came up, I wanted to reproduce it and thought: Well, I will try this right away with the new machine. I first tested with emacs-jupyter since it was still on my TODO list and hit the zmq-ENOTSOCK issue. To be clear, the test suite make test (including the zmq-subprocess test) still succeeds.
The test suite passes because there are no tests that poll STDIN, and doing that seems to be an issue on Windows, both with zmq-poll and zmq-poller. Polling on Windows seems to only work on WinSock objects, but STDIN is a file handle on Windows. I found some info about this on StackOverflow (https://stackoverflow.com/questions/396118/mixing-file-handles-and-sockets-in-winsock).
There is actually a way we can get rid of this problem in emacs-jupyter by not using a separate subprocess to handle messages when the OS is a Windows system. This would come with a slowdown since all the message parsing would happen in the current Emacs process, but I've never tested whether the slowdown is bearable or not. Your Emacs will have to have been built with threading support (I believe this is the default) and you will have to replace jupyter-channel-ioloop-comm with jupyter-channel-comm at the following locations
It would be great if you could test this to see if it works for you.
I would like to find a solution that also works with jupyter-channel-ioloop-comm. The issue is that reading from STDIN using zmq-subprocess-read is blocking, which is why we want to read from STDIN only when necessary, so that the subprocess can spend most of its time parsing messages instead of waiting on STDIN.
Can you let me know if the following works without error. The idea here is to, instead of reading from STDIN, create a new network connection to a port on the localhost and use that to have the parent Emacs process communicate with its subprocess. This avoids having the subprocess wait for input from stdin and allows it to continue to parse messages from the kernel as well as read input from its parent process.
(defvar jupyter-ioloop-pending-input nil)
(let* ((procs (make-network-process
               :name "sin"
               :host 'local
               :service t
               :type 'datagram
               :server 1
               :filter (lambda (proc data)
                         (cl-callf nconc jupyter-ioloop-pending-input
                           (list (read (decode-coding-string
                                        (base64-decode-string data)
                                        'utf-8-auto)))))))
       (procc (make-network-process
               :name "cin"
               :host 'local
               :type 'datagram
               :service (cadr (process-contact procs)))))
  (unwind-protect
      (with-temp-buffer
        (prin1 (list 'foo "bar") (current-buffer))
        (encode-coding-region (point-min) (point-max) 'utf-8-auto)
        (base64-encode-region (point-min) (point-max) t)
        (process-send-region procc (point-min) (point-max)))
    (sleep-for 1)
    (delete-process procc)
    (delete-process procs)
    (message "INPUT: %s" (pop jupyter-ioloop-pending-input))))
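For anyone who wants to check the encoding step separately from the network part, here is a minimal sketch of the same print, encode, base64 round trip done on strings only. The helper names my/encode-sexp and my/decode-sexp are made up for illustration and are not part of emacs-zmq or emacs-jupyter, and the sketch assumes plain utf-8 rather than the utf-8-auto used in the snippet above.

```elisp
;; Sketch only: `my/encode-sexp' and `my/decode-sexp' are hypothetical
;; helper names, not functions from emacs-zmq or emacs-jupyter.

(defun my/encode-sexp (obj)
  "Return OBJ printed, UTF-8 encoded and base64 encoded on one line."
  (base64-encode-string
   (encode-coding-string (prin1-to-string obj) 'utf-8)
   t))                                  ; t = no line breaks in the output

(defun my/decode-sexp (str)
  "Inverse of `my/encode-sexp': decode STR and read one Lisp object."
  (read (decode-coding-string (base64-decode-string str) 'utf-8)))

;; Round trip of the same payload the snippet above sends.
(my/decode-sexp (my/encode-sexp (list 'foo "bar")))
```

If the round trip gives back an object equal to the input, the encoding side is probably fine and any remaining failure is more likely in the transport.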
With the code from https://github.com/dzop/emacs-zmq/issues/10#issuecomment-487246846 I get this. If still relevant, I prepared the changes in https://github.com/dzop/emacs-zmq/issues/10#issuecomment-487219493, but I am not sure which snippet to test it with.
You should be able to test it by trying jupyter-run-repl. Here is a snippet
(let* ((client (jupyter-run-repl "python"))
       (jupyter-inhibit-handlers t)
       (req (jupyter-send-execute-request client :code "1 + 1")))
  (unwind-protect
      (progn
        (jupyter-add-callback req
          :execute-result
          (jupyter-message-lambda ((res text/plain))
            (message "RESULT: %s" res)))
        (jupyter-wait-until-idle req)
        nil)
    (jupyter-stop-channels client)
    (jupyter-shutdown-kernel (oref client manager))))
With that snippet I get this traceback.
Whoops, it's supposed to be jupyter-sync-channel-comm.
This seems to get somewhere, but I get a different error, see here.
OK, at least we have something happening. Can you go to the function jupyter--split-identities and put
(cl-loop for part in parts
         do (message "PART: %s" part))
as the first form to evaluate in the function body, re-evaluate the function, and show the messages printed when you try again? The Jupyter messaging protocol specifies that there should be a message delimiter for each message sent (see jupyter-message-delimiter), and it looks like we are not getting it.
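To make the delimiter check concrete, here is a rough sketch of what splitting a multipart message at the <IDS|MSG> frame could look like. This is not the actual body of jupyter--split-identities; the names my/message-delimiter and my/split-identities are hypothetical, and only the delimiter value and the error text are taken from the Jupyter wire protocol and this thread.

```elisp
(require 'cl-lib)

;; Sketch only: hypothetical names, not the emacs-jupyter implementation.
(defconst my/message-delimiter "<IDS|MSG>"
  "Delimiter frame separating routing identities from the message.")

(defun my/split-identities (parts)
  "Split PARTS, a list of frames, at the delimiter frame.
Return (IDENTITIES . REST), where REST holds the frames after the
delimiter.  Signal an error when no frame is `equal' to the delimiter."
  (let ((idx (cl-position my/message-delimiter parts :test #'equal)))
    (unless idx
      (error "Message delimiter not in message list"))
    (cons (cl-subseq parts 0 idx)
          (nthcdr (1+ idx) parts))))
```

Because the comparison uses equal, a frame with stray whitespace or a list such as ("") would not match and would end in the same error.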
Also note, I have run the Jupyter test suite on my own OS X system using jupyter-sync-channel-comm and all the tests passed, so I know that it at least works on Unix systems. This must be another Windows issue.
These are the messages that get printed
PART: kernel.6ad25fa5-e2c0-40c0-b2d4-aaa03c56ecaf.status
PART: <IDS|MSG>
PART: 5a700c075fba8a9d30a0214f700ec7da538960fc6e67dc9244c95ead784dfd8c
PART: {"msg_id":"cec60788-9f7a010cc16f9a54ededc1af","msg_type":"status","username":"username","session":"bd935508-d27606159bb4d343ead4e09b","date":"2019-04-28T19:09:24.098976Z","version":"5.3"}
PART: {} [2 times]
PART: {"execution_state":"starting"}
while: peculiar error: "Message delimiter not in message list"
Hmmm... it looks like it is there. Can you do the same thing but with (message "PART: |%s|" part), just to see if there are any extra whitespace issues? The (equal part jupyter-message-delimiter) in jupyter--split-identities should catch the <IDS|MSG> part.
Another thing to try is to place a call to (setq test-parts parts) in jupyter--split-identities to get the list of parts before the error happens, remove the setq call after the error, and step through the function by calling M-x edebug-defun when point is somewhere inside jupyter--split-identities and then evaluating (jupyter--split-identities test-parts). This should drop you into edebug; you can step through each call using SPC and you can exit edebug with q. After you are done you can put the function back into its normal state without the edebug instrumentation using M-x eval-defun. Using edebug should give you a clearer picture of where things are going wrong.
Huh, it seems that now parts is always the empty string "" and I am not able to reproduce the messages printed above... also I sometimes get the error Kernel did not read connection file within timeout and then Emacs just crashes with "A fatal error has occurred!".
It seems that there is something going on which I can't really reproduce. If I redo https://github.com/dzop/emacs-zmq/issues/10#issuecomment-487287086 without the change suggested in https://github.com/dzop/emacs-zmq/issues/10#issuecomment-487405390 I see one of two errors
My guess is that the first one occurs if all is well with the parts (as has been in https://github.com/dzop/emacs-zmq/issues/10#issuecomment-487407118) and the second one occurs when parts is empty.
OK, could you remove the signal call in jupyter-sync-channel-comm--check? That function should not signal an error since the thread that checks for messages will exit, and this may be the cause of the crash.
With regards to the timeout error, this may be due to the time resolution of a file's last access time on Windows. For synchronization purposes, we wait for a period of time to allow the kernel process to start up after it is launched before proceeding with the rest of the initialization. It's not strictly necessary, but useful for kernels that take a long time to start up. To determine when to stop waiting, we check the last access time of the connection file that the kernel uses to determine which ports to use for the Jupyter channels.
To get the access time, the function file-attributes is used and it has some caveats about Windows systems toward the bottom of its documentation string. Also the Windows documentation (https://docs.microsoft.com/en-us/windows/desktop/sysinfo/file-times) doesn't look very promising either.
To work around this you can add (memq system-type '(windows-nt cygwin ms-dos)) to the or form here.
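As an illustration of the idea only, a predicate of roughly this shape would skip the access-time wait on those systems. The function name my/connection-file-read-p and its arguments are hypothetical, and the real or form in emacs-jupyter is not reproduced here, so treat this as a sketch under those assumptions.

```elisp
;; Sketch only: `my/connection-file-read-p' is a hypothetical name and the
;; surrounding `or' form in emacs-jupyter may look quite different.
(defun my/connection-file-read-p (file launch-time)
  "Return non-nil when waiting on FILE's access time should stop.
LAUNCH-TIME is a time value taken just before the kernel was started."
  (or
   ;; Assumption from this thread: access times are not reliable enough
   ;; on these systems, so do not wait on them at all.
   (memq system-type '(windows-nt cygwin ms-dos))
   ;; Elsewhere, stop once FILE has been accessed after LAUNCH-TIME.
   (let ((atime (file-attribute-access-time (file-attributes file))))
     (and atime (time-less-p launch-time atime)))))
```

With this shape the check returns early on Windows and only falls back to comparing access times elsewhere; a missing file simply yields nil.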
With regards to the message delimiter error, I'm not sure how that could happen. Are you saying that parts is sometimes '("") or ""? It should never be the latter, and if it's the former then it indicates that there is something wrong with sending/receiving messages, so more on the ZMQ side than Jupyter. Maybe this is related to the issue with jupyter-sync-channel-comm--check mentioned above? You can run the following snippet a few times to see if we are reliably receiving messages on ZMQ sockets
(let* ((send (zmq-socket (zmq-current-context) zmq-PUSH))
       (recv (zmq-socket (zmq-current-context) zmq-PULL))
       (msg '("kernel.6ad25fa5-e2c0-40c0-b2d4-aaa03c56ecaf.status"
              "<IDS|MSG>"
              "5a700c075fba8a9d30a0214f700ec7da538960fc6e67dc9244c95ead784dfd8c"
              "{\"msg_id\":\"cec60788-9f7a010cc16f9a54ededc1af\",\"msg_type\":\"status\",\"username\":\"username\",\"session\":\"bd935508-d27606159bb4d343ead4e09b\",\"date\":\"2019-04-28T19:09:24.098976Z\",\"version\":\"5.3\"}"
              "{}"
              "{}"
              "{\"execution_state\":\"starting\"}"))
       port)
  (unwind-protect
      (progn
        (setq port (zmq-bind-to-random-port send "tcp://127.0.0.1"))
        (zmq-connect recv (format "tcp://127.0.0.1:%s" port))
        (zmq-send-multipart send msg)
        (message "EQUAL: %s" (equal msg (zmq-recv-multipart recv))))
    (zmq-close send)
    (zmq-close recv)))
I removed the signal call and added the system-type check. With these changes your snippet messages "EQUAL: t" and does so consistently.
OK, I've been able to set up the Jupyter tests on Appveyor (https://github.com/dzop/emacs-jupyter/tree/appveyor) so hopefully I can get these issues ironed out. Are you still getting the Message delimiter not in message list error?
Success! If you pull down the appveyor branch, emacs-jupyter should work. I was able to get the majority of tests to pass. See https://ci.appveyor.com/project/dzop/emacs-jupyter. I'll merge these changes into the master branch once I figure out how to get all of them to pass.
Thanks for helping me track down these bugs, it's greatly appreciated.
I can confirm that the original use case (org-mode python src block) works with the appveyor branch.
Good job! Keep up the excellent work. Thanks for committing to support Windows, it is much appreciated. Also, always happy to help and learn.
Closing since the issues discussed have been fixed as of dzop/emacs-jupyter@0357961777b147c1907349530947708da96d451b.
During the follow-up in https://github.com/dzop/emacs-zmq/issues/3#issuecomment-480388380, I tried debugging on another Windows machine and ran into another error. Since it surfaces in the subprocess handling, it might be related, but I thought opening a new issue would be good in any case. Using the precompiled DLLs, the tests work from the command line. But when actually used from within a GUI Emacs with an org-mode source block, I get the error
error in process filter: progn: Error in ZMQ subprocess: zmq-ENOTSOCK
With (setq zmq--subprocess-debug t), here is the full traceback: https://gist.github.com/fleimgruber/7b0702b6f9beda90f125885de8be27d5