pebbe / zmq4

A Go interface to ZeroMQ version 4
BSD 2-Clause "Simplified" License

Incorrect errno #164

Open Miosss opened 4 years ago

Miosss commented 4 years ago

The problem

I issue a non-blocking read on a DEALER socket connected to a ROUTER socket: data, err := client.RecvMessage(zmq.DONTWAIT)

The ROUTER takes at least 1 second to complete the task (because of a sleep()), and I do the read immediately.

I expected to get an EAGAIN error, but instead I got err == nil and len(data) == 0, which looks like a proper empty read.

Situation

From debugging the library, it seems to me that the error originates in this call (RecvBytes, zmq4.go:1077): size, err = C.zmq_msg_recv(&msg, soc.soc, C.int(flags))

Here, size == -1 but err == nil. Therefore errget(err), called with nil, returns nil instead of the true error.
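The failure path described here can be modelled without libzmq. This is a minimal sketch in plain Go, not the library's actual code. fakeRecv is a hypothetical stand-in for the cgo call that returns -1 together with a nil error. errgetOld is an assumed approximation of an errget that only inspects the error it is given.

```go
package main

import (
	"fmt"
	"syscall"
)

// fakeRecv is a hypothetical stand-in for C.zmq_msg_recv(...) called via cgo.
// It models the reported case: the C function fails (size == -1), but the
// errno visible to cgo is 0, so the second return value is a nil error.
func fakeRecv() (int, error) {
	return -1, nil
}

// errgetOld approximates (assumption) an errget that only looks at its argument.
func errgetOld(err error) error {
	if eno, ok := err.(syscall.Errno); ok {
		return eno
	}
	return err // nil in, nil out
}

// recvBytes mirrors the shape of the check described in the report:
// the size signals the failure, but the error value is taken from err.
func recvBytes() ([]byte, error) {
	size, err := fakeRecv()
	if size < 0 {
		return []byte{}, errgetOld(err)
	}
	return make([]byte, size), nil
}

func main() {
	data, err := recvBytes()
	fmt.Printf("len(data)=%d err=%v\n", len(data), err) // len(data)=0 err=<nil>
}
```

The caller cannot tell this result apart from a successful read of an empty message.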

Maybe errget should do something when it is called with a nil argument?

I believe that the root cause of this particular problem is not using zmq_errno. The documentation of that function says that it should be used to get errno properly, for example when the application links to a different C runtime than libzmq does.

This is probably my case: this happens on Windows, where I have libzmq.dll built with MSVC and a stub libzmq.a generated with dlltool from the gcc toolchain. So the setup is exotic (but hey, welcome to compiling C libs on Windows + Go + cgo). What's more, calls through C. in Go return the plain errno, which is essentially wrong in this case.

When I tried e := C.zmq_errno() just after the failed read, I got the correct EAGAIN (11) error.

Solutions?

I could probably check C.zmq_errno() after each call, but I am not sure whether that is sufficient. Will the error be cleared after successful calls? EDIT: No, the error is not cleared. And since the returned err is nil, there is no way to know that the C.zmq_errno() result is valid in this situation (plus all the possible threading issues).

One solution may be to drop every err from _, err := C. ... and call C.zmq_errno() instead? But that will require changes in many places.

Maybe modifications to errget will be sufficient? For example, if the argument err is nil, then check C.zmq_errno()?

pebbe commented 4 years ago

Forget about the previous comment. It's all wrong. An interrupted system call gives an EINTR, not an EAGAIN. I undid the changes.

So what is the problem exactly? Provide code that demonstrates.

Miosss commented 4 years ago

@pebbe I am not sure if it is easily reproducible. I believe that the main reason behind this is exactly what zmq_errno() is for. I found it through this SO post.

Look at the definition of this function: int zmq_errno (void) { return errno; }

It returns just the errno - but from the context of the library itself.

In my case I have libzmq.dll built with MSVC, but I use gcc from MSYS2 for cgo. Therefore errno may not be propagated properly, which is the situation described here. Your error handling relies on what the C calling subsystem in Go gives you here: size, err = C.zmq_msg_recv(&msg, soc.soc, C.int(flags))

zmq_msg_recv only returns the size of the message; the err is supplied by Go:

Any C function (even void functions) may be called in a multiple assignment context to retrieve both the return value (if any) and the C errno variable as an error (use _ to skip the result value if the function returns void). For example:

by godoc

So this err is basically the same as just reading errno (which you in fact do in errget).

The problem is that errno in the DLL may be different from errno in the app. libzmq sets an errno which only resides in the DLL, and in my app errno is always 0.

I understand that this problem is why zmq_errno came to be.

The problem itself

I run msg, err := client.RecvMessage(zmq.DONTWAIT)

I am sure that there is no message in the queue, so I should receive msg = nil and err = EAGAIN. This doesn't happen. I get msg = []byte("") (an empty message) and err = nil.

By debugging your code I can see that size, err = C.zmq_msg_recv(&msg, soc.soc, C.int(flags)) in this example returns size = -1 and err = nil. size = -1 clearly indicates that there IS an error, but Go gives you err = nil. In the next if statement you check size to see whether there is an error (and there is), and to get the actual error you look into err, which is nil.

So size says that there is an error, and err says there is none. To me, the cause is what I wrote at the beginning: libzmq sets a different errno than the one cgo returns. You should probably check zmq_errno instead.

Miosss commented 4 years ago

And look at this quote from zmq.h:

/*  This function retrieves the errno as it is known to 0MQ library. The goal */
/*  of this function is to make the code 100% portable, including where 0MQ   */
/*  compiled with certain CRT library (on Windows) is linked to an            */
/*  application that uses different CRT library.                              */
ZMQ_EXPORT int zmq_errno (void);

pebbe commented 4 years ago

I think I may have a fix. Can you try the latest version, please?

Miosss commented 4 years ago

The same situation happens when binding to the same TCP port for the second time: the bind fails silently, without an error. Therefore the process thinks that it can accept messages, while the underlying socket is dead.

This occurs in Bind (zmq4.go:963): i, err = C.zmq4_bind(soc.soc, s)

i = -1 but err = nil, so the same as previously.

I have downloaded your latest version and it seems fixed: this case (binding) now correctly returns an error (though I am not sure why it is error 100 -> Cannot create another system semaphore, but this must be a libzmq thing). I have not yet tested the EAGAIN case, but I believe it is the same as for Bind.