pebbe / zmq4

A Go interface to ZeroMQ version 4
BSD 2-Clause "Simplified" License
1.18k stars 163 forks source link

AuthStop() and Term() causes a hang on shutdown #55

Closed aletheia7 closed 9 years ago

aletheia7 commented 9 years ago

zmq4 version: 4b5b0725b5d36cef61e2744bd7c93067180e3edf (latest) zeromq-4.1.3 (latest) libsodium: 1.0.3 (latest)

There appears to be a race condition with the AuthStop() and Term() functions. Steps to reproduce the problem:

  1. Run the below code
  2. Wait 2 seconds and ctrl-c.
  3. It will hang within 3-5 tries.
  4. A second ctrl-c exists the program.

Comment-out the AuthStop() and Term() code below and the program exits cleanly upon ctrl-c.

Workaround: Do not use AuthStop() and Term(). The program will end just fine.

package main

import (
        "log"
        "os"
        "os/signal"
        "syscall"
        "time"

        z "github.com/pebbe/zmq4"
)

func main() {
        us_public_key := "VVPVGyASr0?}lkRR28ULJj}N4Ut-NC>Q?qZ6!F3N"
        us_secret_key := "lU9<Y]^0CG*gI6N7Gyx.Sb/Q5:SX=[3moZv>B<c9"
        them_public_key := "K!y%@qX/ukA=Tw0E+#Lt.]k{Xu9yzZFf66TK&L@R"
        defer log.Println("After Term(), stopped")
        defer z.Term() // Comment-out to avoid hang
        defer log.Println("before Term()")
        z.AuthStart()
        defer z.AuthStop() // Comment-out to avoid hang
        defer log.Println("before AuthStop()")
        z.AuthCurveAdd("*", them_public_key)
        defer z.AuthCurveRemoveAll("*")
        sock, _ := z.NewSocket(z.ROUTER)
        defer sock.Close()
        sock.SetIdentity("a")
        sock.ClientAuthCurve(them_public_key, us_public_key, us_secret_key)
        connect := "tcp://127.0.0.1:5000"
        sock.SetReconnectIvl(time.Second * 1)
        sock.SetReconnectIvlMax(time.Second * 1)
        sock.Connect(connect)
        defer sock.Disconnect(connect)
        defer sock.SetLinger(0)
        sigc := make(chan os.Signal, 10)
        signal.Notify(sigc, os.Interrupt, syscall.SIGTERM)
        log.Println("Waiting ..")
        <-time.After(3 * time.Second)
        log.Println("ok to ctrl-c")
        <-sigc
        log.Println("received sigc")
}

Also hangs with zeromq-3.2.5.

Thank you for creating the golang library. Works great.

pebbe commented 9 years ago

I have tried this many times, and it never hangs.

2015/08/20 12:21:27 Waiting ..
2015/08/20 12:21:30 ok to ctrl-c
^C2015/08/20 12:21:32 received sigc
2015/08/20 12:21:32 before AuthStop()
2015/08/20 12:21:32 before Term()
2015/08/20 12:21:32 After Term(), stopped

I tried ZeroMQ versions 4.0.7 and 4.1.3 Version 3.2.5 doesn't work because it doesn'y have the Auth functions. I use libsodium 0.4.3

aletheia7 commented 9 years ago

Hmm. How about removing the 3 second timeout? This is repeatable. I'm running Debian testing/stretch. Also was happening on testing/jessie. Try it 5 times in a row without the timeout. It will happen.

aletheia7 commented 9 years ago

I'll provide a shell log of the output today when I get back to the office.

pebbe commented 9 years ago

I tried more then 30 times. It never hangs. Also tried with libsodium 1.0.3

ldd:

        linux-vdso.so.1 =>  (0x00007fff899ff000)
        libzmq.so.5 => /my/opt/zeromq-4.1.3/lib/libzmq.so.5 (0x00007f84ac707000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x00007f84ac4c6000)
        libc.so.6 => /lib/libc.so.6 (0x00007f84ac163000)
        libsodium.so.13 => /my/opt/libsodium-1.0.3/lib/libsodium.so.13 (0x00007f84abf0d000)
        libnorm.so.1 => /my/opt/norm-1.5b4/lib/libnorm.so.1 (0x00007f84abbd3000)
        librt.so.1 => /lib/librt.so.1 (0x00007f84ab9ca000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f84ab6b6000)
        libm.so.6 => /lib/libm.so.6 (0x00007f84ab434000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007f84ab21d000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f84ac971000)
aletheia7 commented 9 years ago

I'm fuzzy with the 3.2.5 comment. The problem has been happening for months and I upgraded zeromq. My directory had the 3.2.5 tgz. I've also used many versions of zeromq 4.x. I'll provide the ldd output and shell log next. Thank you for your support.

aletheia7 commented 9 years ago

ldd ~/go/bin/hang

$ ldd ~/go/bin/hang
        linux-gate.so.1 (0xb7796000)
        libzmq.so.5 => /usr/local/lib/libzmq.so.5 (0xb770e000)
        libpthread.so.0 => /lib/i386-linux-gnu/i686/cmov/libpthread.so.0 (0xb76f2000)
        libc.so.6 => /lib/i386-linux-gnu/i686/cmov/libc.so.6 (0xb7547000)
        libsodium.so.13 => /usr/local/lib/libsodium.so.13 (0xb74c6000)
        librt.so.1 => /lib/i386-linux-gnu/i686/cmov/librt.so.1 (0xb74bd000)
        libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xb73bc000)
        libm.so.6 => /lib/i386-linux-gnu/i686/cmov/libm.so.6 (0xb7376000)
        libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb7358000)
        /lib/ld-linux.so.2 (0x80029000)
$ 
aletheia7 commented 9 years ago

script

Script started on Thu 20 Aug 2015 06:20:29 PM PDT
$ hang
2015/08/20 18:20:34 Waiting ..
^C2015/08/20 18:20:37 ok to ctrl-c
2015/08/20 18:20:37 received sigc
2015/08/20 18:20:37 before AuthStop()
^C
^C^Z
[1]+  Stopped                 hang
$ kill -9 %1

[1]+  Stopped                 hang
$ exit
exit

Script done on Thu 20 Aug 2015 06:20:58 PM PDT
$ 
aletheia7 commented 9 years ago
debian linux

The system is completely updated:

apt-get update && apt-get upgrade && apt-get dist-upgrade
$ go version
go version go1.4.2 linux/386
$ uname -a
Linux incense 4.1.0-1-686-pae #1 SMP Debian 4.1.3-1 (2015-08-03) i686 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux testing (stretch)
Release:        testing
Codename:       stretch
$
aletheia7 commented 9 years ago

I also brought the above sample code to another host and confirmed the hang problem.

aletheia7 commented 9 years ago

AuthStop() hangs here https://github.com/pebbe/zmq4/blob/master/auth.go#L335

pebbe commented 9 years ago
Script started on Thu 20 Aug 2015 06:20:29 PM PDT

You are calling the program through a script? How does that look like?

aletheia7 commented 9 years ago

I only ran it in script to provide the output. I do not use script to run the hang program normally.

The script command forks a shell and redirects the output to a file called "typescript." I posted the output above. After typing exit, script appends "Script done ..." to the output.

pebbe commented 9 years ago

Please insert this line before Auth.Start() and show me the output:

z.AuthSetVerbose(true)
pebbe commented 9 years ago

What version of Go are you using? I will try that.

aletheia7 commented 9 years ago

I actually did that after I posted the script output. The AuthStop() function would print Line 333 and then hang. I added log.Println() before and after Line 333. This is how I determined Line 335 was the stopping point.

aletheia7 commented 9 years ago

Go version is above. It's go1.4.2.

aletheia7 commented 9 years ago

What system are you running?

pebbe commented 9 years ago

I was using Go 1.5 First try with Go 1.4.2 and it hangs. Now I can do some debugging.

aletheia7 commented 9 years ago

My guess is that the ctrl-c signal is being caught by the go runtime and the zeromq library at the same time. The ctrl-c is caught by the zeromq thread, and kills the auth_handler socket. When the auth_quit calls SendMessage, there is no receiving socket.

I should have run the program under unix strace at the office.

Good, 1.5 does not have a problem. Debian hasn't shipped 1.5 yet. I'll try that at the office today.

pebbe commented 9 years ago

I just pushed an update to github. Go 1.4.2 should now work without problems.

aletheia7 commented 9 years ago

I upgraded to the lastest. Here is the output:

Script started on Fri 21 Aug 2015 02:34:49 PM PDT
dev$ hang
2015/08/21 14:35:16 AUTH: Starting
2015/08/21 14:35:16 Waiting ..
2015/08/21 14:35:16 ok to ctrl-c
^C2015/08/21 14:35:17 AUTH: Quitting: interrupted system call
2015/08/21 14:35:17 AUTH: Quit
2015/08/21 14:35:17 received sigc
2015/08/21 14:35:17 before AuthStop()
2015/08/21 14:35:17 AUTH: Stopping
2015/08/21 14:35:17 AUTH: Stopping: SendMessageDontwait("QUIT"): resource temporarily unavailable
2015/08/21 14:35:17 AUTH: Stopping: RecvMessage: Operation cannot be accomplished in current state
2015/08/21 14:35:17 AUTH: Stopped
2015/08/21 14:35:17 before Term()
2015/08/21 14:35:17 After Term(), stopped
dev$ exit
exit

Script done on Fri 21 Aug 2015 02:35:19 PM PDT

Is this the expected output?

pebbe commented 9 years ago

Yes. This, or another version where Go catches the signal first.

aletheia7 commented 9 years ago

Fixed. Thank you.