Closed dxn1988 closed 5 years ago
Hmm. So, if you kill power to the server, the other side of the tcp connection (nsqd) does not know, and it takes a while to decide that TCP connection is gone (a very long while if tcp keep-alives are not enabled). If just the process crashes or exits somehow, then the OS kernel closes those file descriptors, and thus the tcp connection, which involves sending a FIN or RST, so in this case the other side (nsqd) does know right away. But in this case, the consumer process is still running, and it notices that the connection was closed, and nsqd also notices pretty quickly that it is closed. So this is something else, and I don't have any great ideas what. Maybe check "sudo dmesg". Maybe just replace this server and see if this issue ever comes back :)
Unfortunately this is probably the kind thing where we can't debug it. If you can debug it, and offer a fix with a convincing explanation, we can judge that and may accept it. But there may not be anything we can do, sorry.
Gonna close, but please re-open if you find any indication that this could be a bug in nsqd
or go-nsq
.
@ploxiln Thanks for your reply.
The server our nsq service pushes data to is our client's business server, so it's highly unlikely because the server is offline or other issues related with the server status.
Another thing I didn't mention above is the max_in_flight
setting, we set max_in_flight
to 10 but we are not sure whether our client's server could handle all our data immediately and also the quality of connection. Could this have something to do with the problem?
Also because of some 'external' reasons, we can't get a technician from our client to do necessary debugging or testing for us :(
Ah - then my best guess is a stateful firewall is involved. max_in_flight
"too high" would probably just result in messages timing out.
Can I ask, how did you solve it? I found the same problem, sometimes running for a while will have the same problem as you.
letUnSolveStep/Chegg] (8.219.156.21:4150) IO error - write tcp 192.168.31.108:61023->8.219.156.21:4150: write: broken pipe
ERR 2 [QuizletUnSolveStep/Chegg] (8.219.156.21:4150) error sending command FIN 11e68f9fb8ec4000 - write tcp 192.168.31.108:61023->8.219.156.21:4150: write: broken pipe
INF 2 [QuizletUnSolveStep/Chegg] (8.219.156.21:4150) beginning close
INF 2 [QuizletUnSolveStep/Chegg] (8.219.156.21:4150) breaking out of writeLoop
INF 2 [QuizletUnSolveStep/Chegg] (8.219.156.21:4150) writeLoop exiting
WRN 2 [QuizletUnSolveStep/Chegg] (8.219.156.21:4150) delaying close, 2 outstanding messages
INF 2 [QuizletUnSolveStep/Chegg] (8.219.156.21:4150) readLoop exiting
WRN 2 [QuizletUnSolveStep/Chegg] (8.219.156.21:4150) draining... waiting for 2 messages in flight
WRN 2 [QuizletUnSolveStep/Chegg] (8.219.156.21:4150) draining... waiting for 2 messages in flight
I have few consumer connections in my nsq, one of them will close for no reason at some time but not very frequently, and the connection will be resumed after restarting nsq service.
Here's the error log:
2019/01/25 10:05:43 �[1;34m[I] 2 ERR 1 [blackboxdjson/gjjzxzl] (***************:9150) IO error - write tcp 192.168.0.103:34208->**********:9150: write: broken pipe �[0m 2019/01/25 10:05:43 �[1;34m[I] 2 ERR 1 [blackboxdjson/gjjzxzl] (***************:9150) error sending command FIN 0aeeacedfe4c3002 - write tcp 192.168.0.103:34208->*******************:9150: write: broken pipe �[0m 2019/01/25 10:05:43 �[1;34m[I] 2 INF 1 [blackboxdjson/gjjzxzl] (***************:9150) beginning close �[0m 2019/01/25 10:05:43 �[1;34m[I] 2 INF 1 [blackboxdjson/gjjzxzl] (***************:9150) breaking out of writeLoop �[0m 2019/01/25 10:05:43 �[1;34m[I] 2 INF 1 [blackboxdjson/gjjzxzl] (***************:9150) writeLoop exiting �[0m 2019/01/25 10:05:44 �[1;34m[I] 2 WRN 1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight �[0m 2019/01/25 10:05:45 �[1;34m[I] 2 WRN 1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight �[0m 2019/01/25 10:05:46 �[1;34m[I] 2 WRN 1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight �[0m 2019/01/25 10:05:47 �[1;34m[I] 2 WRN 1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight �[0m 2019/01/25 10:05:48 �[1;34m[I] 2 WRN 1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight �[0m 2019/01/25 10:05:49 �[1;34m[I] 2 WRN 1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight �[0m 2019/01/25 10:05:50 �[1;34m[I] 2 WRN 1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight �[0m 2019/01/25 10:05:52 �[1;34m[I] 2 WRN 1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight �[0m 2019/01/25 10:05:53 �[1;34m[I] 2 WRN 1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight �[0m 2019/01/25 10:05:54 �[1;34m[I] 2 WRN 1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight �[0m 2019/01/25 10:05:55 �[1;34m[I] 2 WRN 1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight �[0m 2019/01/25 10:05:56 �[1;34m[I] 2 WRN 1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight �[0m 2019/01/25 10:05:57 �[1;34m[I] 2 WRN 1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight �[0m 2019/01/25 10:05:58 �[1;34m[I] 2 WRN 1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight �[0m 2019/01/25 10:05:59 �[1;34m[I] 2 WRN 1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight �[0m
I have no idea why the connection is closed but I have few assumptions:
- The receiver server is not stable or offline. I've done some test about this scenario, the
Connections
in nsqadmin remains 1 after I make my test server offline. But for the problem above, theConnections
is 0.- Data for this connection is usually larger than other connections, but still far from reaching the limit set in
OutputBufferSize
.@ploxiln Thanks for your reply.
The server our nsq service pushes data to is our client's business server, so it's highly unlikely because the server is offline or other issues related with the server status.
Another thing I didn't mention above is the
max_in_flight
setting, we setmax_in_flight
to 10 but we are not sure whether our client's server could handle all our data immediately and also the quality of connection. Could this have something to do with the problem?Also because of some 'external' reasons, we can't get a technician from our client to do necessary debugging or testing for us :(
I have few consumer connections in my nsq, one of them will close for no reason at some time but not very frequently, and the connection will be resumed after restarting nsq service.
Here's the error log:
I have no idea why the connection is closed but I have few assumptions:
The receiver server is not stable or offline. I've done some test about this scenario, the
Connections
in nsqadmin remains 1 after I make my test server offline. But for the problem above, theConnections
is 0.Data for this connection is usually larger than other connections, but still far from reaching the limit set in
OutputBufferSize
.