nsqio / go-nsq

The official Go package for NSQ
MIT License

Connection is closed for unknown reason #244

Closed: dxn1988 closed this issue 5 years ago

dxn1988 commented 5 years ago

I have a few consumer connections in my NSQ setup. One of them closes for no apparent reason from time to time (not very frequently), and the connection is only restored after restarting the nsq service.

Here's the error log:

2019/01/25 10:05:43 [I] 2 ERR    1 [blackboxdjson/gjjzxzl] (***************:9150) IO error - write tcp 192.168.0.103:34208->**********:9150: write: broken pipe 
2019/01/25 10:05:43 [I] 2 ERR    1 [blackboxdjson/gjjzxzl] (***************:9150) error sending command FIN 0aeeacedfe4c3002 - write tcp 192.168.0.103:34208->*******************:9150: write: broken pipe 
2019/01/25 10:05:43 [I] 2 INF    1 [blackboxdjson/gjjzxzl] (***************:9150) beginning close 
2019/01/25 10:05:43 [I] 2 INF    1 [blackboxdjson/gjjzxzl] (***************:9150) breaking out of writeLoop 
2019/01/25 10:05:43 [I] 2 INF    1 [blackboxdjson/gjjzxzl] (***************:9150) writeLoop exiting 
2019/01/25 10:05:44 [I] 2 WRN    1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight 
2019/01/25 10:05:45 [I] 2 WRN    1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight 
2019/01/25 10:05:46 [I] 2 WRN    1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight 
2019/01/25 10:05:47 [I] 2 WRN    1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight 
2019/01/25 10:05:48 [I] 2 WRN    1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight 
2019/01/25 10:05:49 [I] 2 WRN    1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight 
2019/01/25 10:05:50 [I] 2 WRN    1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight 
2019/01/25 10:05:52 [I] 2 WRN    1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight 
2019/01/25 10:05:53 [I] 2 WRN    1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight 
2019/01/25 10:05:54 [I] 2 WRN    1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight 
2019/01/25 10:05:55 [I] 2 WRN    1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight 
2019/01/25 10:05:56 [I] 2 WRN    1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight 
2019/01/25 10:05:57 [I] 2 WRN    1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight 
2019/01/25 10:05:58 [I] 2 WRN    1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight 
2019/01/25 10:05:59 [I] 2 WRN    1 [blackboxdjson/gjjzxzl] (***************:9150) draining... waiting for 51 messages in flight 

I have no idea why the connection is closed, but I have a few hypotheses:

  1. The receiving server is unstable or offline. I tested this scenario: the Connections count in nsqadmin remained 1 after I took my test server offline, but for the problem above the Connections count is 0.

  2. Data on this connection is usually larger than on other connections, but still far from the limit set in OutputBufferSize (see the sketch below).
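
For context, OutputBufferSize and the topic/channel in the logs map onto go-nsq's Config and consumer setup. Here is a minimal, hypothetical sketch of such a consumer; the nsqd address is a placeholder and the handler is a stub:

```go
package main

import (
	"log"
	"time"

	"github.com/nsqio/go-nsq"
)

func main() {
	cfg := nsq.NewConfig()
	// output_buffer_size is negotiated with nsqd during IDENTIFY and
	// controls how much data nsqd buffers before flushing to this client.
	// The values here are illustrative, not recommendations.
	cfg.OutputBufferSize = 64 * 1024
	cfg.OutputBufferTimeout = 250 * time.Millisecond

	// Topic/channel taken from the logs above.
	consumer, err := nsq.NewConsumer("blackboxdjson", "gjjzxzl", cfg)
	if err != nil {
		log.Fatal(err)
	}
	consumer.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
		// process m.Body here; returning nil sends FIN, an error sends REQ
		return nil
	}))

	// Placeholder nsqd address.
	if err := consumer.ConnectToNSQD("127.0.0.1:4150"); err != nil {
		log.Fatal(err)
	}
	<-consumer.StopChan
}
```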

ploxiln commented 5 years ago

Hmm. So, if you kill power to the server, the other side of the TCP connection (nsqd) does not know, and it takes a while to decide that the TCP connection is gone (a very long while if TCP keep-alives are not enabled). If just the process crashes or exits somehow, then the OS kernel closes its file descriptors, and thus the TCP connection, which involves sending a FIN or RST, so in that case the other side (nsqd) does know right away.

But in this case the consumer process is still running, it notices that the connection was closed, and nsqd also notices pretty quickly that it is closed. So this is something else, and I don't have any great ideas what. Maybe check "sudo dmesg". Maybe just replace this server and see if this issue ever comes back :)
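
To illustrate the keep-alive mechanism mentioned above, here is a generic Go sketch of enabling OS-level TCP keep-alives on a plain connection. This is not go-nsq internals, and the address is a placeholder:

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// A Dialer with KeepAlive > 0 enables OS-level TCP keep-alive probes
	// on the resulting connection, so a dead peer (e.g. powered off) is
	// detected after roughly the keep-alive period plus probe retries,
	// instead of the connection lingering indefinitely.
	d := net.Dialer{
		Timeout:   5 * time.Second,
		KeepAlive: 30 * time.Second,
	}
	conn, err := d.Dial("tcp", "127.0.0.1:4150") // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Equivalent explicit form on an existing TCP connection:
	if tc, ok := conn.(*net.TCPConn); ok {
		tc.SetKeepAlive(true)
		tc.SetKeepAlivePeriod(30 * time.Second)
	}
}
```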

Unfortunately this is probably the kind of thing we can't debug from here. If you can debug it and offer a fix with a convincing explanation, we can evaluate that and may accept it. But there may not be anything we can do, sorry.

mreiferson commented 5 years ago

Gonna close, but please re-open if you find any indication that this could be a bug in nsqd or go-nsq.

dxn1988 commented 5 years ago

@ploxiln Thanks for your reply.

The server our NSQ service pushes data to is our client's business server, so it's highly unlikely that the cause is the server being offline or other issues related to the server's status.

Another thing I didn't mention above is the max_in_flight setting: we set max_in_flight to 10, but we are not sure whether our client's server can handle all our data promptly, nor how good the connection quality is. Could this have something to do with the problem?

Also, because of some 'external' reasons, we can't get a technician from our client to do the necessary debugging or testing for us :(

ploxiln commented 5 years ago

Ah - then my best guess is that a stateful firewall is involved. A max_in_flight that is "too high" would probably just result in messages timing out.
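
If a stateful firewall with an idle timeout is dropping the connection, one possible mitigation is shortening go-nsq's heartbeat interval so the connection never looks idle. A minimal sketch, with illustrative (not recommended) values and a hypothetical package name:

```go
// Package tuning sketches the go-nsq Config knobs discussed above.
package tuning

import (
	"time"

	"github.com/nsqio/go-nsq"
)

func newConsumerConfig() *nsq.Config {
	cfg := nsq.NewConfig()

	// nsqd sends heartbeats at the negotiated interval and the client
	// answers with NOP, which keeps traffic flowing through stateful
	// middleboxes. go-nsq's default is 30s; the 10s here is purely
	// illustrative.
	cfg.HeartbeatInterval = 10 * time.Second

	// How long nsqd waits before requeueing an unacknowledged message;
	// this is the "messages timing out" behavior mentioned above.
	cfg.MsgTimeout = 60 * time.Second

	// Per-consumer ceiling on unacknowledged messages (the RDY count).
	cfg.MaxInFlight = 10

	return cfg
}
```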

opvexe commented 1 year ago

Can I ask how you solved it? I ran into the same problem: after running for a while, it sometimes fails the same way yours did.

ERR    2 [QuizletUnSolveStep/Chegg] (8.219.156.21:4150) IO error - write tcp 192.168.31.108:61023->8.219.156.21:4150: write: broken pipe
ERR    2 [QuizletUnSolveStep/Chegg] (8.219.156.21:4150) error sending command FIN 11e68f9fb8ec4000 - write tcp 192.168.31.108:61023->8.219.156.21:4150: write: broken pipe
INF    2 [QuizletUnSolveStep/Chegg] (8.219.156.21:4150) beginning close
INF    2 [QuizletUnSolveStep/Chegg] (8.219.156.21:4150) breaking out of writeLoop
INF    2 [QuizletUnSolveStep/Chegg] (8.219.156.21:4150) writeLoop exiting
WRN    2 [QuizletUnSolveStep/Chegg] (8.219.156.21:4150) delaying close, 2 outstanding messages
INF    2 [QuizletUnSolveStep/Chegg] (8.219.156.21:4150) readLoop exiting
WRN    2 [QuizletUnSolveStep/Chegg] (8.219.156.21:4150) draining... waiting for 2 messages in flight
WRN    2 [QuizletUnSolveStep/Chegg] (8.219.156.21:4150) draining... waiting for 2 messages in flight
