leporo / tornado-redis

Asynchronous Redis client that works within Tornado IO loop.

Server hanging at FUTEX_WAIT_PRIVATE #88

Open kylehuang17 opened 9 years ago

kylehuang17 commented 9 years ago

Hi, I use tornado-redis to subscribe to chat messages asynchronously, but after the server has been running for a while (minutes, sometimes hours or days), it hangs without any error. I attached strace to the process with $ strace -p "The Process PID" and found it stuck in futex(0x7ffff64b3e00, FUTEX_WAIT_PRIVATE, 2, NULL) = 0.
Can anyone tell me what is causing this issue, and how to solve it? Others have solved similar situations by setting a connection timeout... Would setting the stay_after be useful?

Thanks for your help.

blodone commented 9 years ago

I'm having similar problems, though in my case the whole server doesn't hang; only the Redis subscription does. In particular, messages published to Redis stop reaching Tornado.

kylehuang17 commented 9 years ago

My situation is the same: only some of the processes hang. I still haven't found a solution, even after tuning the Linux network kernel parameters and the Redis parameters. So I changed the transfer protocol from plain WebSocket to SockJS, which is similar to Socket.IO: it uses WebSocket by default and falls back to comet techniques in network environments where WebSocket doesn't work. Unlike the WebSocket example, my setup now uses only one tornado-redis socket to subscribe to the messages of all channels, and the issue never happens. My guess is that something goes wrong when too many Redis subscriber clients exist.
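The "one shared subscriber" arrangement described above can be sketched as a small in-process broker: a single Redis SUBSCRIBE connection feeds it, and it fans messages out to local listeners. This is only an illustration of the pattern; the class and method names below are hypothetical and not part of tornado-redis.

```python
class SubscriptionBroker(object):
    """One Redis subscriber connection shared by many local clients
    (e.g. WebSocket handlers), instead of one connection per client."""

    def __init__(self):
        # channel name -> set of local listener callbacks
        self._listeners = {}

    def subscribe(self, channel, callback):
        # Only the first local subscriber for a channel would trigger
        # a real SUBSCRIBE on the shared Redis connection.
        first = channel not in self._listeners
        self._listeners.setdefault(channel, set()).add(callback)
        return first

    def unsubscribe(self, channel, callback):
        listeners = self._listeners.get(channel, set())
        listeners.discard(callback)
        if not listeners:
            # Last local subscriber gone: a real implementation would
            # UNSUBSCRIBE the channel on the shared connection here.
            self._listeners.pop(channel, None)

    def dispatch(self, channel, message):
        # Called from the single Redis connection's message handler;
        # copy the set so listeners may unsubscribe during dispatch.
        for callback in list(self._listeners.get(channel, ())):
            callback(message)
```

With this shape, the number of Redis connections stays constant no matter how many WebSocket clients connect, which matches the observation that the hang only appeared with many subscriber clients.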

blodone commented 9 years ago

I use swampdragon, which includes tornado-redis but uses SockJS as the protocol. I created a heartbeat checker that tests publishing through Tornado every 10 seconds, and if publishing stops working it restarts the whole Tornado server. It kind of works... but I think it should work reliably without restarting the whole thing and the hassle of re-subscribing every client...
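The heartbeat workaround above boils down to: publish a ping through the pub/sub path, and if it doesn't echo back within a deadline, declare the path dead and restart. A minimal sketch of that watchdog logic, with hypothetical names and the clock injected so it can be exercised without real timers:

```python
import time


class PubSubWatchdog(object):
    """Tracks whether heartbeat pings published through Redis are
    still echoing back to the subscriber within a timeout."""

    def __init__(self, timeout=10.0, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._last_echo = clock()

    def on_echo(self):
        # Call this from the subscriber side whenever the heartbeat
        # message arrives back through the pub/sub path.
        self._last_echo = self._clock()

    def is_dead(self):
        # True once no ping has echoed back within the timeout; the
        # caller would then restart the server (or just reconnect).
        return self._clock() - self._last_echo > self._timeout
```

In a Tornado app one would publish the ping and call is_dead() from a PeriodicCallback; restarting only the Redis connection, rather than the whole server, would avoid the client re-subscription hassle mentioned above.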

Tinche commented 9 years ago

We're getting this as well. The tornado-redis client destructor seems to be to blame. I've managed to attach to a blocked process with Pyrasite and get the following stack trace:

  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/ngs_webserver/server.py", line 117, in <module>
    main()
  File "/root/ngs_webserver/server.py", line 113, in main
    tornado.ioloop.IOLoop.instance().start()
  File "/root/.venv/local/lib/python2.7/site-packages/zmq/eventloop/ioloop.py", line 151, in start
    super(ZMQIOLoop, self).start()
  File "/root/.venv/local/lib/python2.7/site-packages/tornado/ioloop.py", line 781, in start
    self._run_callback(callback)
  File "/root/.venv/local/lib/python2.7/site-packages/tornado/ioloop.py", line 565, in _run_callback
    ret = callback()
  File "/root/.venv/local/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 508, in wrapper
    return callback(*args)
  File "/root/.venv/local/lib/python2.7/site-packages/tornado/stack_context.py", line 314, in wrapped
    ret = fn(*args, **kwargs)
  File "/root/.venv/local/lib/python2.7/site-packages/tornado/websocket.py", line 530, in _on_frame_start
    self.stream.read_bytes(2, self._on_frame_length_16)
  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 284, in read_bytes
    self._try_inline_read()
  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 653, in _try_inline_read
    self._read_from_buffer(pos)
  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 720, in _read_from_buffer
    self._run_read_callback(pos, False)
  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 636, in _run_read_callback
    self._run_callback(callback, self._consume(size))
  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 537, in _run_callback
    self.io_loop.add_callback(wrapper)
  File "/root/.venv/local/lib/python2.7/site-packages/tornado/ioloop.py", line 888, in add_callback
    stack_context.wrap(callback), *args, **kwargs))
  File "/root/.venv/local/lib/python2.7/site-packages/tornadoredis/client.py", line 260, in __del__
    connection.disconnect()
  File "/root/.venv/local/lib/python2.7/site-packages/tornadoredis/connection.py", line 97, in disconnect
    s.close()
  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 388, in close
    self._maybe_run_close_callback()
  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 416, in _maybe_run_close_callback
    self._run_callback(cb)
  File "/root/.venv/local/lib/python2.7/site-packages/tornado/iostream.py", line 537, in _run_callback
    self.io_loop.add_callback(wrapper)
  File "/root/.venv/local/lib/python2.7/site-packages/tornado/ioloop.py", line 883, in add_callback
    with self._callback_lock:

So the client tries to close its connection in its destructor, and the destructor runs while another callback is being added, at a point where the IO loop's non-reentrant callback lock is already held by the same thread, hence the deadlock.
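The failure mode in the trace above can be reduced to a few lines: a plain (non-reentrant) lock is held while scheduling a callback, garbage collection fires a destructor at exactly that point, and the destructor tries to take the same lock on the same thread. The sketch below illustrates only the locking mechanics, with hypothetical names; it uses a non-blocking acquire so the deadlock shows up as False instead of an actual hang.

```python
import threading

# Stand-in for the IOLoop's internal callback lock, which in the
# trace above is a plain, non-reentrant lock.
callback_lock = threading.Lock()


def try_add_callback():
    # Stand-in for IOLoop.add_callback, except it uses a non-blocking
    # acquire; the real code blocks forever, which is the
    # FUTEX_WAIT_PRIVATE state seen in strace.
    acquired = callback_lock.acquire(blocking=False)
    if acquired:
        callback_lock.release()
    return acquired


def scheduled_path():
    # Normal case: the lock is free, scheduling succeeds.
    return try_add_callback()


def destructor_path():
    # Destructor case: __del__ runs while add_callback already holds
    # the lock on this very thread, and __del__ itself ends up in
    # add_callback again (via disconnect -> stream.close).
    callback_lock.acquire()
    try:
        return try_add_callback()
    finally:
        callback_lock.release()
```

This is why doing real I/O or scheduling work from __del__ is dangerous: the interpreter may run it at any allocation point, including while non-reentrant locks are held. A reentrant lock, or deferring the disconnect instead of performing it in the destructor, would avoid the hang.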