Closed chazsconi closed 5 years ago
Sorry for the delay in updating the comments. I am currently investigating a problem which causes Hackney Manager to crash when restarting watches after receiving a GONE event.
I am trying to change the stop_watch function from GenServer.cast to GenServer.call to ensure old watches are actually stopped before new ones are started. But do you have any other ideas about the cause of this?
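For context, GenServer.cast/2 returns immediately while GenServer.call/2 blocks until the server has handled the message, which is why switching stop_watch to a call guarantees the old watch is down before a new one starts. A minimal sketch of the difference (module and function names are hypothetical, not the actual Kazan code):

```elixir
defmodule WatcherSketch do
  use GenServer

  # Hypothetical stop_watch variants. With cast/2 the caller continues
  # immediately, so a new watch can start before the old one is down.
  # With call/2 the caller blocks until the watcher replies.
  def stop_watch_cast(pid), do: GenServer.cast(pid, :stop)
  def stop_watch_call(pid), do: GenServer.call(pid, :stop)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_cast(:stop, state), do: {:stop, :normal, state}

  @impl true
  def handle_call(:stop, _from, state) do
    # Reply :ok, then terminate; the caller only resumes once the
    # reply has been sent, so the stop is effectively synchronous.
    {:stop, :normal, :ok, state}
  end
end
```

With the cast variant, stop_watch can return while the old watcher is still alive, which is exactly the restart race described above.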
Logs below:
07:51:11.936 [info] Stopping watch #PID<0.28502.0>
07:51:11.937 [info] Stopping watch #PID<0.28494.0>
07:51:11.937 [info] Stopping watch #PID<0.28498.0>
07:51:16.951 [error] GenServer #PID<0.28490.0> terminating
** (stop) exited in: :gen_server.call(:hackney_manager, {:start_async_response, #Reference<0.2232732825.1133772801.48337>, #PID<0.29311.0>, {:client, {1531, 288271, 934563}, {:metrics_ng, :metrics_dummy}, :hackney_ssl, 'kubernetes-env-api.ams1.xing.com', 6443, "kubernetes-env-api.ams1.xing.com:6443", [:async, {:stream_to, #PID<0.29311.0>}, {:ssl_options, [cacerts: [<<48, 130, 5, 162, 48, 130, 3, 138, 160, 3, 2, 1, 2, 2, 9, 0, 187, 44, 64, 67, 193, 243, 161, 87, 48, 13, 6, 9, 42, 134, 72, ...>>]]}, {:recv_timeout, 300000}], {:sslsocket, {:gen_tcp, #Port<0.53061>, :tls_connection, :undefined}, #PID<0.29313.0>}, {:default, #Reference<0.2232732825.1133772801.48337>, {'kubernetes-env-api.ams1.xing.com', 6443, :hackney_ssl}, #PID<0.2230.0>, :hackney_ssl}, #Reference<0.2232732825.1133772801.48337>, true, :hackney_pool, 300000, false, 5, false, 5, nil, nil, nil, {0, {:dict, 0, 16, 16, 8, 80, 48, {[], [], [], [], [], [], [], [], [], [], [], [], [], [], ...}, {{[], [], [], [], [], [], [], [], [], [], [], [], ...}}}}, :connected, :waiting, nil, :normal, false, true, false, :undefined, #PID<0.29311.0>, &:hackney_request.send/2, :waiting, nil, 4096, "", [], :undefined, nil, nil, nil, "GET", "/apis/batch/v1/namespaces/xb-preview-xb2-charles-bernasconi/jobs?resourceVersion=56337112&watch=true", nil}})
** (EXIT) time out
(stdlib) gen_server.erl:206: :gen_server.call/2
(hackney) /build/deps/hackney/src/hackney_manager.erl:160: :hackney_manager.start_async_response/1
(hackney) /build/deps/hackney/src/hackney_response.erl:46: :hackney_response.start_response/1
(hackney) /build/deps/hackney/src/hackney.erl:359: :hackney.send_request/2
(httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
(kazan) lib/kazan/client/imp.ex:62: Kazan.Client.Imp.run/2
(kazan) lib/kazan/watcher.ex:251: Kazan.Watcher.start_request/1
(kazan) lib/kazan/watcher.ex:213: Kazan.Watcher.handle_info/2
Oh no, not hackney problems... I've had quite a lot of issues with hackney's async code in the past; it does not seem to be as reliable as its synchronous code.
Don't know precisely what's going on here, but happy to try and help diagnose. So:
- What version of hackney are you using?
- Can you share some reproduction code?
- Have you used observer to see what the hackney_manager process is up to? I have had cases where it seems to get deadlocked somehow and stop responding to messages altogether. That could cause this I assume, though it might be a different issue.
Thanks for the tips. However, I removed the calls to :hackney.stop_async and this seems to have solved the problem. I think Hackney monitors the calling process and kills itself when it sees the caller die, so the :hackney.stop_async was probably unnecessary and was causing a race condition. I've made a couple of other minor tweaks as well, but I want to ensure everything is running smoothly in production for a few days before I update the PR.
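That matches the common understanding of hackney's behaviour: hackney_manager monitors the process owning each request and cleans up when that owner dies, so an explicit stop can race against the monitor-driven cleanup. A minimal sketch of the pattern (hypothetical module, not hackney's actual code):

```elixir
defmodule ManagerSketch do
  use GenServer

  # Hypothetical manager that, like hackney_manager, monitors the
  # process owning each request and cleans up when that owner dies.

  @impl true
  def init(refs), do: {:ok, refs}

  @impl true
  def handle_call({:start_request, owner}, _from, refs) do
    # Watch the calling process for the lifetime of its request.
    ref = Process.monitor(owner)
    {:reply, :ok, Map.put(refs, ref, owner)}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, refs) do
    # Owner died: drop its request automatically. An explicit stop
    # issued around the same time would race against this cleanup path.
    {:noreply, Map.delete(refs, ref)}
  end
end
```

Once the owner's exit triggers the :DOWN cleanup, an additional explicit stop is redundant, which is consistent with removing :hackney.stop_async helping here.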
Any luck running this in production @chazsconi?
@obmarg Seems much better than before. When I get a chance I'll update the PR with your comments and the hackney changes.
Sorry for the delay.
Thanks for the update @chazsconi and no rush. Just wanted to check in before I release today, but this can always go out in a future release 👍
@obmarg I've finally updated the PR with your comments. Sorry for the long delay. There are also a few other changes since the original PR to address issues I found by running this in production (which was a fork of these commits from 749b252).
Not sure why the tests are failing - I don't think it is related to the changes in this PR.
@obmarg I am also seeing random :hackney_manager terminating errors when using the stream_to option. Have you been able to find a way to permanently fix this issue when it happened for you?
@emilsoman It was a while ago I was having these issues, and the fixes in the issues I linked solved it for me. Are these issues you're seeing when using kazan, or did you just stumble across this PR independently?
Thanks again for the contribution @chazsconi - I'll figure out why the tests are failing and fix that separately.
@obmarg thanks a lot for merging
@emilsoman The problems I previously saw with Hackney are, I think, fixed in this PR - removing the call to :hackney.stop_async(id) seemed to help.
Just released this as part of v0.10.0
As documented in Kubernetes, a watch on a resource can return a 410 (Gone) response code (https://kubernetes.io/docs/reference/api-concepts/). In this case the consumer needs to be notified and resynchronise.
This is a breaking change, as the messages now sent to the consumer are a different struct so that the from can be included, allowing the consumer to know which watch each message corresponds to.
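To illustrate the shape of the change (struct and field names here are hypothetical apart from from, which the description above mentions): each message carries the watch it came from, so on a 410/GONE event the consumer knows which watch to resynchronise.

```elixir
defmodule WatchEvent do
  # Hypothetical message struct: `from` identifies the originating
  # watch so a consumer running several watches can tell them apart.
  defstruct [:from, :type, :object]
end

defmodule Consumer do
  # On a 410 (GONE) from the API server the consumer must re-list the
  # resource and restart the watch the event came from.
  def handle(%WatchEvent{type: :gone, from: watch}), do: {:resync, watch}
  def handle(%WatchEvent{type: type, object: object}), do: {:ok, type, object}
end
```

Without from in the message, a consumer running multiple watches could not tell which one hit the 410 and needed to be restarted.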