Closed chazsconi closed 5 years ago
Sorry for the delay in updating the comments. I am currently investigating a problem which causes Hackney Manager to crash when restarting watches after receiving a GONE event.
I am trying to change the stop_watch function from GenServer.cast to GenServer.call to ensure old watches are actually stopped before new ones are started. But do you have any other ideas about the cause of this?
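For context, GenServer.cast/2 returns immediately while GenServer.call/2 blocks until the server has handled the message, which is why switching stop_watch to a call guarantees the old watch is down before a new one starts. A minimal sketch of the difference (module and function names are hypothetical, not the actual Kazan code):

```elixir
defmodule WatcherSketch do
  use GenServer

  # Hypothetical stop_watch variants. With cast/2 the caller continues
  # immediately, so a new watch can start before the old one is down.
  # With call/2 the caller blocks until the watcher replies.
  def stop_watch_cast(pid), do: GenServer.cast(pid, :stop)
  def stop_watch_call(pid), do: GenServer.call(pid, :stop)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_cast(:stop, state), do: {:stop, :normal, state}

  @impl true
  def handle_call(:stop, _from, state) do
    # Reply :ok, then terminate; the caller only resumes once the
    # reply has been sent, so the stop is effectively synchronous.
    {:stop, :normal, :ok, state}
  end
end
```

With the cast variant, stop_watch can return while the old watcher is still alive, which is exactly the restart race described above.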
Logs below:
07:51:11.936 [info] Stopping watch #PID<0.28502.0>
07:51:11.937 [info] Stopping watch #PID<0.28494.0>
07:51:11.937 [info] Stopping watch #PID<0.28498.0>
07:51:16.951 [error] GenServer #PID<0.28490.0> terminating
** (stop) exited in: :gen_server.call(:hackney_manager, {:start_async_response, #Reference<0.2232732825.1133772801.48337>, #PID<0.29311.0>, {:client, {1531, 288271, 934563}, {:metrics_ng, :metrics_dummy}, :hackney_ssl, 'kubernetes-env-api.ams1.xing.com', 6443, "kubernetes-env-api.ams1.xing.com:6443", [:async, {:stream_to, #PID<0.29311.0>}, {:ssl_options, [cacerts: [<<48, 130, 5, 162, 48, 130, 3, 138, 160, 3, 2, 1, 2, 2, 9, 0, 187, 44, 64, 67, 193, 243, 161, 87, 48, 13, 6, 9, 42, 134, 72, ...>>]]}, {:recv_timeout, 300000}], {:sslsocket, {:gen_tcp, #Port<0.53061>, :tls_connection, :undefined}, #PID<0.29313.0>}, {:default, #Reference<0.2232732825.1133772801.48337>, {'kubernetes-env-api.ams1.xing.com', 6443, :hackney_ssl}, #PID<0.2230.0>, :hackney_ssl}, #Reference<0.2232732825.1133772801.48337>, true, :hackney_pool, 300000, false, 5, false, 5, nil, nil, nil, {0, {:dict, 0, 16, 16, 8, 80, 48, {[], [], [], [], [], [], [], [], [], [], [], [], [], [], ...}, {{[], [], [], [], [], [], [], [], [], [], [], [], ...}}}}, :connected, :waiting, nil, :normal, false, true, false, :undefined, #PID<0.29311.0>, &:hackney_request.send/2, :waiting, nil, 4096, "", [], :undefined, nil, nil, nil, "GET", "/apis/batch/v1/namespaces/xb-preview-xb2-charles-bernasconi/jobs?resourceVersion=56337112&watch=true", nil}})
** (EXIT) time out
(stdlib) gen_server.erl:206: :gen_server.call/2
(hackney) /build/deps/hackney/src/hackney_manager.erl:160: :hackney_manager.start_async_response/1
(hackney) /build/deps/hackney/src/hackney_response.erl:46: :hackney_response.start_response/1
(hackney) /build/deps/hackney/src/hackney.erl:359: :hackney.send_request/2
(httpoison) lib/httpoison/base.ex:439: HTTPoison.Base.request/9
(kazan) lib/kazan/client/imp.ex:62: Kazan.Client.Imp.run/2
(kazan) lib/kazan/watcher.ex:251: Kazan.Watcher.start_request/1
(kazan) lib/kazan/watcher.ex:213: Kazan.Watcher.handle_info/2
Oh no, not hackney problems... I've had quite a lot of issues with hackney's async code in the past; it does not seem to be as reliable as its synchronous code.
Don't know precisely what's going on here, but happy to try and help diagnose. So:
- What version of hackney are you using?
- Can you share some reproduction code?
- Have you used observer to see what the hackney_manager process is up to? I have had cases where it seems to get deadlocked somehow and stop responding to messages altogether. That could cause this I assume, though it might be a different issue.
Thanks for the tips. However, I removed the calls to :hackney.stop_async and this seems to have solved the problem. I think Hackney monitors the calling process and kills itself when it sees the caller die, so the :hackney.stop_async was probably unnecessary and was causing a race condition. I've made a couple of other minor tweaks as well, but I want to ensure everything is running smoothly in production for a few days before I update the PR.
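That matches the common understanding of hackney's behaviour: hackney_manager monitors the process owning each request and cleans up when that owner dies, so an explicit stop can race against the monitor-driven cleanup. A minimal sketch of the pattern (hypothetical module, not hackney's actual code):

```elixir
defmodule ManagerSketch do
  use GenServer

  # Hypothetical manager that, like hackney_manager, monitors the
  # process owning each request and cleans up when that owner dies.

  @impl true
  def init(refs), do: {:ok, refs}

  @impl true
  def handle_call({:start_request, owner}, _from, refs) do
    # Watch the calling process for the lifetime of its request.
    ref = Process.monitor(owner)
    {:reply, :ok, Map.put(refs, ref, owner)}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, refs) do
    # Owner died: drop its request automatically. An explicit stop
    # issued around the same time would race against this cleanup path.
    {:noreply, Map.delete(refs, ref)}
  end
end
```

Once the owner's exit triggers the :DOWN cleanup, an additional explicit stop is redundant, which is consistent with removing :hackney.stop_async helping here.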
Any luck running this in production @chazsconi?
@obmarg Seems much better than before. When I get a chance I'll update the PR with your comments and the hackney changes.
Sorry for the delay.
Thanks for the update @chazsconi and no rush. Just wanted to check in before I release today, but this can always go out in a future release 👍
@obmarg I've finally updated the PR with your comments. Sorry for the long delay. There are also a few other changes since the original PR to address issues I found by running this in production (which was a fork of these commits from 749b252).
Not sure why the tests are failing - I don't think it is related to the changes in this PR.
@obmarg I am also seeing random :hackney_manager terminating errors when using the stream_to option. Have you been able to find a way to permanently fix this issue when it happened for you?
@emilsoman It was a while ago I was having these issues, and the fixes in the issues I linked solved it for me. Are these issues you're seeing when using kazan, or did you just stumble across this PR independently?
Thanks again for the contribution @chazsconi - I'll figure out why the tests are failing and fix that separately.
@obmarg thanks a lot for merging
@emilsoman The problems I previously saw with Hackney are, I think, fixed in this PR - removing the call to :hackney.stop_async(id) seemed to help.
Just released this as part of v0.10.0
As documented in Kubernetes, a watch on a resource can return a 410 (Gone) response code (https://kubernetes.io/docs/reference/api-concepts/). In this case the consumer needs to be notified and resynchronise.
This is a breaking change, as the messages now sent to the consumer are a different struct so that the from can be included, allowing the consumer to know which watch each message corresponds to.
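To illustrate the shape of the change (struct and field names here are hypothetical apart from from, which the description above mentions): each message carries the watch it came from, so on a 410/GONE event the consumer knows which watch to resynchronise.

```elixir
defmodule WatchEvent do
  # Hypothetical message struct: `from` identifies the originating
  # watch so a consumer running several watches can tell them apart.
  defstruct [:from, :type, :object]
end

defmodule Consumer do
  # On a 410 (GONE) from the API server the consumer must re-list the
  # resource and restart the watch the event came from.
  def handle(%WatchEvent{type: :gone, from: watch}), do: {:resync, watch}
  def handle(%WatchEvent{type: type, object: object}), do: {:ok, type, object}
end
```

Without from in the message, a consumer running multiple watches could not tell which one hit the 410 and needed to be restarted.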