obmarg / kazan

Kubernetes API client for Elixir
https://hex.pm/packages/kazan
MIT License

Watcher fails with 410 "GONE" #45

Closed · chazsconi closed this issue 5 years ago

chazsconi commented 6 years ago

If one particular event type is being watched (e.g. Namespace changes) and no new events of that type happen for a long time, other events that are not being watched can still occur, and those non-watched events cause the cluster's resource version (RV) to increase.

Eventually the RV that is stored in the watcher is too old to be used, and the watcher fails returning a 410 response code.

If an original RV was passed to the watcher, the supervisor restarts the watcher with the same init parameters, and it fails again.

Possible solutions:

1) Parse the message that comes back to extract a later RV

2) Send a message to the caller to inform it that it can no longer watch and it should start again

3) Somehow obtain the latest RV from K8S without potentially missing events

Logs below:

(FunctionClauseError) no function clause matching in Kazan.Watcher.extract_rv/1
    (kazan) lib/kazan/watcher.ex:235: Kazan.Watcher.extract_rv(%{"apiVersion" => "v1", "code" => 410, "kind" => "Status", "message" => "too old resource version: 20293058 (20295457)", "metadata" => %{}, "reason" => "Gone", "status" => "Failure"})
    (kazan) lib/kazan/watcher.ex:219: anonymous fn/4 in Kazan.Watcher.process_lines/2
    (elixir) lib/enum.ex:1899: Enum."-reduce/3-lists^foldl/2-0-"/3
    (kazan) lib/kazan/watcher.ex:158: Kazan.Watcher.handle_info/2
    (stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:686: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message: %HTTPoison.AsyncChunk{chunk: "{\"type\":\"ERROR\",\"object\":{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"too old resource version: 20293058 (20295457)\",\"reason\":\"Gone\",\"code\":410}}\n", id: #Reference<0.4253789252.1463549953.187567>}
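
For context on the crash itself: `extract_rv/1` presumably pattern-matches a `resourceVersion` out of the object's `metadata`, and the 410 payload is a `Status` object whose metadata is empty, so no clause matches and the watcher process dies. A rough sketch of the kind of clause that would recognise the expiry instead (illustrative only, not kazan's actual code):

```elixir
# Illustrative sketch only, not kazan's actual implementation.
defmodule RvSketch do
  # A clause along these lines would explain the FunctionClauseError: the Gone
  # payload has an empty "metadata" map, so it cannot match.
  def extract_rv(%{"metadata" => %{"resourceVersion" => rv}}), do: rv

  # An extra clause could recognise the expiry instead of crashing.
  def extract_rv(%{"kind" => "Status", "reason" => "Gone", "message" => msg}),
    do: {:gone, msg}
end
```
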
chazsconi commented 6 years ago

@obmarg any further ideas on how to solve this? My current preference is option (2) above.

obmarg commented 6 years ago

@chazsconi do you have any code that can reproduce this situation? I'd like to investigate a little before deciding on a solution.

chazsconi commented 6 years ago

@obmarg I will try to reproduce this with a test; however, the scenario in which I saw it happen was after more than 48 hours of no events being received by the watcher, so it might not be simple.

After seeing this ticket, I think the resource version getting too old is a normal behaviour.

https://github.com/kubernetes/kubernetes/issues/55230

obmarg commented 6 years ago

@chazsconi Yeah, I agree it seems like a normal thing to happen, and definitely something Kazan should handle. Since it seems to happen in response to other things changing in the database, I was hoping there'd be a simple way to induce the issue by just changing tons of un-watched resources. Though I guess it depends on precisely what changes need to be made in order to induce it.

I did come across this PR that updates kubectl to handle something similar. It seems (though I've not yet tested it) that if you specify a resourceVersion of "0" you'll be sent the latest resource and then any updates to that resource. I was wondering if we could use this for the issue you're experiencing, though I wanted to play around with it to see whether it made sense before suggesting anything...
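
For anyone who wants to experiment with this directly, the watch endpoint can be hit outside kazan to see what a given resourceVersion returns (a rough sketch, assuming `kubectl proxy` is serving the API on localhost:8001 without auth; HTTPoison is already a kazan dependency):

```elixir
# Experiment sketch only, not part of kazan. Assumes `kubectl proxy` is
# running on localhost:8001.
url = "http://localhost:8001/api/v1/namespaces?watch=true&resourceVersion=0"

# Stream the chunked watch response to the current process; each chunk
# arrives as an %HTTPoison.AsyncChunk{} message, as seen in the logs above.
HTTPoison.get!(url, [], stream_to: self(), recv_timeout: :infinity)

receive do
  %HTTPoison.AsyncChunk{chunk: chunk} -> IO.puts(chunk)
after
  10_000 -> IO.puts("no events within 10s")
end
```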

chazsconi commented 6 years ago

@obmarg I tried to reproduce the problem by creating a watch for pod changes in a namespace and then making changes to a config map, but even after 75000 changes to the config map the problem did not occur on the watch.

However, as you say, specifying a resource version of "0" appears to send only new events. So a simple solution is to fix the code to revert to resource version "0" when the "too old resource version" problem occurs.

However, there is a possibility of events being missed between the original watch failing and it being restarted with resource version "0". If missing events is critical, the consumer can instead be informed so it can refetch everything again.

Therefore maybe providing both options in the library, and letting the consumer choose which they prefer, would be best (a rough sketch of the choice follows the list):

  1. Inform the consumer and terminate the watch
  2. Restart from resource version "0"
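
Purely as illustration of that choice (the `on_gone` option is invented here and is not a real kazan option; the request/watcher calls follow kazan's usual API shape):

```elixir
# Hypothetical sketch only: `on_gone` does not exist in kazan, it just
# illustrates the two behaviours being discussed.
request = Kazan.Apis.Core.V1.list_namespace!()

# Option 1: stop the watch and notify the consumer, which must then re-list
# and start a fresh watcher itself (no events are silently dropped).
{:ok, _notified} = Kazan.Watcher.start_link(request, send_to: self(), on_gone: :notify)

# Option 2: transparently restart the watch from resource version "0";
# simpler for the consumer, but events during the gap could be missed.
{:ok, _restarted} = Kazan.Watcher.start_link(request, send_to: self(), on_gone: :restart)
```
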
obmarg commented 6 years ago

@chazsconi Yeah, maybe we'll end up having to do that. Though I don't know if I'm too happy with an option that forces people to pick between handling a semi-rare event (that we can't explain well enough to say when it'll happen) and potentially losing events. Since this seems to happen when the watched resource hasn't been updated for days, the chances of it changing in the second or so while we're restarting the watch seem slim.

I definitely want to do a bit more investigation around this before settling on a fix, though if you're blocked I'd be happy to release a temporary fix.

chazsconi commented 6 years ago

@obmarg I actually saw the problem occur again today. An event had been received on the watched resource around 1 hour before, so it's certainly not days before this happens.

We are currently using K8S v1.8.8. We are planning to upgrade to v1.9.7 next week, and I'd be interested to see if the problem still occurs after the upgrade.

chazsconi commented 6 years ago

@obmarg I'm currently running against my fork in production with this fix: https://github.com/chazsconi/kazan/commit/42b297dd89ca30d1d714f9400c0f47cae8d2a067

As I cannot reproduce this in tests, I'm waiting for the scenario to re-occur to check that this fixes the problem. After that I can open it as a PR.

chazsconi commented 6 years ago

Unfortunately this does not work. Setting the resource version to "0" causes some previous changes to be resent, and so does leaving the resource version unset.

After checking the K8S docs (https://kubernetes.io/docs/reference/api-concepts/), I found that the correct way to handle it is to refetch the resources:

> When the requested watch operations fail because the historical version of that resource is not available, clients must handle the case by recognizing the status code 410 Gone, clearing their local cache, performing a list operation, and starting the watch from the resourceVersion returned by that new list operation.

Therefore, I believe that option (1) that I listed above is the solution, i.e. inform the consumer and let the consumer decide how to deal with it.
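
To make that concrete, the documented recovery flow on the consumer side would be roughly the following (a sketch assuming kazan's generated list call, `Kazan.run/1`, snake_case model fields, and a watcher option for the starting resource version; the exact option name is a guess):

```elixir
# Sketch of the "clear cache, list, re-watch" flow from the K8S docs.
# The resource_version option name is an assumption.
defmodule WatchRecovery do
  def relist_and_rewatch(send_to) do
    request = Kazan.Apis.Core.V1.list_namespace!()

    # 1. Perform a fresh list operation to rebuild the local cache and obtain
    #    a current resourceVersion.
    {:ok, ns_list} = Kazan.run(request)

    # 2. Start a new watch from that resourceVersion, so nothing between the
    #    list and the new watch is missed.
    {:ok, watcher} =
      Kazan.Watcher.start_link(request,
        send_to: send_to,
        resource_version: ns_list.metadata.resource_version
      )

    {ns_list.items, watcher}
  end
end
```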

obmarg commented 6 years ago

Apologies for my complete silence on this. I've been in the process of finding a new job, which has taken up just about all of my time. It's almost sorted now though, so I will be ready to give feedback soon.

chazsconi commented 6 years ago

No problem. We've been running a fork using this commit in production for the last couple of weeks: https://github.com/chazsconi/kazan/commit/a201af75f9e945c57c0b48485a493295d66d2edf, and it appears to solve the problem, although, of course, we need to handle the 410 in the consumer.

This is a breaking change, as the messages now sent to the consumer are a different struct, so that the `from` can be included.
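
For reference, handling the new struct in a consumer then looks roughly like this (the struct name, its fields, and the type used for the 410 notification are assumptions based on the description above; check the commit for the real shape):

```elixir
# Rough consumer sketch; struct and field names are assumed, not confirmed.
defmodule WatchConsumer do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def init(_opts) do
    request = Kazan.Apis.Core.V1.list_namespace!()
    {:ok, watcher} = Kazan.Watcher.start_link(request, send_to: self())
    {:ok, %{watcher: watcher}}
  end

  # The watch can no longer be resumed: the consumer must re-list and start a
  # fresh watcher (e.g. the recovery sketch above). The `from` field lets a
  # consumer with several watchers tell which one expired.
  def handle_info(%Kazan.Watcher.Event{type: :gone, from: _expired_watcher}, state) do
    {:noreply, state}
  end

  # Normal watch events: apply the change to the local cache here.
  def handle_info(%Kazan.Watcher.Event{object: _object, from: _from}, state) do
    {:noreply, state}
  end
end
```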

obmarg commented 6 years ago

Ok, so I'd ideally like a way for kazan to handle this automatically. However, it's not clear what the best way to do that would be, and forcing users to handle it is at least a step in the right direction (since it's currently un-handleable). If you want to make a PR from that branch, I'd be happy to accept it after a bit of review.

chazsconi commented 6 years ago

@obmarg Sorry for the delay - PR now created.

obmarg commented 6 years ago

No problem, thanks for the PR 👍. I'm without access to a laptop for the next couple of weeks. Will give it a look once I'm back.

obmarg commented 5 years ago

Fixed in #47