obmarg / kazan

Kubernetes API client for Elixir
https://hex.pm/packages/kazan
MIT License

Handle 410 Expired messages without crashing #76

Closed - ahovgaard closed this 4 years ago

ahovgaard commented 4 years ago

I have observed messages with code 410 and reason Expired arriving and causing the Kazan.Watcher process to crash with a function clause error like the following:

2020-03-30 14:26:51.276 [error] pid=<0.1720.0> module=gen_server function=error_info/7 line=889  GenServer #PID<0.1720.0> terminating
** (FunctionClauseError) no function clause matching in Kazan.Watcher.extract_rv/1
    (kazan 0.11.0) lib/kazan/watcher.ex:309: Kazan.Watcher.extract_rv(%{"apiVersion" => "v1", "code" => 410, "kind" => "Status", "message" => "The resourceVersion for the provided watch is too old.", "metadata" => %{}, "reason" => "Expired", "status" => "Failure"})
    (kazan 0.11.0) lib/kazan/watcher.ex:281: Kazan.Watcher.process_line/3
    (elixir 1.10.1) lib/enum.ex:2111: Enum."-reduce/3-lists^foldl/2-0-"/3
    (kazan 0.11.0) lib/kazan/watcher.ex:201: Kazan.Watcher.handle_info/2
    (stdlib 3.11) gen_server.erl:637: :gen_server.try_dispatch/4
    (stdlib 3.11) gen_server.erl:711: :gen_server.handle_msg/6
    (stdlib 3.11) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: %HTTPoison.AsyncChunk{chunk: "{\"type\":\"ERROR\",\"object\":{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"The resourceVersion for the provided watch is too old.\",\"reason\":\"Expired\",\"code\":410}}\n", id: #Reference<0.1835110855.1638924289.183387>}

The intention of this PR is to handle this scenario more gracefully, in the same way as the 410 Gone message.
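For illustration, here is a minimal sketch of the kind of clause this adds. extract_rv/1 is the function from the stack trace above; the return shape for the 410 case is an assumption for the sketch, not necessarily what Kazan's internals actually do:

```elixir
defmodule WatcherSketch do
  # A Status object with code 410 means the resourceVersion we were watching
  # from is no longer available; treat "Expired" exactly like "Gone" and
  # surface it instead of raising a FunctionClauseError.
  def extract_rv(%{"code" => 410, "reason" => reason, "message" => message})
      when reason in ["Gone", "Expired"] do
    {:gone, message}
  end

  # Normal watch objects carry the resourceVersion in their metadata.
  def extract_rv(%{"metadata" => %{"resourceVersion" => rv}}), do: rv
end
```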

The Kubernetes documentation about this 410 Expired message is very limited. I'm curious to hear if someone knows why this may happen (in my case around every 10 minutes) and if there is a way to avoid this, since it seems to me that there might be a risk of losing events when this happens.

Any comments and suggestions are appreciated.

obmarg commented 4 years ago

Hey @ahovgaard - thanks a lot for this contribution.

I've done a bit of digging into these 410 Expired messages - this comment in some k8s Go code that does watching suggests that 410 Expired is more or less equivalent to 410 Gone, so this seems like a reasonable fix.

As to the reasons for seeing these - I think it's down to the way watch requests are implemented in k8s. Behind the scenes k8s uses etcd as its database, and is probably making one of these requests to etcd, passing the resourceVersion in to the watch request. It also mentions on that page that etcd can compact revisions, and when that happens older versions of resources become unavailable.

In this bit of Kubernetes documentation there's this snippet:

A given Kubernetes server will only preserve a historical list of changes for a limited time. Clusters using etcd3 preserve changes in the last 5 minutes by default.

I assume (though I've never confirmed) that this is referring to compacting revisions, and the 410 Gone is received because we've started a watch referring to a revision that no longer exists.

I'm not aware of any way that this can be avoided. The official guidance is:

When the requested watch operations fail because the historical version of that resource is not available, clients must handle the case by recognizing the status code 410 Gone, clearing their local cache, performing a list operation, and starting the watch from the resourceVersion returned by that new list operation. Most client libraries offer some form of standard tool for this logic. (In Go this is called a Reflector and is located in the k8s.io/client-go/cache package.)

The implementation of this in Go seems to be this function here: https://github.com/kubernetes/client-go/blob/b3b874faea7ef72a60cb1e5b804dc0b789b8f95e/tools/cache/reflector.go#L242-L429. We don't implement this in Kazan at the moment; we just leave it up to users of the library to do so. I'd be happy to accept a PR to add it though.
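For anyone rolling this themselves, here is a rough Elixir sketch of that list-then-watch loop as a Kazan consumer. The Kazan calls used here (list_namespaced_pod!/1, Kazan.run/1, Kazan.Watcher.start_link/2 with send_to and resource_version options, and the Event struct fields) are based on my reading of the Kazan docs and may not match your version exactly, so treat this as a sketch of the pattern rather than a drop-in Reflector:

```elixir
defmodule PodReflector do
  @moduledoc "Sketch of the list -> watch -> handle 410 loop described above."
  use GenServer

  def start_link(namespace), do: GenServer.start_link(__MODULE__, namespace)

  @impl true
  def init(namespace) do
    {:ok, resync(%{namespace: namespace, cache: %{}})}
  end

  # ADDED / MODIFIED / DELETED events keep the local cache up to date.
  @impl true
  def handle_info(%Kazan.Watcher.Event{type: type, object: pod}, state)
      when type in [:added, :modified] do
    {:noreply, put_in(state.cache[pod.metadata.name], pod)}
  end

  def handle_info(%Kazan.Watcher.Event{type: :deleted, object: pod}, state) do
    {:noreply, %{state | cache: Map.delete(state.cache, pod.metadata.name)}}
  end

  # 410 Gone/Expired: throw the cache away, re-list, and start watching again
  # from the resourceVersion returned by the fresh list.
  def handle_info(%Kazan.Watcher.Event{type: :gone}, state) do
    {:noreply, resync(%{state | cache: %{}})}
  end

  defp resync(state) do
    {:ok, pod_list} =
      state.namespace
      |> Kazan.Apis.Core.V1.list_namespaced_pod!()
      |> Kazan.run()

    cache = Map.new(pod_list.items, &{&1.metadata.name, &1})

    {:ok, _watcher} =
      state.namespace
      |> Kazan.Apis.Core.V1.list_namespaced_pod!()
      |> Kazan.Watcher.start_link(
        send_to: self(),
        resource_version: pod_list.metadata.resource_version
      )

    %{state | cache: cache}
  end
end
```

The important property is the one from the guidance above: the fresh list both rebuilds the local cache and yields the resourceVersion to resume watching from.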

obmarg commented 4 years ago

I'm going to merge this just now anyway - it definitely seems like a good fix.

chazsconi commented 4 years ago

I realise I'm a bit late to the party here, but which version of K8S have you seen this on, @ahovgaard? I've not seen this in production, but that is maybe because we are a bit behind on K8S releases.

ahovgaard commented 4 years ago

@chazsconi I am not completely sure which version of k8s we were using at the time, but I suspect it was version 1.15.10.

chazsconi commented 4 years ago

Ok - that makes sense. We are using 1.14 right now, but will be upgrading shortly.

I know the PR has already been merged, but would it be better to send %Event{type: :gone} to the consumer if there is an expired message, rather than %Event{type: :expired}? I think the action the consumer needs to take will be the same for both :gone and :expired. Otherwise this is a bit of a breaking change (well, for me at least :) ), because I will also have to handle :expired in all the places in the code where I currently handle :gone. What do you think @ahovgaard ?
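To make the concern concrete, here is roughly what a consumer looks like under the PR as merged (the Event struct shape is as used elsewhere in this thread; everything else is illustrative):

```elixir
defmodule WatchConsumerSketch do
  use GenServer

  @impl true
  def init(state), do: {:ok, state}

  # With the PR as merged, every consumer clause that handled :gone now also
  # needs to accept :expired; mapping :expired to :gone inside the watcher
  # would let this stay a single-atom match.
  @impl true
  def handle_info(%Kazan.Watcher.Event{type: type}, state)
      when type in [:gone, :expired] do
    {:noreply, restart_watch(state)}
  end

  def handle_info(%Kazan.Watcher.Event{}, state), do: {:noreply, state}

  defp restart_watch(state), do: state
end
```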

chazsconi commented 3 years ago

Hi @obmarg, @ahovgaard - any thoughts on my previous comment?

obmarg commented 3 years ago

Hi @chazsconi - I don't think I'd be against just mapping expired into gone. It does seem like most of the docs refer to Gone rather than Expired, and that's the actual name for an HTTP 410.

ahovgaard commented 3 years ago

@chazsconi That seems reasonable to me as well. I added a different event type in case someone needed to handle :expired differently from :gone, but it seems this is unlikely to be needed.