helm / charts

⚠️(OBSOLETE) Curated applications for Kubernetes
Apache License 2.0
15.49k stars 16.79k forks source link

[stable/traefik] Disable compression by default? #2098

Closed kachkaev closed 6 years ago

kachkaev commented 7 years ago

Is this a BUG REPORT or FEATURE REQUEST? (choose one):

somewhat a bug

Version of Helm and Kubernetes:

N/A

Which chart:

stable/traefik

What happened:

I noticed that the traefik chart by default sets compress = true for both http and https endpoints by having gzip.enabled: true in values.yml. I saw a couple of issues with this and decided to bring them up for discussion:

If the chart was in its early days, I'd recommend setting gzip.enabled to false by default, but now this is a breaking change. However given that traefik 1.4 will be released soon, maybe a time for a BC is not too bad? We could also use this opportunity to clean-up a few things e.g. https://github.com/kubernetes/charts/issues/1457.

WDYT?

/cc @krancour @electroma

krancour commented 7 years ago

Thanks for starting this discussion. I've also found a couple scenarios where I didn't want Traefik's compression enabled. iirc, one case involved a Rails app that was doing its own compression and the response from Traefik was coming through doubly compressed.

I might disagree with a couple of your points:

  1. Overall, there's probably not much benefit from compressing all the requests by traefik itself

    Remember that an ingress controller takes the place of a web server. It was never uncommon or considered "not beneficial" to have Apache or Nginx doing compression. So it's hard for me to see why we wouldn't often want Traefik to do the same.

  2. I guess that most of the upstream containers in most cases will do the compression

    While some web applications frameworks (Rails comes to mind) may handle this, I think that with the advent of microservice architectures, the relative complexity of the average program running in in a containers is greatly reduced.

Even if I dispute those points, I agree with you 100% with your conclusion. My rationale is that any option that frequently trips people up shouldn't be a default.

In terms of fixing this, is the release of Traefik v1.4.0 the right time to consider a major release of the chart with breaking changes as you suggestt? I'm not sure. I'm personally eager to pay down tech debt, but I think that Traefik 2.0.0 (whenever that happens) seems like a more natural time to tackle that because breaking changes will presumably be unavoidable at that point in time. i.e. It's probably easier to defend breaking changes in the chart by saying they were unavoidable than by saying that a couple of us simply felt it was time.

I'd like to hear the opinions of some others who have contributed to this chart, especially @c-knowles.

kachkaev commented 7 years ago

Thanks for your thoughts @krancour, I'd be also curious to hear @c-knowles's opinion.

Further investigation of https://github.com/containous/traefik/issues/1908#issuecomment-328892261 showed that the problem with tls and compression appeared due to very low memory limits that this chart sets by default. So I guess we either need to turn compression off or increase memoryLimit, which in any case is a breaking change. Not doing this might mean that more chart users will just end up with lots of iowait even when the server load is tiny.

cknowles commented 7 years ago

I think most users will expect a certain amount of flux as charts and the helm ecosystem in general mature. As long as the versions are still semver and it's documented what the backwards incompatible part is then that seems absolutely fine. So as a user I'm totally happy to see breaking changes, especially for such simple updates.

Having said that, defaults for any chart should revolve around the recommended practice. Should recommended practice for Traefik be to compress articles that will benefit from it and which are not already compressed? I think so but perhaps others have different views like the simpler the better.

Currently we have the gzip option overridden to false due to https://github.com/containous/traefik/issues/1060 and https://github.com/containous/traefik/issues/1851. Most of those double compression issues seems like they've been worked out so I'm considering switching it back on. I'm definitely not saying we should change chart defaults because of other software bugs, especially if we think we'd switch the default back again later. Maybe at least call them out in the readme.

Perhaps some of our effort would best be directed into the Traefik project itself. For example, picking more sensible defaults so it disallows already heavily compressed articles like png, zip etc. That way it benefits all users of Traefik not just Kubernetes users and we could hook into the new options. I had a look at that because we also have some nginx deployments I'd like to ditch. Those tend to have compression level, min length, mime types list, disabled user agents etc all tweaked with better defaults. Some of those options are not supported by Traefik yet. The default min size seems to be 512 bytes but it's not configurable. Content types would be a case of adding ContentTypes to the compress middleware. The current default for that seems to be all types accepted for compression which does not seem to be a sensible default but the gzip handler being used only has an allowed list not a disallowed list so harder to just disable a few types.

cknowles commented 7 years ago

@kachkaev what do you have your memory request/limit on? We've basically doubled ours from the defaults otherwise we encountered issues running with many ingresses and hitting the dashboard.

kachkaev commented 7 years ago

I agree with @c-knowles that although breaking changes are not completely painless for the users, there is nothing too bad in them as long as everything is properly documented. One could do something like this (which is recommended in any case):

## update the chart get latest chart version
helm repo update
helm inspect stable/traefik

## check what version you're currently running
helm history my-traefik

## if a major version has been bumped, update your values.yml
## depending on what breaking change you've read about

## upgrade
helm upgrade my-traefik stable/traefik --values ./values.yaml

After more than a workday spent on investigating https://github.com/containous/traefik/issues/1908 just to find that the chart defaults were inappropriate for quite a basic use case, I'd like to vote for a BC just to save others from the same trap.

At the moment my memoryLimit is set to 100Mi, but it's an arbitrary number that is simply greater than the current default one, not a result of any benchmarking. 30Mi seems to be not enough even for users who want load a single page with 30 jpegs via tls+traefik's gzip.

My intuition suggests that turning traefik compression off by default brings more benefits than harm. Although the double-compression bug has already been fixed, upstream webservers still have more flexibility in terms of what they compress and what not. Doing unnecessary compression of jpegs, videos and zips is worse IMO than not compressing htmls, because this makes the server slower and overloads client's CPU without any gains. Those who only want to serve text APIs via traefik and keep their microservices simple can turn the compression on, but that's a pretty peculiar case rather than a common scenario. Serving jpegs and other large files along htmls, javascripts and styles is far more common.

Conclusion: I'd vote for chart version 2 with

kachkaev commented 7 years ago

Following https://github.com/containous/traefik/issues/1908, traefik guys removed all default resource limits and replaced them with a notice: https://github.com/containous/traefik/pull/2113

kachkaev commented 7 years ago

UPD: I just deep-refreshed a page with a few jpegs in three concurrent tabs and saw the io peak again despite increasing traefik's memory limit from 30 to 100Mi! So updated the comment above by suggesting to remove resource definitions completely.

krancour commented 7 years ago

no memoryLimit (even 100Mi was not enough in a couple of my manual tls+gzip tests) and no other preconfigured resources

Unless something has changed recently, the repo maintainers (I am deputized to help maintain this chart, but am not a repo maintainer) will not accept a chart that doesn't set some default resource limits... and for good reason, imo. No one wants to deploy container(s) with no upper bound on resource utilization. A memory leak, for instance, could lead a container to monopolize all of a single node's resources.

Also, I'm sorry, but I think we're being too cavalier with the willingness to introduce breaking changes that compel a major release of this chart. Trust me, I am as eager as you to pay down some tech debt in this chart, but I don't want the cost of doing so to be a disservice to whoever is using this chart. It's nice that you "vote" for a breaking change (as would @c-knowles it appears, and as would I under other circumstances), but that's a survey of precisely three people. I don't think that is by any means an adequate sample size to establish that this wouldn't be a major inconvenience to everyone else who uses this.

cknowles commented 7 years ago

I don't mind either way tbh, but we should have the "right" reasons to change. Bugs or bad defaults in Traefik we should fix there. Unbounded default resource limits definitely sound like a bad idea. Admission controllers, pod presets, initializers etc could all be useful for an operator to prevent unbounded limits entering their system.

kachkaev commented 7 years ago

@c-knowles I don't think that this is the case of traefik having wrong defaults, because it neither enables compression out of box nor sets up any memory limits. Both of these defaults are the opinions of the helm chart, which end up being not optimal. I do consider them as a bug, because the result is the behaviour demonstrated in https://github.com/containous/traefik/issues/1908#issuecomment-328343800. Bugs do need to get fixed, even if this implies a breaking change.

After hearing your arguments I can agree with you and @krancour that resource limits is something that should remain – good point about memory leaks and the danger of monopolising the nodes. Unfortunately, I'm not sure I'll be able to advise any numbers as I haven't put traefik under high load yet (especially with tls and compression, a combination of which is quite memory-greedy). Nevertheless, I strongly believe that the defaults need to be different from what is now and this should not wait for too long, because otherwise people will be getting hard-to-debug iowait issues in production.

Just of the curiosity, I checked the current amount of 'flux' in the official charts:

helm update

helm search | grep -c "stable/"
## 99 (stable charts total)

for N in 0 1 2 3 4 5 6; do
  COUNT=$(helm search | grep -c "stable/.*\\s${N}\\.");
  echo "version ${N}.x: ${COUNT} charts";
done

## version 0.x: 83 charts
## version 1.x: 12 charts
## version 2.x: 2 charts
## version 3.x: 0 charts
## version 4.x: 1 charts
## version 5.x: 0 charts
## version 6.x: 1 charts

Out of all the stable charts in the repo, only 17% are declared as semver-stable, the others are still < v1. This suggests that a breaking change is not something that is expected to happen only once in a long while (e.g. with the release of traefik 2.0, which is not on the horizon). Even if we value the stability of the existing traefik chart and thus delay fixing suboptimal defaults, a BC will happen soon anyway – see https://github.com/kubernetes/charts/issues/1785.

krancour commented 7 years ago

Both of these defaults are the opinions of the helm chart, which end up being not optimal. I do consider them as a bug, because the result is the behaviour demonstrated in containous/traefik#1908 (comment). Bugs do need to get fixed, even if this implies a breaking change.

I agree, as I've said before, that if ("if" being the operative word here) something is tripping up a lot of people, then it should not be a default. That is to say, if it were easier to change this "opinion," I'd be very much in favor of it.

That said, I have a hard time seeing this as a bug and using that as justification for a breaking change. This default, as you said, is an "opinion" of the chart. Your disagreement with that opinion doesn't qualify it as a bug. Defaults are obviously overridable, which is what you should be doing.

I've read through containous/traefik#1908 (excellent sleuthing by the way!) and if I understand correctly, this issue occurs at the intersection of four conditions: TLS, gzip compression, low resource limits, high request volume.

All current users of this chart must fall into one of the following categories:

  1. Unaffected by containous/traefik#1908
  2. Have already overridden defaults to overcome containous/traefik#1908

This means there are really only two constituents who stand to benefit from the breaking change you propose:

  1. You
  2. New users of this chart

As for the first, I'm sure you're able to override any default that isn't working for you. So let us ask ourselves how we can reduce the chances of new users of this chart from encountering containous/traefik#1908 without introducing a breaking change.

The first thing that comes to mind is modifying the default resource limits. I don't know that I would consider a modest increase in the memory needs of the app (we're talking 30MB to 100MB, yes?) to be a breaking change, especially since the k8s scheduler can be counted on to shift a pod to a node with adequate resources based on the constraints specified.

The second idea that comes to mind is updating this chart's documentation re: using helm install to suggest the use of --set ... to either disable gzip compression or raise resource limits. We could even explain why they may wish to do so and link to containous/traefik#1908.

Thoughts?

kachkaev commented 7 years ago

Just a random observation: @unguiculus suggests not to specify any default resources for the vault chart:

screenshot from 2017-09-15 13-59-17

Interesting. Would be good to know the reasoning behind this. Could be the new chart standard.

kachkaev commented 7 years ago

One more thing. I started a new traefik issue https://github.com/containous/traefik/issues/2124, suggesting to enable a compress option only for certain content types. This will be ideal for a mix of microservices behind this reverse proxy, but the solution will have to wait till someone implements it (sadly, I don't know go).

krancour commented 7 years ago

Hmm. "don't specify defaults for resources" doesn't necessarily need to mean limits aren't required to be set. It's possibly to explicitly require a user to specify some otherwise undefined value. Of course we can't do that without a breaking change either.

Is it reasonable, do you think, do at least raise our default resource limits here? How much do they have to be raised to circumvent the issue?

kachkaev commented 7 years ago

Perhaps defining another default would be worse that not defining any while hinting a user on what it could be. The values seem to highly depend on the load and the config, so if we don't give any defaults, we make chart users explicitly aware that hey are responsible for picking the right numbers.

The reason why it took me so long to debug https://github.com/containous/traefik/issues/1908 was because I did not see any resource limits in my chart's values.yml and was blaming traefik internal code for iowait until I looked into the generated Deployment spec by accident (which I normally should not do, because it's deployed by helm). Changing memoryLimit from 30Mi to 100Mi felt like a solution for a bit of time, but the next day I easily ‘broke’ traefik again by quickly deep-refreshing a page with several jpegs in multiple tabs. When compress was turned off, 30Mi was always enough, but until what RPM?

If you still think that a breaking change for traefik is something that needs to happen once in a long while, I'd at least start with flagging out potential issues with the current values in the README. Speaking of the breaking changes, I'd propose turning off compression and removing the default resources, given that the first may cause harm and the second thing is something that @unguiculus recommends in another chart. But ideally I'd just make a breaking change ASAP, document it and not worry too much about it. Two reasons:

  1. 83% of the stable charts are still <v1 so a bug fix with a BC will not make an exceptional situation.
  2. https://github.com/kubernetes/charts/issues/1785 is inevitably coming.
krancour commented 7 years ago

I'd at least start with flagging out potential issues with the current values in the README.

100% on board with that.

83% of the stable charts are still <v1 so a bug fix with a BC will not make an exceptional situation.

I might technically be wrong, but I've never felt that v < 1.0.0 is adequate impetus to disrespect semver. Even if that's generally permitted or at least forgiven, this chart is already v > 1.0.0.

1785 is inevitably coming.

After a brief read over #1785, it's not immediately clear to me that addressing it forces a BC on this chart. Am I wrong?

fejta-bot commented 6 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta. /lifecycle stale

krancour commented 6 years ago

This can be closed.

fejta-bot commented 6 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten /remove-lifecycle stale

krancour commented 6 years ago

/close

kachkaev commented 6 years ago

Another person suffering from current chart defaults:

https://github.com/containous/traefik/issues/3341

krancour commented 6 years ago

Respectfully, containous/traefik#3341 isn't strongly related to this issue, and as stated before, the proposal to turn off compression by default is a breaking change.

If you need compression off, turn it off. Don't advocate for a breaking change that will surprise others with unanticipated behavior changes upon upgrade.