ELB provisions for Kubernetes

bookshelfdave commented 7 years ago

This is an umbrella issue covering recent AWS ELB discussions for Kubernetes-managed applications.

In a discussion with @jgmize and @metadave, we've decided to switch the ELB listener protocols for all K8s apps to TCP (and SSL) instead of HTTP and HTTPS. We experimented with switching the load balancer protocols to HTTP/HTTPS in Toronto, which caused timeout issues with Gunicorn.

Our proposed solution, in two parts, is as follows:

As a first pass, we'll use a NodePort service that listens for HTTP and HTTPS requests and directs them via selector to the appropriate app deployment. These NodePorts have already been created and applied for snippets and careers, via this PR. The snippets and careers ELBs have been updated manually via the AWS console.

The second pass will be Terraform-managed ELBs, which removes the K8s LoadBalancer service for each application. This gives us full control of ELB creation without relying on alpha/beta K8s annotations. For any application that requires an external http to https redirect, we can use a new K8s service running nginx that does a simple http->https redirect. Each application would then have 2 services:

- SSL load balancer protocol serving https that connects to a NodePort service for the app
- an nginx service that performs the http->https redirect.

The nginx redirector can use a horizontal pod autoscaler if we have dynamic http load.

Terraform ELB provisioning

I'll create a directory structure similar to the Snippets multi-region Terraform setup, with one minor tweak: Terraform state from all regions will be stored in a single shared bucket to help prevent S3 bucket clutter. In this directory, we'll have a main elb Terraform module, which serves as a sort of template for creating load balancers. Any new K8s application that requires a load balancer would only need to populate the required variables and run Terraform to apply. For any additional ELB customization, the elb module can be duplicated (and renamed) and then modified as needed.

Referenced issues

We've decided not to update the Deis ELB to use http/https.
- https://github.com/mozmar/infra/issues/156
Careers and Snippets ELBs have been manually changed via the AWS console to use TCP -> 80 and SSL (Secure TCP) -> 443.
- https://github.com/mozmar/ee-infra-private/pull/82

TODO

[x] update https://github.com/mozmar/infra/issues/156
[x] update https://github.com/mozmar/ee-infra-private/pull/82
[x] ~create ELB Terraform issue to track in our project~: https://github.com/mozmar/infra/issues/157
[x] create nginx redirection service issue to track in our project: https://github.com/mozmar/infra/issues/160
[x] decommission snippets LoadBalancer
[x] decommission careers LoadBalancer

cc @jgmize @glogiotatidis

bookshelfdave commented 7 years ago

We need the following additional automation:

- create a new security group with the 30000-32767 port range per VPC/region
- assign this security group to each new ELB
- add this new security group to the nodes.<k8s_cluster> security group
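
As a rough sketch of the rule this automation would need, the snippet below builds an ingress permission covering the 30000-32767 NodePort range. The dict follows the `IpPermissions` shape used by the EC2 API. The helper name and the security group ID are hypothetical, and reading the last step as "allow traffic from the ELB group into the nodes group" is an assumption.

```python
# NodePort range listed in the steps above (the Kubernetes default).
NODEPORT_MIN = 30000
NODEPORT_MAX = 32767


def nodeport_ingress_rule(elb_security_group_id):
    """Build one EC2-style ingress permission for the NodePort range.

    Hypothetical helper: it only constructs the data and never calls AWS.
    """
    return {
        "IpProtocol": "tcp",
        "FromPort": NODEPORT_MIN,
        "ToPort": NODEPORT_MAX,
        # Allow traffic that originates from the ELB's security group.
        "UserIdGroupPairs": [{"GroupId": elb_security_group_id}],
    }


# Placeholder group ID, for illustration only.
rule = nodeport_ingress_rule("sg-0123456789abcdef0")
print(rule["FromPort"], rule["ToPort"])
```

The same values would map onto a Terraform security group rule; the Python form is only meant to make the port range and source group explicit.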

glogiotatidis commented 7 years ago

For the record, we've always been using TCP (we never enabled HTTP on careers or snippets because of bug #156, which did not get fixed). So we don't have evidence that HTTP does not work for careers. For snippets, I was not around during the http experiment (timezones suck) so I can't comment, but I know that we've mostly run snippets over TCP since Toronto and we're continuously experiencing timeouts. So I'm not sure what this bug buys us in terms of timeouts.

I discussed the http->https forward service with @jgmize and I agree it's a solution. I prefer the flexibility of managing redirects in the app and not via an external service, but that also works. I think managing the ELBs outside k8s complicates things, especially for non-SREs.

I'll complete non ELB related tasks for #140 and #141 so you're unblocked to do your magic. 🍺

bookshelfdave commented 7 years ago

I feel like we need a quick regroup on this.

@glogiotatidis do you have a preference for K8s managed ELB's?

glogiotatidis commented 7 years ago

> @glogiotatidis do you have a preference for K8s managed ELB's?

Re-read my comment and changed "I think managing the ELBs outside k8s complites things, especially for non-SREs." to "I think managing the ELBs outside k8s complicates things, especially for non-SREs."

I find k8s easier to deal with than tf so I would say yes, given that we're able to accomplish what we want with k8s annotations, and to the limited extent I understand k8s and elbs, we can.

bookshelfdave commented 7 years ago

I did some mindmapping on this, here's what I think are the pros/cons of each solution:

[image: mozmeao_elbs_xmind]

jgmize commented 7 years ago

I would love to have a fix for #153 (red Xs in @metadave's awesome mind map) that didn't require management of ELBs outside of k8s itself. Another option that @metadave and I discussed but haven't tested yet is to patch each of the master nodes to be unschedulable, as suggested in https://github.com/kubernetes/kops/issues/639#issuecomment-287015882. DaemonSets should not be affected by this in versions 1.6 and below, but that may change in 1.7 so we would need to keep this in mind for future upgrades-- hopefully the k8s issue would be resolved in that same release though.

jgmize commented 7 years ago

I personally would prefer to deal with http->https redirects outside of the apps, as it simplifies the application code and should give a minor performance improvement at the app level. This can be done with sidecar containers on k8s managed ELBs, or as independent services with tf managed ELBs.

bookshelfdave commented 7 years ago

We suspect the ELB TCP healthchecks are causing the Gunicorn issues.

jgmize commented 7 years ago

In order to switch to HTTP healthchecks, I needed to set ALLOWED_HOSTS=* for both careers and snippets, as the healthchecks are by IP and there is no way to set the host header. Also, ideally we should implement a /healthz for each app.
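
For reference, here is a minimal sketch of what a /healthz endpoint could look like as a plain WSGI callable (the interface Gunicorn serves). The function name and response bodies are hypothetical, and this is not the implementation used in careers or snippets.

```python
def healthz_app(environ, start_response):
    """Return 200 for /healthz and 404 for any other path."""
    if environ.get("PATH_INFO") == "/healthz":
        status, body = "200 OK", b"ok"
    else:
        status, body = "404 Not Found", b"not found"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]


# Exercise the callable without a server by faking start_response.
captured = {}


def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)


result = healthz_app({"PATH_INFO": "/healthz"}, fake_start_response)
print(captured["status"], result)
```

In a real app the check would more likely live in the framework's own URL routing; the bare WSGI form is used here only to keep the example self-contained.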

bookshelfdave commented 7 years ago

ELB Security group automation here

glogiotatidis commented 7 years ago

I planned to create /healthz for both anyway too 👍 But how is /healthz going to help with ALLOWED_HOSTS?

Also I found https://dryan.com/articles/elb-django-allowed-hosts/

jgmize commented 7 years ago

/healthz won't help with ALLOWED_HOSTS. That was meant as a side comment, not a solution-- my apologies for the lack of clarity in my original comment, and I've edited it to add an "Also, " in front.

jgmize commented 7 years ago

@glogiotatidis would you mind replying with links to the issues tracking /healthz in snippets and careers here? Also, I like the suggestion to append '.compute-1.amazonaws.com' to the list of ALLOWED_HOSTS instead of using '*'; let's give that a shot soon.
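
To make the difference between '*' and a leading-dot entry concrete, here is a small sketch of the matching rule Django documents for ALLOWED_HOSTS: an exact match, a leading dot that matches the domain and its subdomains, or '*' for anything. It is a simplified reimplementation for illustration only (no port handling), the `host_allowed` name is hypothetical, and the hostnames are made up.

```python
def host_allowed(host, allowed_hosts):
    """Simplified version of Django's ALLOWED_HOSTS matching (ports ignored)."""
    host = host.lower()
    for pattern in allowed_hosts:
        pattern = pattern.lower()
        if pattern == "*":
            return True  # wildcard accepts every Host header
        if pattern.startswith("."):
            # ".example.com" matches "example.com" and any subdomain of it.
            if host == pattern[1:] or host.endswith(pattern):
                return True
        elif host == pattern:
            return True
    return False


allowed = [".compute-1.amazonaws.com"]
print(host_allowed("ec2-1-2-3-4.compute-1.amazonaws.com", allowed))
print(host_allowed("10.0.0.5", allowed))
```

Under this rule a request whose Host header is a bare IP would still be rejected by the suffix entry, so it is worth checking what Host value the healthcheck actually sends before relying on it.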

jgmize commented 7 years ago

Also, let me reiterate that managing ELBs directly with TF is a temporary workaround, not a long term solution. Let me also clarify that while I have a personal preference on the http->https redirects, I have no strong objections to other approaches.

bookshelfdave commented 7 years ago

http->https redirection PR here

bookshelfdave commented 7 years ago

we still need to decom K8s-managed snippets/careers ELB's before this gets closed out

jgmize commented 7 years ago

old k8s-managed snippets & careers ELBs decommed in #175.
