pangeo-data / pangeo-cloud-federation

Deployment automation for Pangeo JupyterHubs on AWS, Google, and Azure
https://pangeo.io/cloud.html

Problem with Traefik Proxy Deployment #563

Closed: jhamman closed this issue 4 years ago

jhamman commented 4 years ago

possibly related to #560

In an effort to debug #560, I tore down staging.hydro.pangeo.io and redeployed it in a fresh environment. This went as expected:

helm upgrade --wait --install --namespace hydro-staging hydro-staging pangeo-deploy -f deployments/hydro/config/common.yaml -f deployments/hydro/config/staging.yaml -f deployments/hydro/secrets/staging.yaml
Release "hydro-staging" does not exist. Installing it now.
NAME: hydro-staging
LAST DEPLOYED: Thu Mar 12 13:59:32 2020
NAMESPACE: hydro-staging
STATUS: deployed
REVISION: 1
TEST SUITE: None

But there seems to be a problem with the proxy:

$ kubectl logs autohttps-779cc6866d-p4wxk traefik -n hydro-staging
time="2020-03-12T20:59:41Z" level=info msg="Configuration loaded from file: /etc/traefik/traefik.toml"
time="2020-03-12T20:59:41Z" level=info msg="Traefik version 2.1.6 built on 2020-02-28T17:40:18Z"
time="2020-03-12T20:59:41Z" level=info msg="\nStats collection is disabled.\nHelp us improve Traefik by turning this feature on :)\nMore details on: https://docs.traefik.io/contributing/data-collection/\n"
time="2020-03-12T20:59:41Z" level=info msg="Starting provider aggregator.ProviderAggregator {}"
time="2020-03-12T20:59:41Z" level=info msg="Starting provider *file.Provider {\"watch\":true,\"filename\":\"/etc/traefik/dynamic.toml\"}"
time="2020-03-12T20:59:41Z" level=info msg="Starting provider *acme.Provider {\"email\":\"jhamman@ucar.edu\",\"caServer\":\"https://acme-v02.api.letsencrypt.org/directory\",\"storage\":\"/etc/acme/acme.json\",\"keyType\":\"RSA4096\",\"httpChallenge\":{\"entryPoint\":\"http\"},\"ResolverName\":\"le\",\"store\":{},\"ChallengeStore\":{}}"
time="2020-03-12T20:59:41Z" level=info msg="Testing certificate renew..." providerName=le.acme
time="2020-03-12T20:59:41Z" level=info msg="Starting provider *traefik.Provider {}"
time="2020-03-12T20:59:48Z" level=info msg=Register... providerName=le.acme
time="2020-03-12T21:00:01Z" level=error msg="Unable to obtain ACME certificate for domains \"staging.hydro.pangeo.io\" : unable to generate a certificate for the domains [staging.hydro.pangeo.io]: acme: Error -> One or more domains had a problem:\n[staging.hydro.pangeo.io] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.hydro.pangeo.io/.well-known/acme-challenge/gXjQzQXrdwo3ZtwznBXAqK8QeNm7EYbRDANxzmviLns: Timeout during connect (likely firewall problem), url: \n" providerName=le.acme
time="2020-03-12T21:00:01Z" level=error msg="Unable to obtain ACME certificate for domains \"staging.hydro.pangeo.io\" : unable to generate a certificate for the domains [staging.hydro.pangeo.io]: acme: Error -> One or more domains had a problem:\n[staging.hydro.pangeo.io] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.hydro.pangeo.io/.well-known/acme-challenge/gXjQzQXrdwo3ZtwznBXAqK8QeNm7EYbRDANxzmviLns: Timeout during connect (likely firewall problem), url: \n" providerName=le.acme

cc @consideRatio
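
For context on the error above: "Timeout during connect" means Let's Encrypt could not reach the challenge URL on port 80 from outside the cluster. A minimal external-reachability check, assuming the z2jh convention of a LoadBalancer service named proxy-public (the challenge token below is a placeholder):

# Has the public service been assigned an external address yet?
kubectl get svc proxy-public -n hydro-staging

# Does the DNS record for the domain point at that address?
nslookup staging.hydro.pangeo.io

# Is port 80 reachable from outside? Any HTTP response (even a 404) means the
# connection works; a hang matches the "Timeout during connect" error above.
curl -v --max-time 10 http://staging.hydro.pangeo.io/.well-known/acme-challenge/placeholder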

scottyhq commented 4 years ago

related: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1594

snickell commented 4 years ago

related: jupyterhub/zero-to-jupyterhub-k8s#1594

Worth noting: I hit this same traefik problem with z2jh (linked above), and a slightly newer release of the JupyterHub chart fixed it for me (0.9.0-beta.4 has the traefik problem; 0.9.0-beta.4.n008.hb20ad22 does not). If somebody is having trouble with pangeo, comparing the traefik config between those two releases might point at a solution.
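
For anyone following along, one way to confirm which chart version a release is actually running is with helm list (a sketch; release and namespace names taken from the deployment above, helm 3 syntax assumed):

# List releases in the namespace, including the chart version each one is running
helm list --namespace hydro-staging

# Inspect a specific release's status and metadata in more detail
helm status hydro-staging --namespace hydro-staging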

jhamman commented 4 years ago

Thanks @snickell - unfortunately, we're already running with 0.9.0-beta.4.n008.hb20ad22 so I don't think that is the problem.

jhamman commented 4 years ago

I think I have this working now. What seems to be happening is that the autohttps pod's traefik container starts up before the rest of the deployment/network is ready to go. The solution I've found is simply to delete the autohttps pod and let it come back to life on its own:

$ helm upgrade --wait --install --namespace ocean-staging ocean-staging pangeo-deploy -f deployments/ocean/config/common.yaml -f deployments/ocean/config/staging.yaml -f deployments/ocean/secrets/staging.yaml --cleanup-on-fail
Release "ocean-staging" does not exist. Installing it now.
NAME: ocean-staging
LAST DEPLOYED: Fri Mar 13 14:21:15 2020
NAMESPACE: ocean-staging
STATUS: deployed
REVISION: 1
TEST SUITE: None
$ kubectl logs autohttps-d44df9478-5j7p4 traefik -n ocean-staging -f
time="2020-03-13T21:21:21Z" level=info msg="Configuration loaded from file: /etc/traefik/traefik.toml"
time="2020-03-13T21:21:21Z" level=info msg="Traefik version 2.1.6 built on 2020-02-28T17:40:18Z"
time="2020-03-13T21:21:21Z" level=info msg="\nStats collection is disabled.\nHelp us improve Traefik by turning this feature on :)\nMore details on: https://docs.traefik.io/contributing/data-collection/\n"
time="2020-03-13T21:21:21Z" level=info msg="Starting provider aggregator.ProviderAggregator {}"
time="2020-03-13T21:21:21Z" level=info msg="Starting provider *file.Provider {\"watch\":true,\"filename\":\"/etc/traefik/dynamic.toml\"}"
time="2020-03-13T21:21:21Z" level=info msg="Starting provider *acme.Provider {\"email\":\"raphael.dussin@gmail.com\",\"caServer\":\"https://acme-v02.api.letsencrypt.org/directory\",\"storage\":\"/etc/acme/acme.json\",\"keyType\":\"RSA4096\",\"httpChallenge\":{\"entryPoint\":\"http\"},\"ResolverName\":\"le\",\"store\":{},\"ChallengeStore\":{}}"
time="2020-03-13T21:21:21Z" level=info msg="Testing certificate renew..." providerName=le.acme
time="2020-03-13T21:21:21Z" level=info msg="Starting provider *traefik.Provider {}"
time="2020-03-13T21:21:28Z" level=info msg=Register... providerName=le.acme
time="2020-03-13T21:21:40Z" level=error msg="Unable to obtain ACME certificate for domains \"staging.ocean.pangeo.io\" : unable to generate a certificate for the domains [staging.ocean.pangeo.io]: acme: Error -> One or more domains had a problem:\n[staging.ocean.pangeo.io] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.ocean.pangeo.io/.well-known/acme-challenge/LkF-zkQujguOBaF8hZ_a6AOk6vBFUzOTVjlIUr8CF3Y: Timeout during connect (likely firewall problem), url: \n" providerName=le.acme
time="2020-03-13T21:21:41Z" level=error msg="Unable to obtain ACME certificate for domains \"staging.ocean.pangeo.io\" : unable to generate a certificate for the domains [staging.ocean.pangeo.io]: acme: Error -> One or more domains had a problem:\n[staging.ocean.pangeo.io] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.ocean.pangeo.io/.well-known/acme-challenge/LkF-zkQujguOBaF8hZ_a6AOk6vBFUzOTVjlIUr8CF3Y: Timeout during connect (likely firewall problem), url: \n" providerName=le.acme
^C
$ kubectl delete pod -n ocean-staging autohttps-d44df9478-5j7p4
pod "autohttps-d44df9478-5j7p4" deleted
$ kubectl logs autohttps-d44df9478-rjtjh traefik -n ocean-staging -f
time="2020-03-13T21:23:40Z" level=info msg="Configuration loaded from file: /etc/traefik/traefik.toml"
time="2020-03-13T21:23:40Z" level=info msg="Traefik version 2.1.6 built on 2020-02-28T17:40:18Z"
time="2020-03-13T21:23:40Z" level=info msg="\nStats collection is disabled.\nHelp us improve Traefik by turning this feature on :)\nMore details on: https://docs.traefik.io/contributing/data-collection/\n"
time="2020-03-13T21:23:40Z" level=info msg="Starting provider aggregator.ProviderAggregator {}"
time="2020-03-13T21:23:40Z" level=info msg="Starting provider *file.Provider {\"watch\":true,\"filename\":\"/etc/traefik/dynamic.toml\"}"
time="2020-03-13T21:23:40Z" level=info msg="Starting provider *acme.Provider {\"email\":\"raphael.dussin@gmail.com\",\"caServer\":\"https://acme-v02.api.letsencrypt.org/directory\",\"storage\":\"/etc/acme/acme.json\",\"keyType\":\"RSA4096\",\"httpChallenge\":{\"entryPoint\":\"http\"},\"ResolverName\":\"le\",\"store\":{},\"ChallengeStore\":{}}"
time="2020-03-13T21:23:40Z" level=info msg="Testing certificate renew..." providerName=le.acme
time="2020-03-13T21:23:40Z" level=info msg="Starting provider *traefik.Provider {}"

And the hub comes online.
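
As an aside, the same workaround can be applied without looking up the pod's generated name by restarting the deployment (a sketch; kubectl rollout restart requires kubectl 1.15+):

# Recreate the autohttps pod; the ReplicaSet spins up a fresh one
kubectl rollout restart deployment/autohttps -n ocean-staging

# Follow the new pod's traefik container logs to confirm a clean ACME registration
kubectl logs deployment/autohttps -c traefik -n ocean-staging -f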

scottyhq commented 4 years ago

@jhamman @tjcrone In order to get this update to work on AWS I had to run the following commands locally:

helm version
version.BuildInfo{Version:"v3.1.2", GitCommit:"d878d4d45863e42fd5cff6743294a11d28a9abce", GitTreeState:"clean", GoVersion:"go1.14"}

cd pangeo-cloud-federation
git pull upstream staging
cd pangeo-deploy
helm repo add pangeo https://pangeo-data.github.io/helm-chart/
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo add dask-gateway https://dask.org/dask-gateway-helm-repo/
helm repo update
helm dependency update
cd ../
kubectl delete deployment autohttps -n icesat2-staging
kubectl delete rolebinding -n icesat2-staging autohttps
kubectl delete role -n icesat2-staging autohttps
helm upgrade --wait --install --cleanup-on-fail --namespace icesat2-staging icesat2-staging pangeo-deploy -f deployments/icesat2/config/common.yaml -f deployments/icesat2/config/staging.yaml -f deployments/icesat2/secrets/staging.yaml

The update succeeds, but there is still a problem with autohttps: going to the login page results in ERR_SSL_PROTOCOL_ERROR.

After kubectl delete pod -n icesat2-staging autohttps-7cb6845966-fvjhs, we're back in business.
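
If the pod's hash suffix isn't handy, a label selector can target the autohttps pod directly (a sketch, assuming the z2jh chart's app=jupyterhub, component=autohttps labels):

# Delete the autohttps pod by its labels instead of its generated name
kubectl delete pod -n icesat2-staging -l app=jupyterhub,component=autohttps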

tjcrone commented 4 years ago

@scottyhq!! Thank you! This was very helpful. I was having the same ERR_SSL_PROTOCOL_ERROR that you had, but after running through all of the steps you provided here we are back up on staging, and it looks like CI is working fine for staging as well. Awesome. Thank you very much for providing these steps.