jhamman closed this issue 4 years ago.
Worth noting that I had this problem with the traefik in z2jh above, and a slightly newer release of the JH chart fixes it for me (0.9.0-beta.4.n008.hb20ad22: no problem; 0.9.0-beta.4: has the traefik problem). If somebody is having trouble with pangeo, they could compare the traefik config between those releases for a possible solution.
Thanks @snickell - unfortunately, we're already running with 0.9.0-beta.4.n008.hb20ad22, so I don't think that is the problem.
I think I have this working now. What seems to be happening is that the `autohttps` pod's `traefik` container starts up before the rest of the deployment/network is ready to go. The solution I've found is simply to delete the `autohttps` pod and let it come back to life on its own:
```
$ helm upgrade --wait --install --namespace ocean-staging ocean-staging pangeo-deploy \
    -f deployments/ocean/config/common.yaml \
    -f deployments/ocean/config/staging.yaml \
    -f deployments/ocean/secrets/staging.yaml \
    --cleanup-on-fail
Release "ocean-staging" does not exist. Installing it now.
NAME: ocean-staging
LAST DEPLOYED: Fri Mar 13 14:21:15 2020
NAMESPACE: ocean-staging
STATUS: deployed
REVISION: 1
TEST SUITE: None
```
```
$ kubectl logs autohttps-d44df9478-5j7p4 traefik -n ocean-staging -f
time="2020-03-13T21:21:21Z" level=info msg="Configuration loaded from file: /etc/traefik/traefik.toml"
time="2020-03-13T21:21:21Z" level=info msg="Traefik version 2.1.6 built on 2020-02-28T17:40:18Z"
time="2020-03-13T21:21:21Z" level=info msg="\nStats collection is disabled.\nHelp us improve Traefik by turning this feature on :)\nMore details on: https://docs.traefik.io/contributing/data-collection/\n"
time="2020-03-13T21:21:21Z" level=info msg="Starting provider aggregator.ProviderAggregator {}"
time="2020-03-13T21:21:21Z" level=info msg="Starting provider *file.Provider {\"watch\":true,\"filename\":\"/etc/traefik/dynamic.toml\"}"
time="2020-03-13T21:21:21Z" level=info msg="Starting provider *acme.Provider {\"email\":\"raphael.dussin@gmail.com\",\"caServer\":\"https://acme-v02.api.letsencrypt.org/directory\",\"storage\":\"/etc/acme/acme.json\",\"keyType\":\"RSA4096\",\"httpChallenge\":{\"entryPoint\":\"http\"},\"ResolverName\":\"le\",\"store\":{},\"ChallengeStore\":{}}"
time="2020-03-13T21:21:21Z" level=info msg="Testing certificate renew..." providerName=le.acme
time="2020-03-13T21:21:21Z" level=info msg="Starting provider *traefik.Provider {}"
time="2020-03-13T21:21:28Z" level=info msg=Register... providerName=le.acme
time="2020-03-13T21:21:40Z" level=error msg="Unable to obtain ACME certificate for domains \"staging.ocean.pangeo.io\" : unable to generate a certificate for the domains [staging.ocean.pangeo.io]: acme: Error -> One or more domains had a problem:\n[staging.ocean.pangeo.io] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.ocean.pangeo.io/.well-known/acme-challenge/LkF-zkQujguOBaF8hZ_a6AOk6vBFUzOTVjlIUr8CF3Y: Timeout during connect (likely firewall problem), url: \n" providerName=le.acme
time="2020-03-13T21:21:41Z" level=error msg="Unable to obtain ACME certificate for domains \"staging.ocean.pangeo.io\" : unable to generate a certificate for the domains [staging.ocean.pangeo.io]: acme: Error -> One or more domains had a problem:\n[staging.ocean.pangeo.io] acme: error: 400 :: urn:ietf:params:acme:error:connection :: Fetching http://staging.ocean.pangeo.io/.well-known/acme-challenge/LkF-zkQujguOBaF8hZ_a6AOk6vBFUzOTVjlIUr8CF3Y: Timeout during connect (likely firewall problem), url: \n" providerName=le.acme
^C
```
```
$ kubectl delete pod -n ocean-staging autohttps-d44df9478-5j7p4
pod "autohttps-d44df9478-5j7p4" deleted
```
```
$ kubectl logs autohttps-d44df9478-rjtjh traefik -n ocean-staging -f
time="2020-03-13T21:23:40Z" level=info msg="Configuration loaded from file: /etc/traefik/traefik.toml"
time="2020-03-13T21:23:40Z" level=info msg="Traefik version 2.1.6 built on 2020-02-28T17:40:18Z"
time="2020-03-13T21:23:40Z" level=info msg="\nStats collection is disabled.\nHelp us improve Traefik by turning this feature on :)\nMore details on: https://docs.traefik.io/contributing/data-collection/\n"
time="2020-03-13T21:23:40Z" level=info msg="Starting provider aggregator.ProviderAggregator {}"
time="2020-03-13T21:23:40Z" level=info msg="Starting provider *file.Provider {\"watch\":true,\"filename\":\"/etc/traefik/dynamic.toml\"}"
time="2020-03-13T21:23:40Z" level=info msg="Starting provider *acme.Provider {\"email\":\"raphael.dussin@gmail.com\",\"caServer\":\"https://acme-v02.api.letsencrypt.org/directory\",\"storage\":\"/etc/acme/acme.json\",\"keyType\":\"RSA4096\",\"httpChallenge\":{\"entryPoint\":\"http\"},\"ResolverName\":\"le\",\"store\":{},\"ChallengeStore\":{}}"
time="2020-03-13T21:23:40Z" level=info msg="Testing certificate renew..." providerName=le.acme
time="2020-03-13T21:23:40Z" level=info msg="Starting provider *traefik.Provider {}"
```
And the hub comes online.
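The delete-and-wait workaround above can be sketched as a reusable shell function. This is a sketch, not part of the original thread: the `app=jupyterhub,component=autohttps` label selector is an assumption based on how z2jh labels its pods (verify with `kubectl get pods --show-labels` first), and the timeout is arbitrary.

```shell
# Sketch: restart the autohttps pod and wait for its replacement.
# Assumes z2jh's app=jupyterhub,component=autohttps pod labels.
restart_autohttps() {
  local ns="$1"
  # Deleting the pod lets the Deployment recreate it, now that the
  # rest of the deployment/network is ready.
  kubectl delete pod -n "$ns" -l app=jupyterhub,component=autohttps
  # Block until the replacement pod reports Ready.
  kubectl rollout status deployment/autohttps -n "$ns" --timeout=120s
}

# Example (run against your cluster):
#   restart_autohttps ocean-staging
```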
@jhamman @tjcrone In order to get this update to work on AWS, I had to run the following commands locally:

```
$ helm version
version.BuildInfo{Version:"v3.1.2", GitCommit:"d878d4d45863e42fd5cff6743294a11d28a9abce", GitTreeState:"clean", GoVersion:"go1.14"}
```
```
cd pangeo-cloud-federation
git pull upstream staging
cd pangeo-deploy
helm repo add pangeo https://pangeo-data.github.io/helm-chart/
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo add dask-gateway https://dask.org/dask-gateway-helm-repo/
helm repo update
helm dependency update
cd ../
kubectl delete deployment autohttps -n icesat2-staging
kubectl delete rolebinding -n icesat2-staging autohttps
kubectl delete role -n icesat2-staging autohttps
helm upgrade --wait --install --cleanup-on-fail --namespace icesat2-staging icesat2-staging pangeo-deploy \
    -f deployments/icesat2/config/common.yaml \
    -f deployments/icesat2/config/staging.yaml \
    -f deployments/icesat2/secrets/staging.yaml
```
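The three `kubectl delete` commands above (deployment, rolebinding, role) can be wrapped into one idempotent function. This is a sketch, not part of the original thread; substitute your own namespace.

```shell
# Sketch: remove the autohttps resources so the helm upgrade can
# recreate them cleanly.
cleanup_autohttps() {
  local ns="$1"
  for kind in deployment rolebinding role; do
    # --ignore-not-found makes the cleanup safe to re-run
    kubectl delete "$kind" autohttps -n "$ns" --ignore-not-found
  done
}

# Example (run against your cluster):
#   cleanup_autohttps icesat2-staging
```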
The update succeeds, but there is a problem with autohttps: going to the login page results in ERR_SSL_PROTOCOL_ERROR.

```
kubectl delete pod -n icesat2-staging autohttps-7cb6845966-fvjhs
```

and we're back in business.
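A quick way to confirm the fix from the command line, rather than reloading the login page, is to check that the endpoint now completes a TLS handshake and serves a certificate. This sketch is not from the original thread; the example hostname is the staging domain mentioned earlier.

```shell
# Sketch: verify that a host completes a TLS handshake and print the
# subject and validity dates of the certificate it serves.
check_tls() {
  local host="$1"
  # -servername enables SNI so the right certificate is selected;
  # a failed handshake (the ERR_SSL_PROTOCOL_ERROR case) produces no
  # certificate for x509 to print.
  echo | openssl s_client -connect "${host}:443" -servername "$host" 2>/dev/null \
    | openssl x509 -noout -subject -dates
}

# Example:
#   check_tls staging.ocean.pangeo.io
```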
@scottyhq!! Thank you! This was very helpful. I was hitting the same ERR_SSL_PROTOCOL_ERROR that you had, but I ran through all of the steps you provided here and we are back up on staging; the CI looks to be working fine for staging as well. Awesome. Thank you very much for providing these steps.
possibly related to #560
In an effort to debug #560, I tore down staging.hydro.pangeo.io and redeployed it in a fresh environment. This went as expected:
But there seems to be a problem with the proxy:
cc @consideRatio