pangeo-data / pangeo-binder

Pangeo + Binder (dev repo for a binder/pangeo fusion concept)
http://binder.pangeo.io
BSD 3-Clause "New" or "Revised" License

AWS BinderHub deploy failed on staging #161

Closed: TomAugspurger closed this issue 4 years ago

TomAugspurger commented 4 years ago

From https://app.circleci.com/pipelines/github/pangeo-data/pangeo-binder/154/workflows/242b4d43-abb5-4b07-b5fa-518ec8e7a09f/jobs/161, which was deploying https://github.com/pangeo-data/pangeo-binder/pull/159

#!/bin/bash -eo pipefail
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.38/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.38/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.38/example/prometheus-operator-crd/monitoring.coreos.com_prometheuses.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.38/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.38/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.38/example/prometheus-operator-crd/monitoring.coreos.com_thanosrulers.yaml
helm upgrade --wait --install \
  ${CIRCLE_BRANCH} pangeo-binder \
  --namespace=${CIRCLE_BRANCH} --version=v0.2.0 \
  -f ./deploy-aws/${CIRCLE_BRANCH}.yaml \
  -f ./secrets-aws/${CIRCLE_BRANCH}.yaml
#helm history ${CIRCLE_BRANCH} -n ${CIRCLE_BRANCH}

Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/alertmanagers.monitoring.coreos.com configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/podmonitors.monitoring.coreos.com configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/prometheuses.monitoring.coreos.com configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/prometheusrules.monitoring.coreos.com configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/servicemonitors.monitoring.coreos.com configured
Warning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply
customresourcedefinition.apiextensions.k8s.io/thanosrulers.monitoring.coreos.com configured
Error: UPGRADE FAILED: rendered manifests contain a new resource that already exists. Unable to continue with update: existing resource conflict: namespace: staging, name: binderhub, existing_kind: networking.k8s.io/v1beta1, Kind=Ingress, new_kind: networking.k8s.io/v1beta1, Kind=Ingress

Exited with code exit status 1

CircleCI received exit code 1

cc @salvis2. I don't know if the changes in https://github.com/pangeo-data/pangeo-binder/pull/159 would cause that. Perhaps the change at https://github.com/pangeo-data/pangeo-binder/pull/159/files#diff-590371a4912f3ac827ca1b75524c8582L23, though deploy-aws should already override that.
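As an aside, an alternative to deleting the conflicting ingress would be to let Helm adopt it. Helm 3.2+ will take ownership of a pre-existing resource if it carries the release metadata Helm expects. A hedged sketch, assuming the release and namespace are both `staging` (matching `${CIRCLE_BRANCH}` in the CI step above):

```shell
# Sketch: mark the existing ingress as Helm-managed so `helm upgrade --install`
# adopts it instead of failing with "existing resource conflict".
# Release name "staging" is an assumption based on the CI config above.
kubectl -n staging label ingress binderhub \
  app.kubernetes.io/managed-by=Helm --overwrite
kubectl -n staging annotate ingress binderhub \
  meta.helm.sh/release-name=staging \
  meta.helm.sh/release-namespace=staging --overwrite
```

After this, re-running the same `helm upgrade --wait --install` command should treat the ingress as part of the release rather than erroring. Deleting and letting CI re-create it (as discussed below in the thread) is equally valid and simpler when brief downtime is acceptable.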

salvis2 commented 4 years ago

Digging in.

salvis2 commented 4 years ago

Looking at the line you referenced above, I see we've dropped kube-lego in favor of nginx. Running

kubectl get ingress -A

I see that there are some kube-lego-nginx ingress objects (not confusing at all). Are those fine?

There's also the binderhub ingress object that the error above references. Running kubectl describe on it, I see this line in the config:

TLS:
  kubelego-tls-binder-staging terminates staging.aws-uswest2-binder.pangeo.io

The name kubelego-tls-binder-staging doesn't appear anywhere in staging.yaml. For comparison, the staging-grafana ingress object has an equivalent TLS line, and the name it lists is one I can find in staging.yaml. The prod binderhub ingress object has the same problem.

I think we should manually delete these ingress objects. Sound good? We can re-create them manually once and then let CI take over, or should we just let CI re-create them?

salvis2 commented 4 years ago

Also, I have no clue what networking.k8s.io/v1beta1 means. I have yet to see it on a resource.

Edit: it's an API version. See the k8s docs.
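For context, `networking.k8s.io/v1beta1` is the `apiVersion` string at the top of an Ingress manifest; the Ingress kind moved from `extensions/v1beta1` through `networking.k8s.io/v1beta1` before going stable as `networking.k8s.io/v1` in Kubernetes 1.19. A minimal illustrative manifest (the host is taken from the TLS line above; the backend service name is a made-up placeholder):

```yaml
# Minimal v1beta1-style Ingress sketch, for illustration only.
apiVersion: networking.k8s.io/v1beta1   # the API version the error message refers to
kind: Ingress
metadata:
  name: binderhub
  namespace: staging
spec:
  rules:
    - host: staging.aws-uswest2-binder.pangeo.io
      http:
        paths:
          - backend:
              serviceName: binder   # placeholder; v1beta1 backend syntax
              servicePort: 80
```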

TomAugspurger commented 4 years ago

I think it's an older API that's been stabilized.

I might be wrong, but I may have seen these issues on GCP, and may have manually deleted them. I don't recall 100%.


salvis2 commented 4 years ago

Also, running

kubectl get deployments -n staging

shows

api-staging-dask-gateway
binder
controller-staging-dask-gateway
hub
proxy
staging-grafana
staging-kube-lego
staging-nginx-ingress-controller
staging-nginx-ingress-default-backend
staging-prometheus-operato-operator
traefik-staging-dask-gateway
user-scheduler

The staging-kube-lego deployment should probably go?

TomAugspurger commented 4 years ago

Yeah, I see

NAME                                            READY   UP-TO-DATE   AVAILABLE   AGE
api-staging-dask-gateway                        1/1     1            1           174d
binder                                          1/1     1            1           199d
binderhub-proxy-nginx-ingress-controller        1/1     1            1           30d
binderhub-proxy-nginx-ingress-default-backend   1/1     1            1           30d
controller-staging-dask-gateway                 1/1     1            1           174d
hub                                             1/1     1            1           199d
proxy                                           1/1     1            1           199d
staging-nginx-ingress-controller                1/1     1            1           199d
staging-nginx-ingress-default-backend           1/1     1            1           199d
traefik-staging-dask-gateway                    1/1     1            1           174d

so removing it seems reasonable.

salvis2 commented 4 years ago

Cool. I've deleted the deployments staging-kube-lego and prod-kube-lego from their respective namespaces. I'll delete the binderhub ingress object in staging and try to re-run the helm install command manually.
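The cleanup described above can be sketched as follows. Resource names come from this thread; the prod namespace name and the exact manual helm invocation (mirroring the CI step with `CIRCLE_BRANCH=staging`) are assumptions:

```shell
# Remove the leftover kube-lego deployments (prod namespace name assumed).
kubectl delete deployment staging-kube-lego -n staging
kubectl delete deployment prod-kube-lego -n prod

# Delete the conflicting ingress so Helm can re-create it cleanly.
kubectl delete ingress binderhub -n staging

# Re-run the CI deploy step manually, with CIRCLE_BRANCH expanded to "staging".
helm upgrade --wait --install \
  staging pangeo-binder \
  --namespace=staging --version=v0.2.0 \
  -f ./deploy-aws/staging.yaml \
  -f ./secrets-aws/staging.yaml
```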

salvis2 commented 4 years ago

Green lights as of https://github.com/pangeo-data/pangeo-binder/commit/a5380b98fabefc5cc9874c74b79476a7a7aad428