I've been having trouble upgrading from 0.3.12 on AWS (using Auth0) to the version of qhub on `main` (i.e. `export QHUB_GH_BRANCH=main`). On the deploy step, the error I keep running into is the following:

```
[terraform]: │ Error: Get "http://localhost/api/v1/namespaces/dev": dial tcp [::1]:80: connect: connection refused
```
I've seen errors like this in the past but I haven't been able to get around it. @danlester do you have any idea why this might be failing or if there are additional steps I need to take?
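For reference, a minimal sketch of the upgrade flow being attempted here (the pip install source and CLI flags are assumptions based on the documented qhub workflow, not a verbatim transcript):

```bash
# Sketch of the upgrade attempt; adjust paths/flags to your environment.
export QHUB_GH_BRANCH=main

# Install qhub from the main branch (assumed install source).
pip install git+https://github.com/Quansight/qhub.git@main

# Rewrite qhub-config.yaml for the new version, then redeploy.
qhub upgrade -c qhub-config.yaml
qhub deploy -c qhub-config.yaml
```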
@iameskild Not too sure, but we can have a call if you want to look together.
@danlester I've attempted another upgrade with the same results. I will try to perform an upgrade from 0.3.13 to main for another cloud provider and see if I get it working. I'm free to jump on a call whenever is convenient for you, thanks for your help!
I don't think there will be much difference, but I would suggest also trying 0.3.12 to main for another cloud provider, so you're changing less for comparison.
It could also be worth trying with `password` instead of `auth0` to see if that works - I have done most testing under `password`.
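Switching the auth method should only be a small change to qhub-config.yaml; a sketch assuming the `security.authentication.type` field of the 0.3.x config schema (verify against your own config):

```bash
# Switch qhub-config.yaml to password auth before redeploying.
# Field paths are assumptions based on the 0.3.x config schema; requires yq v4.
yq -i '.security.authentication.type = "password"' qhub-config.yaml

# Drop the now-unused Auth0 client block, if present (assumed field name).
yq -i 'del(.security.authentication.config)' qhub-config.yaml
```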
@danlester I was able to upgrade from 0.3.12 to 0.4.0 (`main`) running on DO using `password`. I made the following adjustments:
- installed `qhub` (bumped version to `v0.4.0`) into a `qhub-main` conda env
- updated the image tags to `v0.3.14`
Unfortunately the `hub` pod never came back up. This meant I couldn't test importing existing users or verify that the user data is still intact. `hub` pod logs:
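A sketch of how the `hub` pod state and logs can be pulled (the `dev` namespace and the `hub`/`component=hub` names are assumptions based on the default QHub/JupyterHub layout):

```bash
# Namespace, deployment name and labels are assumptions for a default QHub install.
kubectl get pods -n dev -l component=hub            # locate the hub pod
kubectl logs -n dev deployment/hub --tail=200       # recent hub logs
kubectl describe pod -n dev -l component=hub        # events, if the pod is stuck
```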
@iameskild This is the same problem that Vini faced: https://github.com/Quansight/qhub/pull/967#issuecomment-1005712132
I'm not too sure why you manually updated the image tags to `v0.3.14`. The `qhub upgrade` should have already set them to `v0.3.14` - but only if they started off as `v0.3.12` in the qhub-config.yaml file. Ultimately, when qhub (the Python module) has its internal version number at `v0.4.0`, `qhub upgrade` should end up at `v0.4.0` for the image tags instead.

But since the qhub repo doesn't yet have a `v0.4.0` tag, no corresponding images exist on Docker Hub, so you would really need to (manually) use `main` as the image tag to get the versions based on our latest code.
If you still have the broken site running, try updating the image tags in qhub-config.yaml and redeploy - it will still be a helpful test I think.
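For illustration, the kind of edit this implies in qhub-config.yaml (the `default_images` keys and image names are assumptions based on the 0.3.x schema; check your own file for the exact fields):

```bash
# Point the default images at the `main` tag; field paths and image names are
# assumptions based on the 0.3.x config schema. Requires yq v4.
yq -i '
  .default_images.jupyterhub  = "quansight/qhub-jupyterhub:main" |
  .default_images.jupyterlab  = "quansight/qhub-jupyterlab:main" |
  .default_images.dask_worker = "quansight/qhub-dask-worker:main"
' qhub-config.yaml

# Redeploy against the updated config.
qhub deploy -c qhub-config.yaml
```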
Still happy to have a call to go through all of this together.
Redeploying with image tags set to `main` resolves this issue. After importing the users and logging in, the user data remains intact :)
I still want to go back and test upgrading a QHub instance that uses Auth0.
Upgrading qhub (on AWS, using Auth0) from `v0.3.12` to `v0.4.0` failed during the deployment process. I tried the same upgrade and deploy on DO and, while it successfully deployed and I could import users, I couldn't log in due to the following:
I also noticed a few bizarre Terraform outputs:
@danlester are you available to troubleshoot together tomorrow after the QHub sync?
@danlester capturing the Terraform logs led me to:

```
Invalid provider configuration was supplied. Provider operations likely to fail: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
```

Googling this, I found an issue on the terraform-aws-eks repo where one of the top recommendations was https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1234#issuecomment-787936210:

```
export KUBE_CONFIG_PATH=/Users/eskild/.kube/config
```
With this trick, the deployment seemed to be working but then it started deleting subnet resources and errored out, leaving the cluster in a half-deleted state.
Logs in this gist.
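For anyone reproducing this, a sketch of how the verbose Terraform logs and the kubeconfig workaround can be combined (`TF_LOG`/`TF_LOG_PATH` are standard Terraform environment variables, `KUBE_CONFIG_PATH` comes from the linked terraform-aws-eks comment, and the qhub invocation is an assumption):

```bash
# Capture verbose Terraform logs from the qhub-driven deploy; the env vars
# propagate to the terraform subprocesses qhub spawns.
export TF_LOG=DEBUG
export TF_LOG_PATH="$PWD/terraform-debug.log"

# Workaround from terraform-aws-eks#1234: give the kubernetes provider an
# explicit kubeconfig so it doesn't fall back to localhost.
export KUBE_CONFIG_PATH="$HOME/.kube/config"

# Assumed invocation.
qhub deploy -c qhub-config.yaml
```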
@iameskild I believe I've solved this particular problem (Terraform trying to access localhost cluster) in the following issue which gives more details. It has a corresponding PR - please review:
Kubeconfig state unavailable, Terraform defaults to localhost
However, (in AWS) it leads me to the problem you were seeing about subnet resources being replaced. (Some outputs below). Once it wants to replace the node groups, the apply will never finish since the nodes can't be destroyed until the cluster has its contents removed safely.
By the way, I tried the upgrade on AWS and got the same localhost error using password auth (not Auth0) - I don't think the auth type has anything to do with it, and you were just lucky if you got password upgrade to work before - or maybe something has changed since!
As discussed, the login problem you saw with Auth0 above is because the callback URL needs to be changed, and we need to advise the user in `qhub upgrade` - issue nebari-dev/nebari#991 for you.
I think it's something to do with CIDR changes:
```
[terraform]: # module.network.aws_subnet.main[0] must be replaced
[terraform]: -/+ resource "aws_subnet" "main" {
[terraform]: ~ arn = "arn:aws:ec2:eu-west-2:892486800165:subnet/subnet-0aede967b72f0907b" -> (known after apply)
[terraform]: ~ availability_zone_id = "euw2-az2" -> (known after apply)
[terraform]: ~ cidr_block = "10.10.0.0/20" -> "10.10.0.0/18" # forces replacement
[terraform]: ~ id = "subnet-0aede967b72f0907b" -> (known after apply)
[terraform]: + ipv6_cidr_block_association_id = (known after apply)
```
I would take a look at where these have been changed (e.g. `vpc_cidr_newbits` and `vpc_cidr_block` in the code), find out why, and see if they can at least be preserved for old installations.
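To make the `/20` → `/18` jump concrete: with Terraform's `cidrsubnet()`, the subnet prefix length is the VPC prefix plus `vpc_cidr_newbits`, so changing either newbits or the base `vpc_cidr_block` forces new subnet CIDRs and hence subnet replacement. A sketch, assuming a `10.10.0.0/16` VPC block with newbits going from 4 to 2 (both values are assumptions inferred from the plan output above):

```bash
# cidrsubnet() shows how newbits controls the subnet prefix length; terraform
# console can evaluate built-in functions even without a full configuration.
terraform console <<'EOF'
cidrsubnet("10.10.0.0/16", 4, 0)
cidrsubnet("10.10.0.0/16", 2, 0)
EOF
# Expected output:
#   "10.10.0.0/20"
#   "10.10.0.0/18"
```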
@iameskild just to keep in mind during tests:
- CI/CD workflows have been tested and a PR for the relevant bug fixes/modifications has been opened: nebari-dev/nebari#1086
- The Azure issues seen in the integration tests do not affect fresh local deployments
@danlester @HarshCasper Have you tested the `qhub upgrade` command for the above version migrations? Just to know if that still needs to be tested :smile:
v0.4.0 released. Closing issue 🙌
Checklist:

Validate successful `qhub deploy` and `qhub destroy` for each provider:
- [x] AWS
  - Validate the following services:
- [x] Azure
  - Validate the following services:
- [ ] DO
  - Validate the following services:
- [x] GCP
  - Validate the following services:
- [x] local/existing kubernetes cluster/minikube
  - Validate the following services:

Validate `qhub upgrade` is successful for each provider:
- [ ] AWS: `v0.3.12`/`v0.3.13`/`v0.3.14` to `v0.4.0`
- [ ] Azure: `v0.3.12`/`v0.3.13`/`v0.3.14` to `v0.4.0`
- [ ] DO: `v0.3.12`/`v0.3.13`/`v0.3.14` to `v0.4.0`
- [ ] GCP: `v0.3.12`/`v0.3.13`/`v0.3.14` to `v0.4.0`
- [ ] local/existing kubernetes deployment/minikube: `v0.3.12`/`v0.3.13`/`v0.3.14` to `v0.4.0`

Validate `qhub-ops.yaml` workflow (outdated)