harvester / harvester

Open source hyperconverged infrastructure (HCI) software
https://harvesterhci.io/
Apache License 2.0

[BUG] Can Not Upgrade Cluster When ClusterOutput & ClusterFlow Configured #3715

Closed irishgordo closed 8 months ago

irishgordo commented 1 year ago

**Describe the bug**
Running into an issue where, on either a v1.1.2-rc3 or a v1.1-a72900af-head cluster, the upgrade is blocked when a ClusterOutput & ClusterFlow are configured.

To Reproduce

Prerequisites:

  1. Create an instance of ElasticSearch running so that the cluster can access it (see the sanity check after this list):

     sudo sysctl -w vm.max_map_count=262144
     sudo docker run --name elasticsearch -p 9200:9200 -p 9300:9300 -e xpack.security.enabled=false -e node.name=es01 -it docker.elastic.co/elasticsearch/elasticsearch:6.8.23

  2. Spin up Kibana for easier viewing:

     sudo docker run --name kibana --link elasticsearch:es_alias --env "ELASTICSEARCH_URL=http://es_alias:9200" -p 5601:5601 -it docker.elastic.co/kibana/kibana:6.8.23
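
A quick optional sanity check (not part of the original steps; assumes the port mapping above) that ElasticSearch is reachable before wiring Harvester's logging to it:

    curl 'http://localhost:9200/_cluster/health?pretty'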

ElasticSearch prerequisites (replace localhost as needed):

  1. Create the elasticsearch user:

     curl --location --request POST 'http://localhost:9200/security/user/harvesteruser' \
     --header 'Content-Type: application/json' \
     --data-raw '{
         "password": "harvestertesting",
         "enabled": true,
         "roles": ["superuser", "kibana_admin"],
         "full_name": "Harvesterrr Testingggguser",
         "email": "harvestertesting@harvestertesting.com"
     }'

  2. Verify the elasticsearch user:

     curl --location --request GET 'localhost:9200/security/user/harvesteruser'

  3. Create the elasticsearch index (see the optional check after this list):

     curl --location --request PUT 'localhost:9200/harvesterindex?pretty'
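
Optionally, list the indices to confirm harvesterindex was created (an extra check, not part of the original steps):

    curl 'localhost:9200/_cat/indices?v'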

Steps to reproduce the behavior:

  1. Build a Secret in the cattle-logging namespace via Advanced -> Secrets, providing the password for the harvesteruser
  2. Build a ClusterOutput pointing to the ElasticSearch instance, use the Secret built above for the password, and uncheck the TLS box
  3. Build a ClusterFlow, select Logging, and point it to the ClusterOutput (a sketch of these resources follows this list)
  4. Have a file server that holds the v1.1-head ISO and a version.yaml with the appropriate checksum
  5. For safety, create the 'pre-flight' adjustment:
    
    $ cat > /tmp/fix.yaml <<EOF
    spec:
      values:
        systemUpgradeJobActiveDeadlineSeconds: "3600"
    EOF

    $ kubectl patch managedcharts.management.cattle.io local-managed-system-upgrade-controller --namespace fleet-local --patch-file=/tmp/fix.yaml --type merge
    $ kubectl -n cattle-system rollout restart deploy/system-upgrade-controller


  6. Fire off the upgrade
  7. The admission webhook error around logging should surface, blocking the upgrade
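
For reference, a minimal sketch of the resources from steps 1-3, assuming the logging-operator CRDs (logging.banzaicloud.io/v1beta1) shipped with rancher-logging, a hypothetical ElasticSearch host of 192.168.1.100, and hypothetical resource names; the namespace may be cattle-logging or cattle-logging-system depending on the install:

    apiVersion: v1
    kind: Secret
    metadata:
      name: harvester-es-secret            # hypothetical name for the step 1 Secret
      namespace: cattle-logging-system
    stringData:
      password: harvestertesting
    ---
    apiVersion: logging.banzaicloud.io/v1beta1
    kind: ClusterOutput
    metadata:
      name: harvester-es-output
      namespace: cattle-logging-system
    spec:
      elasticsearch:
        host: 192.168.1.100                # the ElasticSearch instance from the prerequisites
        port: 9200
        scheme: http                       # TLS unchecked, as in step 2
        index_name: harvesterindex
        user: harvesteruser
        password:
          valueFrom:
            secretKeyRef:
              name: harvester-es-secret
              key: password
    ---
    apiVersion: logging.banzaicloud.io/v1beta1
    kind: ClusterFlow
    metadata:
      name: harvester-es-flow
      namespace: cattle-logging-system
    spec:
      globalOutputRefs:
        - harvester-es-output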

**Expected behavior**
The upgrade should be allowed to run.

**Support bundle**
v1.1.2-rc3 Cluster:
[supportbundle_fe8a3fc4-3594-434c-a956-d35207f79079_2023-03-23T18-11-56Z.zip](https://github.com/harvester/harvester/files/11054453/supportbundle_fe8a3fc4-3594-434c-a956-d35207f79079_2023-03-23T18-11-56Z.zip)
v1.1-a72900af-head Cluster:
[supportbundle_a8d32ff1-0769-4a05-85be-ebe21ffef525_2023-03-23T18-12-16Z.zip](https://github.com/harvester/harvester/files/11054456/supportbundle_a8d32ff1-0769-4a05-85be-ebe21ffef525_2023-03-23T18-12-16Z.zip)

**Environment**
 - Harvester ISO version: v1.1.2-rc3 & v1.1-a72900af-head
 - Underlying Infrastructure: bare-metal & qemu/kvm

**Additional context**
![Screenshot from 2023-03-23 10-56-06](https://user-images.githubusercontent.com/5370752/227308032-10c080c2-b192-407e-9567-e7565950b4d9.png)
![Screenshot from 2023-03-23 10-38-40](https://user-images.githubusercontent.com/5370752/227308044-6347f6f0-1088-4a22-aab2-1daa3a0f4f3c.png)
w13915984028 commented 1 year ago

from the debug log:

the non-rc3 cluster hits #3616 before the upgrade; the upgrade validator denies the upgrade.

# Run kubectl commands inside here
# e.g. kubectl get all
> kubectl get bundle -A
NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
fleet-local   fleet-agent-local                             0/1                       ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback) is in progress]
fleet-local   local-managed-system-agent                    0/1                       ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback) is in progress]
fleet-local   mcc-harvester                                 1/1                       
fleet-local   mcc-harvester-crd                             1/1                       
fleet-local   mcc-local-managed-system-upgrade-controller   0/1                       ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback) is in progress]
fleet-local   mcc-rancher-logging                           0/1                       ErrApplied(1) [Cluster fleet-local/local: another operation (install/upgrade/rollback) is in progress]
fleet-local   mcc-rancher-logging-crd                       1/1                       
fleet-local   mcc-rancher-monitoring                        1/1                       
fleet-local   mcc-rancher-monitoring-crd    
> kubectl get managedchart -A
NAMESPACE     NAME                                      AGE
fleet-local   harvester                                 21d
fleet-local   harvester-crd                             21d
fleet-local   local-managed-system-upgrade-controller   21d
fleet-local   rancher-logging                           21d
fleet-local   rancher-logging-crd                       21d
fleet-local   rancher-monitoring                        21d
fleet-local   rancher-monitoring-crd                    21d
w13915984028 commented 1 year ago

rc3 version: I will check why the harvester bundle is in such a state.

# Run kubectl commands inside here
# e.g. kubectl get all
> kubectl get bundle -A
NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
fleet-local   fleet-agent-local                             1/1                       
fleet-local   local-managed-system-agent                    1/1                       
fleet-local   mcc-harvester                                 0/1                       Modified(1) [Cluster fleet-local/local]; kubevirt.kubevirt.io harvester-system/kubevirt modified {"spec":{"customizeComponents":{"patches":[{"patch":"{\"webhooks\":[{\"name\":\"kubevirt-validator.kubevirt.io\",\"failurePolicy\":\"Ignore\"},{\"name\":\"kubevirt-update-validator.kubevirt.io\",\"failurePolicy\":\"Ignore\"}]}","resourceName":"virt-operator-validator","resourceType":"ValidatingWebhookConfiguration","type":"strategic"},{"patch":"{\"spec\":{\"template\":{\"spec\":{\"containers\":[{\"name\":\"virt-api\", \"resources\":{\"limits\":{\"cpu\":\"400m\",\"memory\":\"1100Mi\"}}}]}}}}","resourceName":"virt-api","resourceType":"Deployment","type":"strategic"},{"patch":"{\"spec\":{\"template\":{\"spec\":{\"containers\":[{\"name\":\"virt-controller\", \"resources\":{\"limits\":{\"cpu\":\"800m\",\"memory\":\"1300Mi\"}}}]}}}}","resourceName":"virt-controller","resourceType":"Deployment","type":"strategic"},{"patch":"{\"spec\":{\"template\":{\"spec\":{\"containers\":[{\"name\":\"virt-handler\", \"resources\":{\"limits\":{\"cpu\":\"700m\",\"memory\":\"1600Mi\"}}}]}}}}","resourceName":"virt-handler","resourceType":"DaemonSet","type":"strategic"}]}}}
fleet-local   mcc-harvester-crd                             1/1                       
fleet-local   mcc-local-managed-system-upgrade-controller   1/1                       
fleet-local   mcc-rancher-logging                           1/1                       
fleet-local   mcc-rancher-logging-crd                       1/1                       
fleet-local   mcc-rancher-monitoring                        1/1                       
fleet-local   mcc-rancher-monitoring-crd                    1/1                       
> kubectl get managedchart -A
NAMESPACE     NAME                                      AGE
fleet-local   harvester                                 41h
fleet-local   harvester-crd                             41h
fleet-local   local-managed-system-upgrade-controller   41h
fleet-local   rancher-logging                           41h
fleet-local   rancher-logging-crd                       41h
fleet-local   rancher-monitoring                        41h
fleet-local   rancher-monitoring-crd                    41h

fleet-local mcc-harvester 0/1 Modified(1) [Cluster fleet-local/local]; kubevirt.kubevirt.io harvester-system/kubevirt modified {"spec":{"customizeComponents":{"patches":[{"patch":"{\"webhooks\":[{\"name\":\"kubevirt-validator.kubevirt.io\",\"failurePolicy\":\"Ignore\"},{\"name\":\"kubevirt-update-validator.kubevirt.io\",\"failurePolicy\":\"Ignore\"}]}","resourceName":"virt-operator-validator","resourceType":"ValidatingWebhookConfiguration","type":"strategic"},{"patch":"{\"spec\":{\"template\":{\"spec\":{\"containers\":[{\"name\":\"virt-api\", \"resources\":{\"limits\":{\"cpu\":\"400m\",\"memory\":\"1100Mi\"}}}]}}}}","resourceName":"virt-api","resourceType":"Deployment","type":"strategic"},{"patch":"{\"spec\":{\"template\":{\"spec\":{\"containers\":[{\"name\":\"virt-controller\", \"resources\":{\"limits\":{\"cpu\":\"800m\",\"memory\":\"1300Mi\"}}}]}}}}","resourceName":"virt-controller","resourceType":"Deployment","type":"strategic"},{"patch":"{\"spec\":{\"template\":{\"spec\":{\"containers\":[{\"name\":\"virt-handler\", \"resources\":{\"limits\":{\"cpu\":\"700m\",\"memory\":\"1600Mi\"}}}]}}}}","resourceName":"virt-handler","resourceType":"DaemonSet","type":"strategic"}]}}}

Could this be related to https://github.com/harvester/harvester/commit/8c620b7dbc2f218c3714aa185940929fc38e796f ?

w13915984028 commented 1 year ago

The rc3 cluster seems to be complaining about something related to

https://github.com/harvester/harvester/commit/8c620b7dbc2f218c3714aa185940929fc38e796f

I will check whether a newly installed cluster is in that state as well.

irishgordo commented 1 year ago

@w13915984028 On a brand new single-node v1.1.2-rc3 cluster, without a VM, ClusterFlow, or ClusterOutput configured, this is what is seen:

# Run kubectl commands inside here
# e.g. kubectl get all
> kubectl get bundle -A
NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
fleet-local   fleet-agent-local                             1/1                       
fleet-local   local-managed-system-agent                    1/1                       
fleet-local   mcc-harvester                                 1/1                       
fleet-local   mcc-harvester-crd                             1/1                       
fleet-local   mcc-local-managed-system-upgrade-controller   1/1                       
fleet-local   mcc-rancher-logging                           1/1                       
fleet-local   mcc-rancher-logging-crd                       1/1                       
fleet-local   mcc-rancher-monitoring                        1/1                       
fleet-local   mcc-rancher-monitoring-crd                    1/1                       
> kubectl get managedchart -A
NAMESPACE     NAME                                      AGE
fleet-local   harvester                                 5m1s
fleet-local   harvester-crd                             5m1s
fleet-local   local-managed-system-upgrade-controller   5m1s
fleet-local   rancher-logging                           5m1s
fleet-local   rancher-logging-crd                       5m1s
fleet-local   rancher-monitoring                        5m1s
fleet-local   rancher-monitoring-crd                    5m1s
irishgordo commented 1 year ago

After spinning up a VM and configuring a ClusterFlow & ClusterOutput, it still yields:


# Run kubectl commands inside here
# e.g. kubectl get all
> kubectl get bundle -A
NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
fleet-local   fleet-agent-local                             1/1                       
fleet-local   local-managed-system-agent                    1/1                       
fleet-local   mcc-harvester                                 1/1                       
fleet-local   mcc-harvester-crd                             1/1                       
fleet-local   mcc-local-managed-system-upgrade-controller   1/1                       
fleet-local   mcc-rancher-logging                           1/1                       
fleet-local   mcc-rancher-logging-crd                       1/1                       
fleet-local   mcc-rancher-monitoring                        1/1                       
fleet-local   mcc-rancher-monitoring-crd                    1/1                       
> kubectl get managedcharts -A
NAMESPACE     NAME                                      AGE
fleet-local   harvester                                 14m
fleet-local   harvester-crd                             14m
fleet-local   local-managed-system-upgrade-controller   14m
fleet-local   rancher-logging                           14m
fleet-local   rancher-logging-crd                       14m
fleet-local   rancher-monitoring                        14m
fleet-local   rancher-monitoring-crd                    14m
> 

for a v1.1.2-rc3 single node.

Then, after changing systemUpgradeJobActiveDeadlineSeconds as described above, it still yields:

# Run kubectl commands inside here
# e.g. kubectl get all
> kubectl get bundle -A
NAMESPACE     NAME                                          BUNDLEDEPLOYMENTS-READY   STATUS
fleet-local   fleet-agent-local                             1/1                       
fleet-local   local-managed-system-agent                    1/1                       
fleet-local   mcc-harvester                                 1/1                       
fleet-local   mcc-harvester-crd                             1/1                       
fleet-local   mcc-local-managed-system-upgrade-controller   1/1                       
fleet-local   mcc-rancher-logging                           1/1                       
fleet-local   mcc-rancher-logging-crd                       1/1                       
fleet-local   mcc-rancher-monitoring                        1/1                       
fleet-local   mcc-rancher-monitoring-crd                    1/1                       
> kubectl get managedcharts -A
NAMESPACE     NAME                                      AGE
fleet-local   harvester                                 17m
fleet-local   harvester-crd                             17m
fleet-local   local-managed-system-upgrade-controller   17m
fleet-local   rancher-logging                           17m
fleet-local   rancher-logging-crd                       17m
fleet-local   rancher-monitoring                        17m
fleet-local   rancher-monitoring-crd                    17m

Then, when trying to create the upgrade, it could be created successfully.

irishgordo commented 1 year ago

Changing this to reproduce/rare, since it does not seem to be easily reproduced...

w13915984028 commented 1 year ago

From kubevirts.yaml, there is a special value in the KubeVirt object that differs from the default value; it may be causing the complaints from fleet.

The patched image value: "image":"registry.suse.com/harvester-beta/virt-controller:0.54.0-1"

- apiVersion: kubevirt.io/v1
  kind: KubeVirt

      - patch: '{"spec":{"template":{"spec":{"containers":[{"name":"virt-controller",
          "image":"registry.suse.com/harvester-beta/virt-controller:0.54.0-1","imagePullPolicy":"Always"}]}}}}'
        resourceName: virt-controller
        resourceType: Deployment
        type: strategic


w13915984028 commented 1 year ago

my local master-head release shows:

harv2:~ # kubectl get pods -n harvester-system virt-controller-5d54b8b9bf-rnw5g -oyaml | grep image
    - --launcher-image
    image: registry.suse.com/suse/sles/15.4/virt-controller:0.54.0-150400.3.7.1
    imagePullPolicy: IfNotPresent
    image: registry.suse.com/suse/sles/15.4/virt-controller:0.54.0-150400.3.7.1
    imageID: sha256:30c23294b1b9fad7e729d52b3f0a296d16bc6d735c785f2e5e88fb4e7c7cf668
harv2:~ # 
w13915984028 commented 1 year ago

The image registry.suse.com/harvester-beta/virt-controller:0.54.0-1 is patched onto virt-controller, but virt-operator is normal.
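
One way to compare what the two deployments are actually running (a hedged check, assuming the harvester-system namespace):

    kubectl -n harvester-system get deploy virt-controller virt-operator \
      -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.containers[0].image}{"\n"}{end}'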

...


irishgordo commented 1 year ago

I believe that patch was due to testing: https://github.com/harvester/harvester/wiki/Replace-KubeVirt-virt-controller-and-other-KubeVirt-component-images

w13915984028 commented 1 year ago

@irishgordo @bk201 @guangbochen

There are 2 scenarios here:

(1) In the v1.1.2-rc3 release upgrade test, the upgrade check blocks the upgrade because of the temporary patch to virt-controller; the fleet-agent complains that the harvester bundle is modified. This output meets the design.

(2) In the v1.1-head release upgrade test, the same issue as #3616 was encountered.

At the moment, we have no further fix planned for this issue.

How should we continue with this issue? Thanks.

guangbochen commented 1 year ago

Will monitor this issue when dealing with the v1.2.0-rc upgrade, thanks.

bk201 commented 1 year ago

This happens when we manually patch the kube-virt image (https://github.com/harvester/harvester/issues/3715#issuecomment-1481965453). @Vicente-Cheng, we need a way to deal with this case.

w13915984028 commented 1 year ago

It seems we could add the following to harvester/harvester-installer/pkg/config/templates/rancherd-10-harvester.yaml:

    - apiVersion: kubevirt.io/v1
      jsonPointers:
      - /spec/customizeComponents
      kind: KubeVirt
      name: kubevirt

This lets fleet-agent skip checking changes under spec.customizeComponents, so we can patch kube-virt.
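
For context, a sketch of how this entry might sit in that template, assuming it nests under the harvester ManagedChart's diff.comparePatches list and that the KubeVirt object lives in the harvester-system namespace:

    diff:
      comparePatches:
      - apiVersion: kubevirt.io/v1
        kind: KubeVirt
        name: kubevirt
        namespace: harvester-system
        jsonPointers:
        - /spec/customizeComponents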

This needs to be in both v1.1.3 and v1.2.0.

please @Vicente-Cheng help verify, thanks.

guangbochen commented 1 year ago

Moving this issue to v1.2.1 as a note; modifying the default kubevirt config is not supported at the current stage, and since the kubevirt patch is already included in Harvester v1.2.0, users will need to revert it before the upgrade.
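
For anyone checking before an upgrade, a hedged sketch of a quick way to spot a custom image patch on the KubeVirt object (the stock patches shown earlier in this thread only adjust webhook failure policies and resource limits, so any "image" entry here is a customization that would need reverting):

    kubectl -n harvester-system get kubevirt kubevirt \
      -o jsonpath='{.spec.customizeComponents.patches}' | grep -o '"image":"[^"]*"'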

Vicente-Cheng commented 10 months ago

Updated https://github.com/harvester/harvester/wiki/Replace-KubeVirt-virt-controller-and-other-KubeVirt-component-images with the above upgrade solution.

Thanks, @w13915984028, for providing the patch. I tested it, and it works well.

I will also add documentation for this (not only on the wiki; the docs should mention it as well).

Vicente-Cheng commented 10 months ago

After discussion, we thought the wiki information should be enough. Most people would not patch kubevirt manually.

Let's move forward.

harvesterhci-io-github-bot commented 10 months ago

Pre Ready-For-Testing Checklist

~* [ ] If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted? The HEP PR is at:~

Test steps as below:

  1. Create a 3-node v1.1.1 cluster
  2. Patch the virt-controller as in step 4 of https://github.com/harvester/harvester/wiki/Replace-KubeVirt-virt-controller-and-other-KubeVirt-component-images (see the sketch after this list)
  3. Create a VM
  4. Try to patch the harvester managedchart, as in step 3 of https://github.com/harvester/harvester/wiki/Replace-KubeVirt-virt-controller-and-other-KubeVirt-component-images
  5. Try to upgrade (either v1.1.2 or v1.1.3 is fine)
  6. Make sure the upgrade is successful
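
For reference, a hedged sketch of what test step 2 amounts to; the wiki steps are authoritative, and this only illustrates the kind of entry being added (the image tag is the one referenced earlier in this issue):

    # Illustration only -- follow the wiki for the exact procedure.
    kubectl -n harvester-system edit kubevirt kubevirt

    # ...then append an entry like this under spec.customizeComponents.patches,
    # keeping the existing entries in the list:
    - patch: '{"spec":{"template":{"spec":{"containers":[{"name":"virt-controller", "image":"registry.suse.com/harvester-beta/virt-controller:0.54.0-1","imagePullPolicy":"Always"}]}}}}'
      resourceName: virt-controller
      resourceType: Deployment
      type: strategic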

~* [ ] Is there a workaround for the issue? If so, where is it documented? The workaround is at:~

~* [ ] Has the backend code been merged (harvester, harvester-installer, etc.) (including `backport-needed/`)? The PR is at:~

~* [ ] If labeled: area/ui Has the UI issue filed or ready to be merged? The UI issue/PR is at:~

~* [ ] If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged? The documentation/KB PR is at:~

~* [ ] If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?~

~* [ ] If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility? The compatibility issue is filed at:~

harvesterhci-io-github-bot commented 10 months ago

Automation e2e test issue: harvester/tests#1020

irishgordo commented 8 months ago

@Vicente-Cheng thanks for the mention :+1: :smile:

Following the test plan, things look good :smile:

Waiting for ManagedChart fleet-local/harvester-crd from generation 2
Target version: 1.1.2, Target state: ready
Current version: 1.1.2, Current state: null, Current generation: 4
Waiting for KubeVirt to upgraded to 0.54.0-150400.3.10.4...
KubeVirt current version: 0.54.0-150400.3.7.1, target version: 0.54.0-150400.3.10.4
KubeVirt current version: 0.54.0-150400.3.7.1, target version: 0.54.0-150400.3.10.4
KubeVirt current version: 0.54.0-150400.3.7.1, target version: 0.54.0-150400.3.10.4
KubeVirt current version: 0.54.0-150400.3.7.1, target version: 0.54.0-150400.3.10.4
KubeVirt current version: 0.54.0-150400.3.7.1, target version: 0.54.0-150400.3.10.4
KubeVirt current version: 0.54.0-150400.3.7.1, target version: 0.54.0-150400.3.10.4
KubeVirt current version: 0.54.0-150400.3.7.1, target version: 0.54.0-150400.3.10.4
KubeVirt current version: 0.54.0-150400.3.7.1, target version: 0.54.0-150400.3.10.4
KubeVirt current version: 0.54.0-150400.3.7.1, target version: 0.54.0-150400.3.10.4
KubeVirt current version: 0.54.0-150400.3.7.1, target version: 0.54.0-150400.3.10.4
KubeVirt current version: 0.54.0-150400.3.7.1, target version: 0.54.0-150400.3.10.4
KubeVirt current version: 0.54.0-150400.3.7.1, target version: 0.54.0-150400.3.10.4
KubeVirt current version: 0.54.0-150400.3.7.1, target version: 0.54.0-150400.3.10.4
KubeVirt current version: 0.54.0-150400.3.7.1, target version: 0.54.0-150400.3.10.4
KubeVirt current version: 0.54.0-150400.3.7.1, target version: 0.54.0-150400.3.10.4
KubeVirt current version: 0.54.0-150400.3.7.1, target version: 0.54.0-150400.3.10.4
Waiting for LH settling down...
Waiting for longhorn-manager to be upgraded...
Checking instance-manager-r pod on node harvester-node-0...

Additionally, as a smoke test, validated that a configured ClusterFlow & ClusterOutput do not cause any issues across v1.2.1 -> v1.2-head -> v1.3-head upgrades.

Configured with:

    sudo sysctl -w vm.max_map_count=262144
    docker run -d --name elasticsearch -p 9200:9200 -p 9300:9300 -e xpack.security.enabled=false -e node.name=es01 -it docker.elastic.co/elasticsearch/elasticsearch:6.8.23
    docker run -d --name kibana --link elasticsearch:es_alias --env "ELASTICSEARCH_URL=http://es_alias:9200" -p 5601:5601 -it docker.elastic.co/kibana/kibana:6.8.23

ElasticSearch: 6.8.23, Kibana: 6.8.23

With the ElasticSearch index and user built like the attached Postman collection: Sample ElasticSearch Setup.postman_collection.json

Upgrade logs:
hvst-upgrade-n9nhz-upgradelog-archive-2024-03-04T22-55-56Z.zip
hvst-upgrade-q8dwv-upgradelog-archive-2024-03-04T20-44-41Z.zip

Screenshot from 2024-03-04 17-44-42 Screenshot from 2024-03-04 15-03-44

I'll go ahead and close this out :smile: