vertica / vertica-kubernetes

Operator, container and Helm chart to deploy Vertica in Kubernetes
Apache License 2.0

Reinstall packages during upgrade #717

Closed jizhuoyu closed 8 months ago

jizhuoyu commented 8 months ago

In Vertica, upgrading the server version requires reinstalling packages because they are tied to a specific server version. This process was handled automatically during admintools deployments, but was not implemented for vclusterOps deployments. To address this issue, we have modified the upgrade process to include a package reinstallation step after restarting Vertica with the new version.

jizhuoyu commented 8 months ago

As discussed, the e2e test fails because there is not enough disk space:

spilchen commented 8 months ago

I saw that the new e2e test still fails because there isn't enough disk space. So, I started to look at the disk space usage of a database.

  1. New databases created without package install: communal 8K, each node is 11MB
  2. New databases created with packages installed: communal: 162MB, each node is 335MB.
  3. When upgrading 2., the disk space usage after the upgrade is: communal 323MB, each node is 394MB. Although I did see the per node usage spike to 576MB before it went down.
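
To make the growth easier to compare, the figures above can be put side by side in a short calculation. This is only a sketch in Python: the variable names are made up for illustration, and the numbers are the ones quoted in the list, in MB.

```python
# Disk usage figures quoted in the list above, in MB.
# Variable names are illustrative only.
communal_before, communal_after = 162, 323
node_before, node_after, node_peak = 335, 394, 576


def growth(before: float, after: float) -> float:
    """Return the ratio after/before, rounded to two decimals."""
    return round(after / before, 2)


communal_growth = growth(communal_before, communal_after)
node_growth = growth(node_before, node_after)
node_peak_growth = growth(node_before, node_peak)

print(f"communal storage after upgrade: {communal_growth}x")
print(f"per-node usage after upgrade:   {node_growth}x")
print(f"per-node usage at its peak:     {node_peak_growth}x")
```

On these numbers communal storage roughly doubles, and each node needs headroom for about 1.7 times its pre-upgrade footprint while the upgrade is in flight.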

Here is my suggestion for an attempt at fixing it:

jizhuoyu commented 8 months ago

> I saw that the new e2e test still fails because there isn't enough disk space. So, I started to look at the disk space usage of a database.
>
>   1. New databases created without package install: communal 8K, each node is 11MB
>   2. New databases created with packages installed: communal: 162MB, each node is 335MB.
>   3. When upgrading 2., the disk space usage after the upgrade is: communal 323MB, each node is 394MB. Although I did see the per node usage spike to 576MB before it went down.
>
> Here is my suggestion for an attempt at fixing it:
>
>   • Change the e2e tests to use the initPolicy of CreateSkipPackageInstall and remove the package verification steps. This will keep the disk requirements low.
>   • Keep running the tests serially.
>   • Add 2 new tests to the same leg that will verify package install through upgrade, one for online and one for offline upgrade. The only difference is that these tests will use a single node. We probably only need to do 1 upgrade here to verify things are working.
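
For reference, the first bullet amounts to a one-field change in the VerticaDB custom resource. Below is a rough sketch, written as a Python dict so it can be dumped to JSON and piped to `kubectl apply -f -`. Only `initPolicy: CreateSkipPackageInstall` comes from this thread; the `apiVersion`, the resource name, and the subcluster layout are assumptions, and a real manifest needs more fields (image, communal storage, and so on).

```python
import json


def make_vdb(name: str, init_policy: str = "CreateSkipPackageInstall") -> dict:
    """Build a minimal, deliberately incomplete VerticaDB manifest.

    Hypothetical helper for illustration; it is not part of the operator.
    """
    return {
        # Assumption: adjust to the CRD version installed in the cluster.
        "apiVersion": "vertica.com/v1",
        "kind": "VerticaDB",
        "metadata": {"name": name},
        "spec": {
            # Skip the package install at create time to keep disk usage low.
            "initPolicy": init_policy,
            # Assumption: a single three-node subcluster.
            "subclusters": [{"name": "sc1", "size": 3}],
        },
    }


# "v-upgrade-vertica" is a placeholder name, not one taken from the tests.
manifest = make_vdb("v-upgrade-vertica")
print(json.dumps(manifest, indent=2))
```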

I tested one commit in which the package install steps are still kept in the 2 original upgrade tests. The results of all 4 tests are as follows:

  1. offline install and upgrade once: 23.4->24.1 failed
  2. online install and upgrade once: 23.4->24.1 failed
  3. online install and upgrade 3 times: 12.0.4->23.4 succeeded, 23.4->24.1 failed, 24.1->latest never ran
  4. offline install and upgrade 3 times: package install verification right after the first createdb for 12.0.4 failed, so none of the 3 upgrades ran

With the latest commits we essentially have 2 tests that only upgrade (3 times) and 2 tests that upgrade and install packages (once, from 23.4->24.1). The former 2 tests passed and the latter 2 failed.

The latter 2 tests might pass if they upgraded and installed from 12.0.4->23.4 rather than 23.4->24.1. However, that would mean we are testing package install for admintools only.

spilchen commented 8 months ago

Thanks for trying these experiments. It doesn't look like we'll be able to automate your tests on account of the disk space constraint. Manual verification will have to do for now. Can you remove those two new tests you added? We can add back the parallelism to leg 8 as well. Can we get the other tests in e2e leg 8 back to what they were before? I think you removed one of the upgrade versions.

jizhuoyu commented 8 months ago

As discussed, leg 8 now has 2 tests remaining (upgrade 3 times, from 12.0.4 to 23.4 to 24.1 to latest), one for online and one for offline upgrade, both with CreateSkipPackageInstall as the initPolicy. I added several steps that wait for condition=UpgradeInProgress=False, as I think it's good to confirm that we actually pass the last step of an upgrade. I also added comments in setup-vdb.yaml that state clearly why we are skipping package install. @spilchen
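
The wait steps mentioned here could look roughly like the sketch below. It assumes the check is done with `kubectl wait` against the VerticaDB resource, addressed through the `vdb` short name. The resource name, namespace, and timeout are placeholders rather than values taken from the tests. The second helper performs the same check on an object already fetched as JSON.

```python
from typing import List


def wait_cmd(name: str, namespace: str, timeout: str = "600s") -> List[str]:
    """Build the argv for a kubectl wait on UpgradeInProgress=False.

    Hypothetical helper; pass the result to subprocess.run to execute it.
    """
    return [
        "kubectl", "wait",
        "--for=condition=UpgradeInProgress=False",
        f"vdb/{name}",  # assumption: 'vdb' resolves to the VerticaDB CRD
        "-n", namespace,
        f"--timeout={timeout}",
    ]


def condition_is(obj: dict, cond_type: str, expected: str) -> bool:
    """Check a status condition on a resource decoded from JSON."""
    for cond in obj.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond.get("status") == expected
    return False  # condition not reported at all


# Placeholder object shaped like `kubectl get ... -o json` output.
sample = {"status": {"conditions": [
    {"type": "UpgradeInProgress", "status": "False"},
]}}

print(" ".join(wait_cmd("v-upgrade-vertica", "default")))
print(condition_is(sample, "UpgradeInProgress", "False"))
```

Treating a missing condition as "not satisfied" is a design choice in this sketch: it avoids reporting success before the operator has written any status.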