openstack-k8s-operators / nova-operator

Apache License 2.0

In the event a guest cannot be scheduled, attempting to store in cell0 fails due to database username #449

Closed jamepark4 closed 1 year ago

jamepark4 commented 1 year ago

When a guest fails to schedule to a host and nova attempts to store it in cell0, the operation fails because of the default database username:

2023-07-10 21:01:56.629 1 ERROR nova.context pymysql.err.OperationalError: (1044, "Access denied for user 'nova_cell0'@'%' to database 'nova_cell0_cell0'")
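The database name in the error, nova_cell0_cell0, looks like the cell suffix was applied twice: unconditionally appending `_<cellName>` to a value that already contains it produces exactly this string. A minimal sketch of the suspected pattern (the helper name is hypothetical, not the operator's actual code):

```python
def cell_database_name(base: str, cell_name: str) -> str:
    """Append the cell suffix only if it is not already present."""
    suffix = f"_{cell_name}"
    if base.endswith(suffix):
        return base  # already suffixed, e.g. "nova_cell0"
    return base + suffix

# Unconditional appending reproduces the string from the error:
buggy = "nova_cell0" + "_cell0"                     # "nova_cell0_cell0"
fixed = cell_database_name("nova_cell0", "cell0")   # "nova_cell0"
```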

It currently appears that the default deployment sets cellDatabaseUser as follows:

[stack@sriov01 ~]$ oc get novacell/nova-cell0 -o yaml
apiVersion: nova.openstack.org/v1beta1
kind: NovaCell
metadata:
  creationTimestamp: "2023-07-10T20:25:45Z"
  generation: 1
  name: nova-cell0
  namespace: openstack
  ownerReferences:
  - apiVersion: nova.openstack.org/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Nova
    name: nova
    uid: 1a841562-0239-46c5-803d-417eacb268c4
  resourceVersion: "52531"
  uid: 1b2d6292-543b-4ef5-b65e-58ea90574c2e
spec:
  apiDatabaseHostname: openstack
  apiDatabaseUser: nova_api
  cellDatabaseHostname: openstack
  cellDatabaseUser: nova_cell0
  cellMessageBusSecretName: rabbitmq-transport-url-nova-api-transport
  cellName: cell0
  conductorServiceTemplate:
    containerImage: quay.io/podified-antelope-centos9/openstack-nova-conductor:current-podified
    customServiceConfig: ""
    replicas: 1
    resources: {}

Full logs attached below with server creation uuid being: req-68ec3082-2ed4-4519-954f-887a6dfe7b74

guest_fails_to_schedule.log

gibizer commented 1 year ago

Which environment do you use for this? What does your OpenstackControlPlane CR look like?

I tried to reproduce it but failed. What I did:

  1. start a new crc instance
  2. make crc_attach_default_interface # in devsetup
  3. for suffix in {0..1} ; do EDPM_COMPUTE_SUFFIX=$suffix make edpm_compute && sleep 60 && EDPM_COMPUTE_SUFFIX=$suffix make edpm_compute_repos ; done # in devsetup, to start the EDPM VMs
  4. make crc_storage
  5. make openstack
  6. make openstack_deploy
  7. DATAPLANE_SINGLE_NODE=false make edpm_deploy

Then I created a flavor with 10 vCPUs that does not fit on any of the computes, as each has only 2 vCPUs. Then I tried to boot a VM with that flavor. The boot failed with NoValidHost as expected. I don't see any db errors in the conductor logs, and I see the instance stored in the cell0 DB properly.
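For anyone retrying this reproduction, the steps above can be sketched with the OpenStack CLI roughly as follows (the flavor, image, and network names are placeholders for whatever exists in the test environment):

```shell
# Flavor larger than any compute (each EDPM node has only 2 vCPUs).
openstack flavor create --vcpus 10 --ram 512 --disk 1 huge-flavor

# The boot should fail with NoValidHost and the instance record
# should land in the cell0 database.
openstack server create --flavor huge-flavor --image cirros \
  --network private novalidhost-test

# The fault field should report that no valid host was found.
openstack server show novalidhost-test -f value -c fault
```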

gibizer commented 1 year ago

From your error message, nova_cell0_cell0 seems wrong.

gibizer commented 1 year ago

What is the output of oc rsh nova-cell0-conductor-0 nova-manage cell_v2 list_cells in your env? In mine:

Modules with known eventlet monkey patching issues were imported prior to eventlet monkey patching: urllib3. This warning can usually be ignored if the caller is only importing and not executing nova code.
+-------+--------------------------------------+----------------------------------------------------------------------------------+------------------------------------------------------------+----------+
|  Name |                 UUID                 |                                  Transport URL                                   |                    Database Connection                     | Disabled |
+-------+--------------------------------------+----------------------------------------------------------------------------------+------------------------------------------------------------+----------+
| cell0 | 00000000-0000-0000-0000-000000000000 |                                     rabbit:                                      |    mysql+pymysql://nova_cell0:****@openstack/nova_cell0    |  False   |
| cell1 | fe34f679-292c-4460-9de7-6d06d9a57fca | rabbit://default_user_wVHi3_Bu6QYOIVso2pB:****@rabbitmq-cell1.openstack.svc:5672 | mysql+pymysql://nova_cell1:****@openstack-cell1/nova_cell1 |  False   |
+-------+--------------------------------------+----------------------------------------------------------------------------------+------------------------------------------------------------+----------+
jamepark4 commented 1 year ago

I was using defaults with the ci_framework and went directly into tempest with default concurrency. The default compute sizes are well below what is acceptable for running tempest. I've redeployed with two computes that fit the tempest recommendations, and while I'm still hitting some failures to schedule, I am no longer seeing the database error in this environment. I'll let you know if I can recreate the failure with this current environment, or I'll use the approach you are using when deploying. Below are the cell_v2 list_cells details for the environment.

[stack@sriov01 nova_logs]$  oc rsh nova-cell0-conductor-0 nova-manage cell_v2 list_cells
Modules with known eventlet monkey patching issues were imported prior to eventlet monkey patching: urllib3. This warning can usually be ignored if the caller is only importing and not executing nova code.
+-------+--------------------------------------+----------------------------------------------------------------------------------+------------------------------------------------------------+----------+
|  Name |                 UUID                 |                                  Transport URL                                   |                    Database Connection                     | Disabled |
+-------+--------------------------------------+----------------------------------------------------------------------------------+------------------------------------------------------------+----------+
| cell0 | 00000000-0000-0000-0000-000000000000 |                                     rabbit:                                      |    mysql+pymysql://nova_cell0:****@openstack/nova_cell0    |  False   |
| cell1 | de13b4c2-ddc9-45af-8e58-f7f232172b17 | rabbit://default_user_tbCFe7oorPMk_iouX5j:****@rabbitmq-cell1.openstack.svc:5672 | mysql+pymysql://nova_cell1:****@openstack-cell1/nova_cell1 |  False   |
+-------+--------------------------------------+----------------------------------------------------------------------------------+------------------------------------------------------------+----------+
[stack@sriov01 nova_logs]$ 
gibizer commented 1 year ago

The cell mapping you copied above is correct; there is no nova_cell0_cell0 mentioned there. If you still have this issue, or if you see it again, it would be nice to look around in the env. I assume it is a wrong database connection config in the nova-cell-conductor statefulset, as we ruled out the cell mapping above. But it is really strange that the job that created the mapping had a good config while the conductor doesn't have a good config in the same deployment.
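If the issue reappears, one quick way to compare a suspect connection string against the cell mapping is to parse out the database path segment; a small sketch using only the standard library (the example URLs mirror the list_cells output and the error above):

```python
from urllib.parse import urlparse

def db_name(connection_url: str) -> str:
    """Extract the database name from a SQLAlchemy-style connection URL."""
    return urlparse(connection_url).path.lstrip("/")

good = "mysql+pymysql://nova_cell0:****@openstack/nova_cell0"
bad = "mysql+pymysql://nova_cell0:****@openstack/nova_cell0_cell0"

print(db_name(good))  # the expected cell0 database name
print(db_name(bad))   # the name seen in the access-denied error
```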

gibizer commented 1 year ago

Feel free to reopen it if you see it again.