osism / issues

This repository is used for bug reports that are cross-project or not bound to a specific repository (or to an unknown repository).
https://www.osism.tech

Live-Migration/ migration issues #645

Closed Nils98Ar closed 11 months ago

Nils98Ar commented 1 year ago

Hi,

Every time we evacuate a host (in preparation for maintenance of that host) there are some instances that cause problems. Every time for different reasons, I think... :)

This time there were 5 instances that we could not live-migrate. When we clicked on one of these instances in horizon, we only got a generic "something went wrong" error page. Two of them were K8s worker nodes created by the SCS k8s-cluster-api-provider, but all the other K8s worker nodes could be migrated without any problems.

These "volume not found" errors were the first ones in the nova-compute.log on the source host while running the live-migrations. The openstack migrate command as well as horizon returned no error but the migration did not happen. The instances remained running on the source host:

2023-08-23 18:52:00.142 8 ERROR nova.compute.manager [-] [instance: d51e7a64-e47c-4a1b-8022-4776788ddd0e] Pre live migration failed at cvi09902: nova.exception_Remote.VolumeNotFound_Remote: Volume 44785f86-7fde-45fe-911a-58a3065e4cd8 could not be found.

2023-08-23 18:53:46.933 8 ERROR nova.volume.cinder [None req-431d1918-7734-44a5-8f90-6fa68c01cf3c 19cef1e468cf4089873b262f5e58c629 e5bb574f8f8d4c72bc216801b6cd77de - - default default] [instance: d51e7a64-e47c-4a1b-8022-4776788ddd0e] Create attachment failed for volume 44785f86-7fde-45fe-911a-58a3065e4cd8. Error: Volume 44785f86-7fde-45fe-911a-58a3065e4cd8 could not be found. (HTTP 404) (Request-ID: req-9cd37529-f96c-491b-8a2e-11346fcb851b) Code: 404: cinderclient.exceptions.NotFound: Volume 44785f86-7fde-45fe-911a-58a3065e4cd8 could not be found. (HTTP 404) (Request-ID: req-9cd37529-f96c-491b-8a2e-11346fcb851b)

2023-08-23 18:53:47.738 8 ERROR oslo_messaging.rpc.server [None req-431d1918-7734-44a5-8f90-6fa68c01cf3c 19cef1e468cf4089873b262f5e58c629 e5bb574f8f8d4c72bc216801b6cd77de - - default default] Exception during message handling: nova.exception.VolumeNotFound: Volume 44785f86-7fde-45fe-911a-58a3065e4cd8 could not be found.
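(Side note for anyone debugging the same thing: the volume and instance UUIDs from the log can be cross-checked directly against the APIs. This is only a rough sketch with the standard openstack CLI; output fields vary by client version:)

# does cinder still know the volume at all?
openstack volume show 44785f86-7fde-45fe-911a-58a3065e4cd8
# nova's view of the instance, including which volumes it thinks are attached
openstack server show d51e7a64-e47c-4a1b-8022-4776788ddd0e
# recent api-visible actions on the instance
openstack server event list d51e7a64-e47c-4a1b-8022-4776788ddd0e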

Later there were these errors, probably after we stopped the instance and tried to migrate it. At this point the instance always went into an error state when we tried to start or hard reboot it.

2023-08-23 19:02:10.886 7 WARNING nova.virt.libvirt.driver [None req-dc939878-307f-4104-98a4-2029c717140f - - - - - -] An error occurred while updating compute node resource provider status to "enabled" for provider: 877ed2d5-bd04-4134-8e87-d22e176377e7: ValueError: No such provider 877ed2d5-bd04-4134-8e87-d22e176377e7

2023-08-23 23:10:24.792 7 ERROR oslo_messaging.rpc.server [req-43543069-53f9-4649-9fac-0d56e0b50fe2 req-104ce14f-5f97-4181-a7fc-59b35e3bdbc1 4d8994e365084b5b891eb8278f128477 91c343c9091f4b8d9aacd4262f2560ae - - default default] Exception during message handling: libvirt.libvirtError: Unable to write to monitor: Broken pipe

2023-08-24 09:08:55.728 7 ERROR oslo_messaging.rpc.server [req-36a2bce3-e59d-4237-b0c6-8766a14b0d79 req-b72b754c-095e-4e68-a94a-5ab4d892d958 4d8994e365084b5b891eb8278f128477 91c343c9091f4b8d9aacd4262f2560ae - - default default] Exception during message handling: libvirt.libvirtError: internal error: End of file from qemu monitor (vm='instance-000016b7')

We found no way to get them running on any host, so it was faster to delete and recreate the 5 instances.

Do you have any idea why these errors could happen? The 5 instances were running without any issues before. For some reason OpenStack expected a volume to be available that was not there? And then the qemu errors...

artificial-intelligence commented 1 year ago

I have personally seen at least similar errors with a k8s provider creating openstack VMs, where the volumes were changed later by the k8s operator but the change didn't apply cleanly because of temporary openstack api downtime.

You could check this yourself by looking at the logs for all api calls regarding this VM and all its volumes. Often one can see failed api calls when some k8s provider detaches/reattaches a volume to a VM but just carries on without acting on the returned error code.
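Roughly like this, as a sketch (log paths assume a kolla-based deployment such as OSISM; adjust to your environment):

# who touched the volume? (run on the control nodes)
grep 44785f86-7fde-45fe-911a-58a3065e4cd8 /var/log/kolla/cinder/cinder-api.log
# which api calls were made against the instance?
grep d51e7a64-e47c-4a1b-8022-4776788ddd0e /var/log/kolla/nova/nova-api.log
# cinder's current view of attachments (needs a recent openstackclient and cinder api microversion)
openstack --os-volume-api-version 3.27 volume attachment list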

This might result in stale volumes that are still attached from nova's perspective but are in fact no longer attached, because the k8s provider removed them.

When the live migration occurs, the new instance on the new hypervisor of course can't attach the old volume any more, because it was already deleted, but nova didn't get the message.

This can be worked around most of the time by making sure that the k8s providers are themselves in maintenance mode while the openstack apis are unavailable during maintenance.
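For example (only a sketch; the exact controller deployment name and namespace depend on how your cluster-api provider is installed, here assuming the default cluster-api-provider-openstack layout):

# pause reconciliation before the openstack maintenance window
kubectl -n capo-system scale deployment capo-controller-manager --replicas=0
# ...do the openstack maintenance...
# resume afterwards
kubectl -n capo-system scale deployment capo-controller-manager --replicas=1

Alternatively, individual cluster-api Cluster objects can be paused via spec.paused instead of scaling the controllers down.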

Another thing to monitor closely is the uptime of the various openstack apis, making sure that no third-party problems (like network equipment issues) lead to unnecessary downtime, which can also happen.

But first you should investigate why the volumes were not there when nova thought they were.

Api downtime is only one reason; operator error (i.e. an accidental manual deletion) is another; and there are more.

Looking at the api logs for the volume and VM UUIDs should tell you what actions were taken that led to this behaviour.

Nils98Ar commented 1 year ago

@artificial-intelligence First of all thank you!

Does this also explain the "something went wrong" error page when trying to open the instance details in horizon?

Do you have any idea how to manually fix this issue if it occurs again? For now we have deleted and recreated the instances…

artificial-intelligence commented 1 year ago

@Nils98Ar I'm not personally doing much with horizon because of exactly these kinds of errors, which tend not to happen when interacting with the CLI. In general I find the CLI UX much more reliable and would advise using it wherever possible. I realize this might not be an option for non-technical end users or in otherwise constrained environments.

I think the horizon web UI might not be able to handle api errors, e.g. when a VM reports its volumes missing. I could imagine that these error paths are not that well tested, sadly.

It might be good to open an upstream bug if you have specific error messages and a specific workflow that reproduces the issue.

Nils98Ar commented 1 year ago

Currently we have no more "faulty" instances but I will come back to it next time ;)

Nils98Ar commented 1 year ago

@artificial-intelligence Do you have any idea where we can fix this inconsistency if it occurs again? In the nova database?

berendt commented 11 months ago

Looks like this is done. Please re-open if you have the error again in the future.