piraeusdatastore / piraeus-operator

The Piraeus Operator manages LINSTOR clusters in Kubernetes.
https://piraeus.io/
Apache License 2.0
387 stars 61 forks source link

CS Controller crashes while loading data from etcd #439

Closed tobg closed 1 year ago

tobg commented 1 year ago

Hello,

we have the following issue and are not able to locate the cause or fix it. It happened out of nowhere. We did no updates or any changes on our k8s infra for more than a week.

time="2023-03-23T21:19:17Z" level=info msg="running k8s-await-election" version=refs/tags/v0.3.1 I0323 21:19:17.987798 1 leaderelection.go:248] attempting to acquire leader lease piraeus-system/piraeus-op-cs... I0323 21:19:18.038712 1 leaderelection.go:258] successfully acquired lease piraeus-system/piraeus-op-cs time="2023-03-23T21:19:18Z" level=info msg="long live our new leader: 'piraeus-op-cs-controller-75876d475f-2dl7t'!" time="2023-03-23T21:19:18Z" level=info msg="starting command '/usr/bin/piraeus-entry.sh' with arguments: '[startController]'" LINSTOR, Module Controller Version: 1.20.0 (9c6f7fad48521899f7a99c564b1d33aeacfdbfa8) Build time: 2022-10-18T07:19:30+00:00 Java Version: 11 Java VM: Debian, Version 11.0.16+8-post-Debian-1deb11u1 Operating system: Linux, Version 5.15.0-56-generic Environment: amd64, 1 processors, 7750 MiB memory reserved for allocations

System components initialization in progress

Loading configuration file "/etc/linstor/linstor.toml" 21:19:19.243 [main] INFO LINSTOR/Controller - SYSTEM - ErrorReporter DB first time init. 21:19:19.246 [main] INFO LINSTOR/Controller - SYSTEM - Log directory set to: '/var/log/linstor-controller' 21:19:19.287 [main] INFO LINSTOR/Controller - SYSTEM - Database type is etcd 21:19:19.288 [Main] INFO LINSTOR/Controller - SYSTEM - Loading API classes started. 21:19:19.871 [Main] INFO LINSTOR/Controller - SYSTEM - API classes loading finished: 582ms 21:19:19.871 [Main] INFO LINSTOR/Controller - SYSTEM - Dependency injection started. 21:19:19.888 [Main] INFO LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule" 21:19:19.889 [Main] INFO LINSTOR/Controller - SYSTEM - Extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule" is not installed 21:19:19.889 [Main] INFO LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule" 21:19:19.902 [Main] INFO LINSTOR/Controller - SYSTEM - Dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule" was successful 21:19:19.902 [Main] INFO LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule" 21:19:19.903 [Main] INFO LINSTOR/Controller - SYSTEM - Dynamic load of extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule" was successful 21:19:21.129 [Main] INFO LINSTOR/Controller - SYSTEM - Dependency injection finished: 1258ms 21:19:21.130 [Main] INFO LINSTOR/Controller - SYSTEM - Cryptography provider: Using default cryptography module 21:19:21.480 [Main] INFO LINSTOR/Controller - SYSTEM - Initializing authentication subsystem 21:19:21.824 [Main] INFO LINSTOR/Controller - SYSTEM - SpaceTracking using ETCD driver 21:19:21.827 [Main] INFO LINSTOR/Controller - SYSTEM - SpaceTrackingService: Instance added as a system service 21:19:21.828 [Main] INFO LINSTOR/Controller - SYSTEM - Starting service instance 'TimerEventService' of type TimerEventService 21:19:21.829 [Main] INFO LINSTOR/Controller - SYSTEM - Initializing the etcd database 21:19:21.829 [Main] INFO LINSTOR/Controller - SYSTEM - etcd connection URL is "etcd://piraeus-op-etcd:2379" 21:19:22.679 [Main] INFO LINSTOR/Controller - SYSTEM - Starting service instance 'ETCDDatabaseService' of type ETCDDatabaseService 21:19:22.682 [Main] INFO LINSTOR/Controller - SYSTEM - Loading security objects 21:19:22.866 [Main] INFO LINSTOR/Controller - SYSTEM - Current security level is NO_SECURITY 21:19:23.111 [Main] INFO LINSTOR/Controller - SYSTEM - Core objects load from database is in progress 21:19:42.263 [Main] ERROR LINSTOR/Controller - SYSTEM - Problem of type 'java.lang.NullPointerException' logged to report number 641CC256-00000-000000

21:19:42.271 [Main] ERROR LINSTOR/Controller - SYSTEM - Unhandled exception [Report number 641CC256-00000-000001]

21:19:42.273 [Thread-2] INFO LINSTOR/Controller - SYSTEM - Shutdown in progress 21:19:42.274 [Thread-2] INFO LINSTOR/Controller - SYSTEM - Shutting down service instance 'EbsStatusPoll' of type EbsStatusPoll 21:19:42.275 [Thread-2] INFO LINSTOR/Controller - SYSTEM - Waiting for service instance 'EbsStatusPoll' to complete shutdown 21:19:42.276 [Thread-2] INFO LINSTOR/Controller - SYSTEM - Shutting down service instance 'ScheduleBackupService' of type ScheduleBackupService 21:19:42.276 [Thread-2] INFO LINSTOR/Controller - SYSTEM - Waiting for service instance 'ScheduleBackupService' to complete shutdown 21:19:42.277 [Thread-2] INFO LINSTOR/Controller - SYSTEM - Shutting down service instance 'SpaceTrackingService' of type SpaceTrackingService 21:19:42.277 [Thread-2] INFO LINSTOR/Controller - SYSTEM - Waiting for service instance 'SpaceTrackingService' to complete shutdown 21:19:42.277 [Thread-2] INFO LINSTOR/Controller - SYSTEM - Shutting down service instance 'TaskScheduleService' of type TaskScheduleService 21:19:42.277 [Thread-2] INFO LINSTOR/Controller - SYSTEM - Waiting for service instance 'TaskScheduleService' to complete shutdown 21:19:42.277 [Thread-2] INFO LINSTOR/Controller - SYSTEM - Shutting down service instance 'ETCDDatabaseService' of type ETCDDatabaseService 21:19:42.281 [Thread-2] INFO LINSTOR/Controller - SYSTEM - Waiting for service instance 'ETCDDatabaseService' to complete shutdown 21:19:42.281 [Thread-2] INFO LINSTOR/Controller - SYSTEM - Shutting down service instance 'TimerEventService' of type TimerEventService 21:19:42.282 [Thread-2] INFO LINSTOR/Controller - SYSTEM - Waiting for service instance 'TimerEventService' to complete shutdown 21:19:42.282 [Thread-2] INFO LINSTOR/Controller - SYSTEM - Shutdown complete time="2023-03-23T21:19:42Z" level=fatal msg="failed to run" err="exit status 20"

WanzenBug commented 1 year ago

Hmm, this might be related to an issue identified in Linstor. The recent 1.21.1 should fix this. We haven't build an image for that yet, but you can build it yourself from the piraeus repo.

tobg commented 1 year ago

Hi, thank you. Im not sure how to do this. I know how to build images from Dockerfile but I cannot find anything in the repo that makes sense to me. Can you please give me some hints?

Is the following a good start? https://quay.io/repository/piraeusdatastore/piraeus-server/manifest/sha256:607c8f4f44ef8678c2be9a09877cbf8a11d4413074ec13714925d4b96ba585fe

WanzenBug commented 1 year ago

Updated images will be available soon: https://github.com/piraeusdatastore/piraeus/pull/136

tobg commented 1 year ago

Thank you very much. I really appreciate it! :)

I now get a new error:

11:08:35.180 [Main] INFO LINSTOR/Controller - SYSTEM - Core objects load from database is in progress 11:08:58.133 [Main] ERROR LINSTOR/Controller - SYSTEM - Database entry of table NODE_STOR_POOL could not be restored. [Report number 64217921-00000-000000]

11:08:58.148 [Main] ERROR LINSTOR/Controller - SYSTEM - Unhandled exception [Report number 64217921-00000-000001]

WanzenBug commented 1 year ago

Well, I guess that is progress :disappointed:

You may try your luck on the linstor-server github page or in the community slack

tobg commented 1 year ago

Thank you!

I asked at linstor-server and tried older backups. It seams as if this issue has existed for a while and only had an impact after the controller had restarted. Unfortunately, no old backup works.

The backup size of the etcd database is 1.7Gb, quit large, I guess, for about 50 pvc.

tobg commented 1 year ago

This issue was solved by adding some missing keys to etcd. Why the keys were missing is unknown, as we never used linstor 1.20.1 or 1.20.2, which had a known bug that deleted too much data in etcd.