rh-aiservices-bu / parasol-insurance

Source for the "Parasol Insurance" Lab
https://rh-aiservices-bu.github.io/parasol-insurance/
MIT License

mariadb deployment for Pipelines may crash #67

Open guimou opened 2 months ago

guimou commented 2 months ago

Brand new deployment. Everything is fine except for one user (screenshot attached).

Error is:

...
2024-04-19 12:27:51 0 [Note] Crash recovery finished.
2024-04-19 12:27:51 0 [ERROR] Missing system table mysql.proxies_priv; please run mysql_upgrade to create it
2024-04-19 12:27:51 0 [Warning] Can't open and lock time zone table: Table 'mysql.time_zone_leap_second' doesn't exist trying to live without them
2024-04-19 12:27:51 0 [ERROR] Cannot open mysql.event
2024-04-19 12:27:51 0 [ERROR] mysqld: Event Scheduler: An error occurred when initializing system tables. Disabling the Event Scheduler.
2024-04-19 12:27:51 6 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1146: Table 'mysql.gtid_slave_pos' doesn't exist
2024-04-19 12:27:51 0 [Note] Reading of all Master_info entries succeeded
2024-04-19 12:27:51 0 [Note] Added new Master_info '' to hash table
2024-04-19 12:27:51 0 [Note] /usr/libexec/mysqld: ready for connections.
Version: '10.3.39-MariaDB'  socket: '/tmp/mysql.sock'  port: 0  MariaDB Server
2024-04-19 12:27:52 8 [Warning] Access denied for user 'root'@'localhost' (using password: NO)
---> 12:27:52     MySQL started successfully
=> sourcing 40-datadir-action.sh ...
---> 12:27:52     Running datadir action: upgrade-warn
---> 12:27:52     MySQL server version check passed, both server and data directory are version 10.3.
=> sourcing 50-passwd-change.sh ...
---> 12:27:52     Setting passwords ...
2024-04-19 12:27:52 9 [Warning] Access denied for user 'root'@'localhost' (using password: NO)
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)
---> 12:27:52     WARNING: User mlpipeline does not exist in database. Password not changed.
2024-04-19 12:27:52 10 [Warning] Access denied for user 'root'@'localhost' (using password: NO)
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)
guimou commented 2 months ago

Workaround:

RHRolun commented 2 months ago

Some observations/notes:

  • It's user1 that fails, so the first user that gets created; maybe something is not ready on the cluster by the time it's starting.
  • It seems to be something with the MySQL server; maybe we can check that it's fully running before starting the pipeline job.

guimou commented 2 months ago

Yes, an initContainer might be appropriate.
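
A minimal sketch of what such an initContainer (or the pipeline server's entrypoint) could run, assuming the MariaDB Service is reachable under a name like mariadb-pipelines and that a mysql client is available in the image (both names are assumptions, not taken from the lab):

#!/bin/bash
# Hypothetical readiness gate: block until MariaDB answers a ping,
# giving up after roughly five minutes.
for i in $(seq 1 60); do
  if mysqladmin ping -h mariadb-pipelines --silent; then
    echo "mariadb is ready"
    exit 0
  fi
  echo "waiting for mariadb ($i/60)..."
  sleep 5
done
echo "mariadb never became ready" >&2
exit 1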


erwangranger commented 2 months ago

I booked an env with 50 users. Only user1 fails.

In the event log:

MountVolume.MountDevice failed for volume "pvc-37a3562e-9bf7-4749-affd-3990cc7674e5" : 
kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: 
driver name openshift-storage.rbd.csi.ceph.com not found in the list of registered CSI drivers


Scaling to zero and back to 1 does not help.

Deleting and recreating the DSPA from the UI works like a charm.

RHRolun commented 2 months ago

Recreating the DataSciencePipelinesApplication resource also fixes the issue.
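
For reference, a hedged sketch of doing the same recreation from the CLI rather than the UI; the resource name ("dspa") and project ("userN-project") are placeholders, not values from the lab:

# Back up, delete, and re-apply the DSPA so the operator rebuilds the mariadb
# deployment (and, depending on owner references, its PVC).
oc get datasciencepipelinesapplication dspa -n userN-project -o yaml > dspa-backup.yaml
# Strip status, resourceVersion and uid from the backup before re-applying it.
oc delete datasciencepipelinesapplication dspa -n userN-project
oc apply -f dspa-backup.yaml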

erwangranger commented 2 months ago

Can we loop from user N to 1 instead of 1 to N, so only the last user has a screwy mariadb?

Asking for a friend. :-)

guimou commented 2 months ago

Or we block user1 after the lab is deployed...


erwangranger commented 2 months ago

In a new env, out of 50 users, ... all mariadbs are working, but for some reason user26 has a VolumeMount issue in the notebook. Could it be that the first PVC requested from the storage class ends up failing, and all the later ones work?

RHRolun commented 2 months ago

Could it be that the storage class is not properly set up yet? Although it's strange that only a single one fails in that case. Do you happen to have the logs?

Edit: this is the error:

MountVolume.MountDevice failed for volume "pvc-25d2bbe0-05f2-43a6-abdb-c28bc74fbcf4" : rpc error: code = Internal desc = rbd image ocs-storagecluster-cephblockpool/csi-vol-77863682-a89f-4708-96c0-c5f000b5dd17 is still being used

RHRolun commented 2 months ago

Restarting the pod did not fix the PVC issue. Removing the PVC and re-creating it fixed it. The only hint I could find is that the rbd image might be blocked, but I'm not sure how to inspect that yet. Spinning up environments to see if I can reproduce both issues.
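
One hedged way to inspect that, assuming the rook-ceph toolbox is enabled in the openshift-storage namespace (the pool and image names below are the ones from the error above):

# From the ODF/rook-ceph toolbox, list who is still holding the rbd image.
oc rsh -n openshift-storage deploy/rook-ceph-tools \
  rbd status ocs-storagecluster-cephblockpool/csi-vol-77863682-a89f-4708-96c0-c5f000b5dd17
# The "Watchers:" section of the output shows the client(s) that still have the image mapped.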

RHRolun commented 2 months ago

Got some help from engineering, and by the looks of it there is database corruption, which they think comes from kube handing out dirty PVCs in this case. I also saw a couple of cases where the PVC attached to the workbench was causing issues and preventing the workbench from starting. The tools I added should help us get things set up smoothly.

To fix the automation, I could have something that waits for the mariadb and workbench pods to either start running, time out, or crash, although it feels like there could be a few false positives in a job like that. What would be best practice here?
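
A minimal sketch of what that wait could look like, assuming the automation runs oc with access to each user's project and that the pods carry labels like the ones below (selectors and namespace are assumptions):

# Fail the job after a timeout instead of hanging if the pods never become Ready.
oc wait pod -l app=mariadb-pipelines-definition -n userN-project \
  --for=condition=Ready --timeout=300s
oc wait pod -l app=my-workbench -n userN-project \
  --for=condition=Ready --timeout=300s
# A non-zero exit means the pod timed out or never became Ready, which the
# automation could treat as the signal to delete and recreate the DSPA or PVC.

One caveat: oc wait may error out immediately if no pod matches the selector yet, so a short retry loop around it can help while the operator is still creating the pods.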