rh-aiservices-bu / parasol-insurance

Source for the "Parasol Insurance" Lab
https://rh-aiservices-bu.github.io/parasol-insurance/
MIT License

mariadb deployment for Pipelines may crash #67

Open guimou opened 2 months ago

guimou commented 2 months ago

Brand new deployment. Everything is fine except for one user (screenshot attached).

Error is:

...
2024-04-19 12:27:51 0 [Note] Crash recovery finished.
2024-04-19 12:27:51 0 [ERROR] Missing system table mysql.proxies_priv; please run mysql_upgrade to create it
2024-04-19 12:27:51 0 [Warning] Can't open and lock time zone table: Table 'mysql.time_zone_leap_second' doesn't exist trying to live without them
2024-04-19 12:27:51 0 [ERROR] Cannot open mysql.event
2024-04-19 12:27:51 0 [ERROR] mysqld: Event Scheduler: An error occurred when initializing system tables. Disabling the Event Scheduler.
2024-04-19 12:27:51 6 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1146: Table 'mysql.gtid_slave_pos' doesn't exist
2024-04-19 12:27:51 0 [Note] Reading of all Master_info entries succeeded
2024-04-19 12:27:51 0 [Note] Added new Master_info '' to hash table
2024-04-19 12:27:51 0 [Note] /usr/libexec/mysqld: ready for connections.
Version: '10.3.39-MariaDB'  socket: '/tmp/mysql.sock'  port: 0  MariaDB Server
2024-04-19 12:27:52 8 [Warning] Access denied for user 'root'@'localhost' (using password: NO)
---> 12:27:52     MySQL started successfully
=> sourcing 40-datadir-action.sh ...
---> 12:27:52     Running datadir action: upgrade-warn
---> 12:27:52     MySQL server version check passed, both server and data directory are version 10.3.
=> sourcing 50-passwd-change.sh ...
---> 12:27:52     Setting passwords ...
2024-04-19 12:27:52 9 [Warning] Access denied for user 'root'@'localhost' (using password: NO)
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)
---> 12:27:52     WARNING: User mlpipeline does not exist in database. Password not changed.
2024-04-19 12:27:52 10 [Warning] Access denied for user 'root'@'localhost' (using password: NO)
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)
guimou commented 2 months ago

Workaround:

RHRolun commented 2 months ago

Some observations/notes:

  • It's user1 that fails, so the first user that gets created; maybe something is not ready on the cluster by the time it's starting.
  • It seems to be something with the MySQL server; maybe we can check that it's fully running before starting the pipeline job.

guimou commented 2 months ago

Yes, an initContainer might be appropriate.
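
A minimal sketch of what such an initContainer (or the pipeline server's entrypoint) could run, assuming the MariaDB Service is reachable under a name like mariadb-pipelines and that a mysql client is available in the image (both names are assumptions, not taken from the lab):

#!/bin/bash
# Hypothetical readiness gate: block until MariaDB answers a ping,
# giving up after roughly five minutes.
for i in $(seq 1 60); do
  if mysqladmin ping -h mariadb-pipelines --silent; then
    echo "mariadb is ready"
    exit 0
  fi
  echo "waiting for mariadb ($i/60)..."
  sleep 5
done
echo "mariadb never became ready" >&2
exit 1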


erwangranger commented 2 months ago

I booked an env with 50 users. Only user1 fails.

In the event log:

MountVolume.MountDevice failed for volume "pvc-37a3562e-9bf7-4749-affd-3990cc7674e5" : 
kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: 
driver name openshift-storage.rbd.csi.ceph.com not found in the list of registered CSI drivers


Scaling to zero and back to 1 does not help.

Deleting and recreating the DSPA from the UI works like a charm.

RHRolun commented 2 months ago

Recreating the DataSciencePipelinesApplication resource also fixes the issue.
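
For reference, a hedged sketch of doing the same recreation from the CLI rather than the UI; the resource name ("dspa") and project ("userN-project") are placeholders, not values from the lab:

# Back up, delete, and re-apply the DSPA so the operator rebuilds the mariadb
# deployment (and, depending on owner references, its PVC).
oc get datasciencepipelinesapplication dspa -n userN-project -o yaml > dspa-backup.yaml
# Strip status, resourceVersion and uid from the backup before re-applying it.
oc delete datasciencepipelinesapplication dspa -n userN-project
oc apply -f dspa-backup.yaml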

erwangranger commented 2 months ago

Can we loop from user N to 1 instead of 1 to N, so only the last user has a screwy mariadb?

Asking for a friend. :-)

guimou commented 2 months ago

Or we block user1 after the lab is deployed...


erwangranger commented 2 months ago

In a new env, out of 50 users, ... all mariadbs are working, but for some reason user26 has a VolumeMount issue in the notebook. Could it be that the first PVC requested from the storage class ends up failing, and all the later ones work?

RHRolun commented 2 months ago

Could it be that the storage class is not properly set up yet? Although it's strange that only a single one fails in that case. Do you happen to have the logs?

Edit: this is the error:

MountVolume.MountDevice failed for volume "pvc-25d2bbe0-05f2-43a6-abdb-c28bc74fbcf4" : rpc error: code = Internal desc = rbd image ocs-storagecluster-cephblockpool/csi-vol-77863682-a89f-4708-96c0-c5f000b5dd17 is still being used

RHRolun commented 2 months ago

Restarting the pod did not fix the PVC issue. Removing the PVC and re-creating it fixed it. The only hint I could find is that the rbd image might be blocked, but I'm not sure how to inspect that yet. Spinning up environments to see if I can reproduce both issues.
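
One hedged way to inspect that, assuming the rook-ceph toolbox is enabled in the openshift-storage namespace (the pool and image names below are the ones from the error above):

# From the ODF/rook-ceph toolbox, list who is still holding the rbd image.
oc rsh -n openshift-storage deploy/rook-ceph-tools \
  rbd status ocs-storagecluster-cephblockpool/csi-vol-77863682-a89f-4708-96c0-c5f000b5dd17
# The "Watchers:" section of the output shows the client(s) that still have the image mapped.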

RHRolun commented 2 months ago

Got some help from engineering, and by the looks of it there is database corruption, which they think comes from kube handing out dirty PVCs in this case. I also saw a couple of cases where the PVC attached to the workbench was causing issues and preventing the workbench from starting. The tools I added should help us get things set up smoothly.

To fix the automation, I could have something that waits for the mariadb and workbench pods to either start running, time out, or crash, although it feels like there could be a few false positives in a job like that. What would be best practice here?
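
A minimal sketch of what that wait could look like, assuming the automation runs oc with access to each user's project and that the pods carry labels like the ones below (selectors and namespace are assumptions):

# Fail the job after a timeout instead of hanging if the pods never become Ready.
oc wait pod -l app=mariadb-pipelines-definition -n userN-project \
  --for=condition=Ready --timeout=300s
oc wait pod -l app=my-workbench -n userN-project \
  --for=condition=Ready --timeout=300s
# A non-zero exit means the pod timed out or never became Ready, which the
# automation could treat as the signal to delete and recreate the DSPA or PVC.

One caveat: oc wait may error out immediately if no pod matches the selector yet, so a short retry loop around it can help while the operator is still creating the pods.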