Open guimou opened 2 months ago
Workaround: /bootstrap/ic-user-projects/create-projects-and-resources.bash. You simply have to adapt it to remove the loop and set the user name.
@RHRolun Maybe worth having another version of this for single projects?
Yes, an initContainer might be appropriate.
On Fri., Apr. 19, 2024, 10:26 RHRolun wrote:
Some observations/notes:
- It's user1 that fails, so the first user that gets created, maybe something is not ready on the cluster by the time it's starting.
- It seems to be something with the MySQL server, maybe we can check that it's fully running before starting the pipeline job
I booked an env with 50 users. Only user1 fails.
In the event log:
MountVolume.MountDevice failed for volume "pvc-37a3562e-9bf7-4749-affd-3990cc7674e5" :
kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient:
driver name openshift-storage.rbd.csi.ceph.com not found in the list of registered CSI drivers
and
MountVolume.MountDevice failed for volume "pvc-37a3562e-9bf7-4749-affd-3990cc7674e5" :
kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient:
driver name openshift-storage.rbd.csi.ceph.com not found in the list of registered CSI drivers
Scaling to zero and back to 1 does not help.
Deleting and recreating the DSPA from the UI works like a charm.
Recreating the DataSciencePipelinesApplication resource also fixes the issue.
Can we loop from user N to 1 instead of 1 to N, so only the last user has a screwy mariadb?
Asking for a friend. :-)
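For reference, a minimal sketch of what the reversed order could look like. The `user_order` helper and the `userN` naming are assumptions for illustration; the real loop in create-projects-and-resources.bash may be structured differently. Note that this only moves the problem to a different user, it does not fix it.

```shell
#!/usr/bin/env bash
# Sketch only: user_order is a hypothetical helper, not part of the
# existing bootstrap script.

# user_order COUNT
# Prints userCOUNT, ..., user2, user1, one name per line.
user_order() {
  seq "$1" -1 1 | sed 's/^/user/'
}
```

The creation loop would then read the names in that order, e.g. `user_order 50 | while read -r user; do ...; done`, with the loop body being whatever the script already does per user.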
Or we block user1 after the lab is deployed...
In a new env, out of 50 users, all the mariadbs are working, but for some reason user26 has a VolumeMount issue in the notebook. Could it be that the first PVC requested from the storage class ends up failing, and all the ones after it work?
Could it be that the storage class is not properly set up yet? Although it's strange that only a single one fails in that case. Do you happen to have the logs?
Edit: this is the error:
MountVolume.MountDevice failed for volume "pvc-25d2bbe0-05f2-43a6-abdb-c28bc74fbcf4" : rpc error: code = Internal desc = rbd image ocs-storagecluster-cephblockpool/csi-vol-77863682-a89f-4708-96c0-c5f000b5dd17 is still being used
Restarting the pod did not fix the PVC issue. Removing the PVC and re-creating it fixed it. The only hint I could find is that the rbd image might be blocked, but I'm not sure how to inspect that yet. Spinning up environments to see if I can reproduce both issues.
Got some help from engineering, and by the looks of it there is database corruption, which they think comes from kube giving out dirty PVCs in this case. I also saw a couple of cases where the PVC attached to the workbench was causing issues and the workbench was not starting. The tools I added should help us with getting things set up smoothly.
To fix the automation, I could have something that waits for the mariadb and workbench pod to either start running, time out, or crash - although it feels like there can be a few false positives in a job like that. What would be best practice here?
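One possible shape for such a wait step, as a rough sketch rather than a recommendation. Everything named here is an assumption: the `wait_for_state` helper, the probe contract (a function that prints `ready`, `crashed`, or anything else for "still pending"), and the namespace and label selector in `mariadb_probe`. Keeping the three outcomes as separate return codes lets the caller decide what to do on a timeout versus a crash, which may help with the false positives.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names, the probe contract and the selector below
# are assumptions, not part of the existing bootstrap script.

# wait_for_state PROBE TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Calls PROBE repeatedly. Returns 0 when it prints "ready", 1 when it prints
# "crashed", and 2 when TIMEOUT_SECONDS elapse without either outcome.
wait_for_state() {
  local probe="$1"
  local timeout="$2"
  local interval="${3:-5}"
  local deadline=$(( $(date +%s) + timeout ))
  local state
  while true; do
    state="$("$probe")" || state=""
    case "$state" in
      ready)   return 0 ;;
      crashed) return 1 ;;
    esac
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 2
    fi
    sleep "$interval"
  done
}

# Hypothetical probe for one user's MariaDB pod. The default namespace and
# label selector are placeholders; adjust them to the actual deployment.
mariadb_probe() {
  local ns="${USER_NAMESPACE:-user1}"
  local selector="${POD_SELECTOR:-app=mariadb}"
  local ready reason
  ready="$(oc get pod -n "$ns" -l "$selector" \
    -o jsonpath='{.items[0].status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)"
  reason="$(oc get pod -n "$ns" -l "$selector" \
    -o jsonpath='{.items[0].status.containerStatuses[0].state.waiting.reason}' 2>/dev/null)"
  if [ "$ready" = "True" ]; then
    echo ready
  elif [ "$reason" = "CrashLoopBackOff" ]; then
    echo crashed
  else
    echo pending
  fi
}
```

Usage would be something like `wait_for_state mariadb_probe 300 10`, then branching on the return code: continue on 0, recreate the resource on 1, and log and move on (or retry) on 2.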
Brand new deployment. Everything is fine except for one user:
![image](https://github.com/rh-aiservices-bu/parasol-insurance/assets/3944034/0998dc8b-9f2c-44f8-860e-1428b008c280)
Error is: