kubeflow / kubeflow

Machine Learning Toolkit for Kubernetes
https://www.kubeflow.org/
Apache License 2.0

Authservice pod "Failed to save state in store: error trying to save session: input/output error" #7042

Closed psheorangithub closed 5 months ago

psheorangithub commented 1 year ago

/kind bug

We have integrated Kubeflow with an OIDC flow (Heracles + LDAP). We are unable to log in to the Kubeflow UI; the GUI shows the error below.

Access to kubeflow.aiwb-enc-data-cpu1.uscentral-prd-az3.k8s.int was denied
You don't have authorization to view this page.
HTTP ERROR 403

While checking the authservice pod logs, I see the error below. It happens every couple of days.

2023/03/15 14:08:40 boltstore: remove expired sessions error: input/output error
time="2023-03-15T14:06:29Z" level=error msg="Failed to save state in store: error trying to save session: input/output error" ip= request=/

The issue resolves after restarting the authservice pod, but it reappears every 10-15 days. We have checked the status of the underlying PVC, and it looks healthy.

Can someone look into this and suggest what the cause could be?

Environment:

MiraiChino commented 1 year ago

We are experiencing the same issue with our Kubeflow deployment that uses Dex and LDAP for authentication. Upon attempting to access the Kubeflow URL, we face a recurring 403 error. The Kubeflow environment has 6 users, and they encounter this issue on the Kubeflow login page approximately every 4 to 5 days.

The following error is observed in the logs of the authservice-0 pod within the istio-system namespace:

# kubectl logs authservice-0 -n istio-system
...
{"log":"time="..." level=error msg="Failed to save state in store: error trying to save session: input/output error" ip=... request=/jupyter/api/namespaces/.../notebooks\n","stream":"stderr","time":"..."}

We have inspected the resource usage of the authservice-0 pod and found no anomalies. CPU and memory consumption appear to be normal:

# kubectl top pod authservice-0 -n istio-system
NAME            CPU(cores)   MEMORY(bytes)   
authservice-0   1m           8Mi     

We have also checked the status of the persistent volume claim (PVC) associated with the authservice-0 pod and found that it contains the expected data:

# ls -lh /export/kubernetes/istio-system-authservice-pvc-pvc-4a8228e9-2b6d-456c-9a7a-f9fbf8d9a209/
total 112K
-rw-r--r-- 1 systemd-network tss 128K Jul 19 09:06 data.db

As a temporary workaround, we have tried deleting the authservice-0 pod, and it automatically gets recreated. After this, the Kubeflow Dex login page is accessible again.
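
The workaround above could be scripted. Below is a minimal sketch in Python, assuming the error can be recognized by the message text quoted in the logs above. The function names `needs_restart` and `restart_command` are hypothetical and are not part of oidc-authservice or Kubeflow; the pod and namespace defaults are the ones used in this comment. The sketch only builds the `kubectl delete pod` command and does not run it.

```python
from typing import List

# Error text copied from the authservice logs quoted in this thread.
ERROR_SIGNATURE = "error trying to save session: input/output error"


def needs_restart(log_text: str) -> bool:
    """Return True if any log line contains the session-store I/O error."""
    return any(ERROR_SIGNATURE in line for line in log_text.splitlines())


def restart_command(pod: str = "authservice-0",
                    namespace: str = "istio-system") -> List[str]:
    """Build (but do not execute) the kubectl command that deletes the pod.

    Assumption: the pod is recreated automatically after deletion,
    as observed in this comment.
    """
    return ["kubectl", "delete", "pod", pod, "-n", namespace]
```

The log text would come from something like `kubectl logs authservice-0 -n istio-system`, and the returned list could be passed to `subprocess.run`. This only automates the temporary workaround; it does not address the underlying input/output error.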

Environment:

Any insights or suggestions to resolve this persistent 403 error on the Kubeflow login page would be highly appreciated.

MiraiChino commented 1 year ago

Additional information: After setting the log level of oidc-authservice to DEBUG, I rechecked the logs when the error occurred again. The error appears to be related to boltstore/reaper, which is responsible for releasing unnecessary resources (removing expired sessions), rather than to the use of boltdb for session management itself.

2023/08/01 03:22:10 boltstore: remove expired sessions error: input/output error
time="2023-08-01T03:22:57Z" level=warning msg="Request doesn't have a valid session." ip=192.168.200.15 request=/logout
time="2023-08-01T03:22:57Z" level=error msg="Failed to save state in store: error trying to save session: input/output error" ip=192.168.200.15 request=/
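
To see how often each kind of error shows up, the log lines can be grouped by their message text. This is a rough sketch in Python, assuming the two message prefixes quoted above are stable; `classify` and `summarize` are hypothetical helper names, not part of oidc-authservice.

```python
from collections import Counter

# Message fragments copied from the log lines quoted in this thread.
REAPER_MARKER = "boltstore: remove expired sessions error"
SAVE_STATE_MARKER = "Failed to save state in store"


def classify(line: str) -> str:
    """Label a log line as a reaper error, a save-state error, or other."""
    if REAPER_MARKER in line:
        return "reaper"
    if SAVE_STATE_MARKER in line:
        return "save_state"
    return "other"


def summarize(log_text: str) -> Counter:
    """Count non-empty log lines per category."""
    return Counter(classify(line)
                   for line in log_text.splitlines()
                   if line.strip())
```

Applied to the three lines above, this counts one reaper error, one save-state error, and one other line (the "valid session" warning).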

ben-omji commented 1 year ago

I have the same issue. It first occurred 70 days after the oidc-authservice pod was created.

juliusvonkohout commented 5 months ago

/close

oidc-authservice is deprecated and this discussion belongs to kubeflow/manifests

google-oss-prow[bot] commented 5 months ago

@juliusvonkohout: Closing this issue.

In response to [this](https://github.com/kubeflow/kubeflow/issues/7042#issuecomment-2129424021):

>/close
>
>oidc-authservice is deprecated and this discussion belongs to kubeflow/manifests

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.