owncloud / ocis-charts

:chart_with_upwards_trend: Helm Charts for ownCloud's OCIS
https://owncloud.dev/ocis/deployment/kubernetes/
Apache License 2.0

HTTP Error 500 + "Could not load roles" #558

Closed · euh2 closed this 4 months ago

euh2 commented 4 months ago

I could use some pointers on how to debug this further. I'm trying to deploy this chart, but am stuck at "Internal Server Error". It seems to be connected to the storageusers driver being set to s3ng: if I set it to ocis and redeploy the chart from scratch, I can log in.

First some logs and then my config:

ingress-nginx

12.34.56.78 - - [13/May/2024:21:16:18 +0000] "GET /ocs/v1.php/cloud/user HTTP/2.0" 500 0 "https://ocis.domain.tld/web-oidc-callback?code=qgKy4tNlcX7EYY_n0NJAXT9IKvhzfYuI&scope=openid%20profile%20email&session_state=eeff02d8d4a7dbce9d755cd127eee89380992734c126ee12ca0ae2632fe6fe15.cXW_U1gVwcSYs4BQGfzw9V4IkK6YRbNh33OcWs-i838&state=5e0962bc8abd431cbd7a09b6ed7d3259" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36" 1258 0.073 [ocis-proxy-9200] [] 10.233.182.96:9200 0 0.074 500 6310a8cf-ecac-41ab-94bf-fb6e65cf7189

proxy-..-pod:

{"level":"info","service":"proxy","proto":"HTTP/1.1","request-id":"00380f3e-52b6-4dc2-a8db-7d5cb2ab88da","traceid":"503282f7958ea7331a359732389d547e","remote-addr":"12.34.56.78","method":"GET","status":200,"path":"/ocs/v1.php/cloud/capabilities","duration":3.193941,"bytes":3105,"time":"2024-05-13T21:16:18Z","line":"github.com/owncloud/ocis/v2/services/proxy/pkg/middleware/accesslog.go:34","message":"access-log"}
{"level":"info","service":"proxy","proto":"HTTP/1.1","request-id":"f9c20b29dd32806513a4b7693748717e","traceid":"67c9e2f9d84664033b44c34b48e7fcb5","remote-addr":"xxx.xxx.63.230","method":"GET","status":200,"path":"/konnect/v1/userinfo","duration":37.805854,"bytes":265,"time":"2024-05-13T21:16:18Z","line":"github.com/owncloud/ocis/v2/services/proxy/pkg/middleware/accesslog.go:34","message":"access-log"}
{"level":"error","service":"proxy","error":"{\"id\":\"com.owncloud.api.settings\",\"code\":502,\"detail\":\"circuit breaker is open\",\"status\":\"Bad Gateway\"}","time":"2024-05-13T21:16:18Z","line":"github.com/owncloud/ocis/v2/services/proxy/pkg/userroles/defaultrole.go:38","message":"Could not load roles"}
{"level":"error","service":"proxy","error":"{\"id\":\"com.owncloud.api.settings\",\"code\":502,\"detail\":\"circuit breaker is open\",\"status\":\"Bad Gateway\"}","time":"2024-05-13T21:16:18Z","line":"github.com/owncloud/ocis/v2/services/proxy/pkg/middleware/account_resolver.go:140","message":"Could not get user roles"}
{"level":"info","service":"proxy","proto":"HTTP/1.1","request-id":"6310a8cf-ecac-41ab-94bf-fb6e65cf7189","traceid":"0a15fd98788bd0f25562bf66bd5b62dd","remote-addr":"12.34.56.78","method":"GET","status":500,"path":"/ocs/v1.php/cloud/user","duration":70.088662,"bytes":0,"time":"2024-05-13T21:16:18Z","line":"github.com/owncloud/ocis/v2/services/proxy/pkg/middleware/accesslog.go:34","message":"access-log"}
{"level":"info","service":"proxy","proto":"HTTP/1.1","request-id":"64f0ace2a798d5b14ec4bfd177c2af50","traceid":"f3ba7e9868dedf663bcb42f7b76267f9","remote-addr":"12.34.56.78","method":"GET","status":200,"path":"/themes/owncloud/assets/logo.svg","duration":0.923335,"bytes":9896,"time":"2024-05-13T21:16:18Z","line":"github.com/owncloud/ocis/v2/services/proxy/pkg/middleware/accesslog.go:34","message":"access-log"}

Here it looks like a gateway error (502), which results in "Could not load roles".
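As a side note for anyone triaging a similar failure: every oCIS service logs JSON with a `level` field, so filtering on it quickly isolates the failing middleware among the noisy access logs. A minimal sketch (the sample log file is inlined here so it runs anywhere; on a real cluster you would capture the proxy pod's output first, e.g. via `kubectl logs`):

```shell
# Filter JSON logs for error-level entries. The sample file stands in for
# captured proxy logs; the lines mirror the errors quoted above.
cat > /tmp/proxy.log <<'EOF'
{"level":"info","service":"proxy","status":200,"path":"/ocs/v1.php/cloud/capabilities","message":"access-log"}
{"level":"error","service":"proxy","error":"circuit breaker is open","message":"Could not load roles"}
{"level":"info","service":"proxy","status":500,"path":"/ocs/v1.php/cloud/user","message":"access-log"}
EOF

# Print only the error-level entries.
grep '"level":"error"' /tmp/proxy.log
```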

My values.yaml:

```yaml
externalDomain: ocis.domain.tld

ingress:
  enabled: true
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: "issuer"
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    # nginx.ingress.kubernetes.io/backend-protocol: HTTP
    # nginx.ingress.kubernetes.io/default-backend: "proxy"
    nginx.ingress.kubernetes.io/custom-http-errors: "404,415,431"
    nginx.ingress.kubernetes.io/client-body-buffer-size: 512k
    nginx.ingress.kubernetes.io/client-header-buffer-size: 512k
    nginx.ingress.kubernetes.io/http2-max-header-size: 512k
    nginx.ingress.kubernetes.io/large-client-header-buffers: 4 1m
    nginx.ingress.kubernetes.io/proxy_buffers: 4 256k
    nginx.ingress.kubernetes.io/proxy-buffer-size: 128k
    nginx.ingress.kubernetes.io/proxy-busy-buffers-size: 256k
    nginx.ingress.kubernetes.io/proxy-read-timeout: "5000"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "5000"
    nginx.ingress.kubernetes.io/server-snippet: |
      client_header_buffer_size 512k;
      large_client_header_buffers 4 1m;
  tls:
    - hosts:
        - ocis.domain.tld
      secretName: ocis-tls

services:
  idm:
    persistence:
      enabled: true
      size: 1Gi
      storageClassName: hot
  nats:
    persistence:
      enabled: true
      size: 1Gi
      storageClassName: hot
  search:
    persistence:
      enabled: true
      size: 1Gi
      storageClassName: hot
  storagesystem:
    persistence:
      enabled: true
      size: 1Gi
      storageClassName: hot
      accessModes:
        - ReadWriteMany
  storageusers:
    storageBackend:
      # -- Configures the storage driver. Possible values are "ocis" and "s3ng".
      # The oCIS driver stores all data in the persistent volume if persistence is enabled.
      # The S3NG driver stores all metadata in the persistent volume and uploads blobs to s3 if persistence is enabled.
      driver: s3ng
      driverConfig:
        s3ng:
          # -- Maximum number of concurrent go-routines. Higher values can potentially
          # get work done faster but will also cause more load on the system.
          maxConcurrency: 100
          # -- Metadata backend to use for the S3NG storage driver. Valid values are: "messagepack", "xattrs".
          metadataBackend: messagepack
          # The S3NG driver needs an existing S3 bucket with following permissions:
          # {
          #   "Version": "2012-10-17",
          #   "Statement": [
          #     {
          #       "Sid": "ListBucket",
          #       "Effect": "Allow",
          #       "Action": ["s3:ListBucket"],
          #       "Resource": ["arn:aws:s3:::bucket-name"]
          #     },
          #     {
          #       "Sid": "ActionsInBucketContext",
          #       "Effect": "Allow",
          #       "Action": ["s3:*Object", "s3:*MultipartUpload", "s3:ListMultipartUploadParts"],
          #       "Resource": ["arn:aws:s3:::bucket-name/*"]
          #     }
          #   ]
          # }
          # -- S3 endpoint to use for the S3NG driver. Only used if driver is set to "s3ng".
          endpoint: https://minio.minio-hot.svc
          # -- S3 region to use for the S3NG driver. Only used if driver is set to "s3ng".
          region: default
          # -- S3 bucket to use for the S3NG driver. Only used if driver is set to "s3ng".
          bucket: ocis-storageusers
          # -- Disable sending content sha256 when copying objects to S3.
          disableContentSHA256: false
          putObject:
            # -- Disable multipart uploads when copying objects to S3.
            disableMultipart: false
            # -- Send a Content-MD5 header when copying objects to S3.
            sendContentMD5: true
            # -- Always precreate parts when copying objects to S3.
            concurrentStreamParts: true
            # -- Number of concurrent uploads to use when copying objects to S3.
            numThreads: 4
            # -- Part size for concurrent uploads to S3.
            partSize: 0
    persistence:
      enabled: true
      size: 20Gi
      storageClassName: hot
      accessModes:
        - ReadWriteMany
  thumbnails:
    persistence:
      enabled: true
      size: 10Gi
      storageClassName: hot
  web:
    persistence:
      enabled: true
      size: 1Gi
      storageClassName: hot

secretRefs:
  s3CredentialsSecretRef: hot-obst-keys
```

There are quite a few annotations. That's because I first got HTTP error 431 (request header fields too large) and managed to get past that. But now I'm stuck at 500/502, which, as mentioned, is probably connected to the storageusers driver setting.

One thing I wonder: is there an insecure setting for the s3ng driver? Maybe the 502 error comes from oCIS not being able to verify the MinIO certificate?

Any hints?
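(For anyone else debugging this: one way to rule out TLS verification as the culprit is to point the driver at MinIO's plain-HTTP service endpoint instead of HTTPS, assuming the in-cluster MinIO service also listens on HTTP. This is a sketch of the relevant values.yaml fragment for narrowing down the problem, not a production recommendation:)

```yaml
services:
  storageusers:
    storageBackend:
      driverConfig:
        s3ng:
          # Plain HTTP inside the cluster sidesteps certificate verification
          # entirely; use only to test whether TLS is the problem.
          endpoint: http://minio.minio-hot.svc
```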

wkloucek commented 4 months ago

Can you please check if you see a line like the following on the settings pods when you try to log in?

settings-576f978c9f-ld5c8 settings {"level":"error","service":"ocis","error":"error: not found: create container: error: not found: f1bdd61a-da7c-49fc-8203-0558109d1b4f!f1bdd61a-da7c-49fc-8203-0558109d1b4f/settings","time":"2024-05-16T05:50:52Z","line":"github.com/owncloud/ocis/v2/services/settings/pkg/store/metadata/store.go:70","message":"error initializing metadata client"}

euh2 commented 4 months ago

Yes, I do!

2024-05-16T07:13:01Z ERR error initializing metadata client error="internal error: error creating space" line=github.com/owncloud/ocis/v2/services/settings/pkg/store/metadata/store.go:70 service=ocis

euh2 commented 4 months ago

And after a few minutes doing nothing, these show up in the settings pod:

2024-05-16T07:18:22Z ERR error initializing metadata client error="internal error: delete:msgpack : invalid code 0 decoding map" line=github.com/owncloud/ocis/v2/services/settings/pkg/store/metadata/store.go:70 service=ocis
2024-05-16T07:18:22Z ERR error initializing metadata client error="internal error: delete:msgpack : invalid code 0 decoding map" line=github.com/owncloud/ocis/v2/services/settings/pkg/store/metadata/store.go:70 service=ocis
2024-05-16T07:18:22Z ERR error initializing metadata client error="internal error: delete:msgpack : invalid code 0 decoding map" line=github.com/owncloud/ocis/v2/services/settings/pkg/store/metadata/store.go:70 service=ocis
2024-05-16T07:19:07Z ERR error initializing metadata client error="internal error: delete:msgpack : invalid code 0 decoding map" line=github.com/owncloud/ocis/v2/services/settings/pkg/store/metadata/store.go:70 service=ocis
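For context on that message: a msgpack-encoded map must start with a map header byte (0x80–0x8f for a fixmap, or 0xde/0xdf for the larger variants), so "invalid code 0 decoding map" means the decoder hit a 0x00 byte where a map header was expected, i.e. the metadata file starts with null bytes. One hypothesis is a truncated or zero-filled file on the backing volume; a quick way to check a suspect file is to dump its first bytes (the zero-filled file below is a stand-in; on a real deployment you would point `od` at a .mpk file under the storagesystem PVC):

```shell
# A zero-filled file stands in for a corrupted metadata file.
printf '\x00\x00\x00\x00' > /tmp/suspect.mpk

# A healthy msgpack map starts with 0x8x, 0xde, or 0xdf; all-zero bytes
# reproduce the "invalid code 0 decoding map" decoder error.
od -An -tx1 /tmp/suspect.mpk
```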

wkloucek commented 4 months ago

In my case I reset the data on the storagesystem-data PVC. If you do that, you'd also need to reset the storagesystem cache. If you use an external NATS installation (see https://github.com/owncloud/ocis-charts/blob/9023c40fd1d28e00cd29819fffd82eafb43f58a2/deployments/ocis-nats/helmfile.yaml#L6-L10), you can purge that cache like this: `kubectl exec -n ocis-nats deployments/nats-box -- nats stream prune --all KV_storage-system`

euh2 commented 4 months ago

Using the deployment example with external NATS got me a bit further: now I am able to log in. But I'm having trouble listing storage spaces:

2024-05-16T13:35:18Z ERR failed to list storage spaces error="error reading user index: stat /var/lib/ocis/storage/users/indexes/by-user-id/6b28f78c-ef67-4ff9-9e2f-a1c90cc48875.mpk: no such file or directory" filters=[{"Term":{"SpaceType":"project"},"type":4},{"Term":{"User":{"idp":"https://ocis.domain.tld","opaque_id":"6b28f78c-ef67-4ff9-9e2f-a1c90cc48875"}},"type":6}] line=github.com/cs3org/reva/v2@v2.19.6/internal/grpc/services/storageprovider/storageprovider.go:575 pkg=rgrpc service=storage-users status={"code":15,"message":"error listing spaces","trace":"866fa6adf2ff392bba234d7b5429d0de"} traceid=866fa6adf2ff392bba234d7b5429d0de

When I look at the directory, it is indeed empty:

~ $ ls -la /var/lib/ocis/storage/users/indexes/by-user-id/
total 8
drwx------    2 ocis-use ocis-gro      4096 May 16 13:32 .
drwx------    5 ocis-use ocis-gro      4096 May 16 13:32 ..
~ $

In the web UI I also see no files, probably because oCIS cannot list storage spaces (screenshot omitted).

> In my case I reset the data on the storagesystem-data PVC

What do you mean by that? I'm deploying oCIS from scratch, i.e. I destroy everything and deploy again with some changed values. I shouldn't need to reset. Or do I?


Also, when I use Monitor > Trace > S3 inside MinIO, I don't see anything coming from oCIS, although the connection details are correct. I can, for example, run a curl from the storageusers pod:

~ $ curl --verbose http://minio.minio-hot.svc.cluster.local
* Host minio.minio-hot.svc.cluster.local:80 was resolved.
* IPv6: (none)
* IPv4: 10.96.121.235
*   Trying 10.96.121.235:80...
* Connected to minio.minio-hot.svc.cluster.local (10.96.121.235) port 80
> GET / HTTP/1.1
> Host: minio.minio-hot.svc.cluster.local
> User-Agent: curl/8.5.0
> Accept: */*
>
< HTTP/1.1 403 Forbidden
< Accept-Ranges: bytes
< Content-Length: 254
< Content-Type: application/xml
< Server: MinIO
< Strict-Transport-Security: max-age=31536000; includeSubDomains
< Vary: Origin
< Vary: Accept-Encoding
< X-Amz-Id-2: 0a0da9bb6b6d3c38dfeab2cc7e8e6ed0fc8ccf27dcf29c14fd96819b66189b4d
< X-Amz-Request-Id: 17CFFCA1BA8B8480
< X-Content-Type-Options: nosniff
< X-Xss-Protection: 1; mode=block
< Date: Thu, 16 May 2024 13:55:54 GMT
<
<?xml version="1.0" encoding="UTF-8"?>
* Connection #0 to host minio.minio-hot.svc.cluster.local left intact
<Error><Code>AccessDenied</Code><Message>Access Denied.</Message><Resource>/</Resource><RequestId>17CFFCA1BA8B8480</RequestId><HostId>0a0da9bb6b6d3c38dfeab2cc7e8e6ed0fc8ccf27dc
~ $

The curl above shows up in the MinIO trace, but nothing else from oCIS does. Hmmm!

euh2 commented 4 months ago

When I combine the NATS example with the S3 example, it works! I will dig a bit deeper tomorrow.

wkloucek commented 4 months ago

Switching from the built-in NATS to the external NATS removes all state (thus acting like the purge command I provided).

> In my case I reset the data on the storagesystem-data PVC
>
> What do you mean by that? I try to deploy ocis from scratch, so I destroy everything and deploy again with some changed values. I shouldn't need to reset. Or do I?

You shouldn't. But if you assume that you're deploying from scratch, you should challenge / verify whether that's really the case. E.g. you should make sure that the PersistentVolumes are gone, and if you use Minikube with the hostPath provisioner, you actually need to clean up manually (`minikube ssh 'sudo rm -rf /tmp/hostpath-provisioner/*'`).

euh2 commented 4 months ago

I think I found the root cause of these issues in my deployment. I run Longhorn, and its RWX access mode uses NFS. My StorageClass was configured with these NFS options: `nfsOptions: "vers=4.2,noresvport,softerr,timeo=600,retrans=5"`. I removed them and use the Longhorn defaults instead. That seems to work!
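For anyone hitting the same thing: in Longhorn the NFS mount options for RWX volumes come from the `nfsOptions` parameter on the StorageClass, and leaving it unset falls back to Longhorn's defaults. A sketch of what the working StorageClass looks like (the class name matches my values.yaml; the replica parameter is illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hot
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"  # illustrative, not related to the fix
  # nfsOptions: "vers=4.2,noresvport,softerr,timeo=600,retrans=5"
  # ^ removed: overriding the NFS mount options broke oCIS metadata writes;
  #   omitting nfsOptions lets Longhorn use its default mount options.
```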

Thanks @wkloucek for your assistance!