rippleFCL / bws-cache

Bitwarden Secrets Manager cache server

BWS service hangs every so often #40

Open · opened 3 months ago by joryirving

joryirving commented 3 months ago

I run the bws-cache on an external machine via docker, and every so often, it stops responding, giving timeout errors. The logs don't seem to indicate any issues in the container itself, but restarting the container resolves the issue.

If you let me know how to enable debug or more verbose logging, I can try to provide more information.

joryirving commented 3 months ago

```
2024-08-05 07:42:37,655 - client - DEBUG - cache miss for secret 1f312b3d-a2ce-463d-a32a-b11300f48c3d
DEBUG:client:cache miss for secret 1f312b3d-a2ce-463d-a32a-b11300f48c3d
172.17.0.1 - - [05/Aug/2024 07:42:37] "GET /key/volsync-r2-template HTTP/1.1" 200 -
INFO:werkzeug:172.17.0.1 - - [05/Aug/2024 07:42:37] "GET /key/volsync-r2-template HTTP/1.1" 200 -
2024-08-05 07:42:38,641 - client - DEBUG - authenticating client
DEBUG:client:authenticating client
2024-08-05 07:42:38,644 - client - DEBUG - cache hit for secret db34fee6-832c-45cd-9425-b1120155bbef
DEBUG:client:cache hit for secret db34fee6-832c-45cd-9425-b1120155bbef
172.17.0.1 - - [05/Aug/2024 07:42:38] "GET /key/volsync-bucket HTTP/1.1" 200 -
INFO:werkzeug:172.17.0.1 - - [05/Aug/2024 07:42:38] "GET /key/volsync-bucket HTTP/1.1" 200 -
2024-08-05 07:42:38,652 - client - DEBUG - authenticating client
DEBUG:client:authenticating client
2024-08-05 07:42:38,654 - client - DEBUG - cache hit for secret 4cc440b6-9098-48be-ad27-b111012ab903
DEBUG:client:cache hit for secret 4cc440b6-9098-48be-ad27-b111012ab903
172.17.0.1 - - [05/Aug/2024 07:42:38] "GET /key/volsync-minio-template HTTP/1.1" 200 -
INFO:werkzeug:172.17.0.1 - - [05/Aug/2024 07:42:38] "GET /key/volsync-minio-template HTTP/1.1" 200 -
2024-08-05 07:42:39,844 - client - DEBUG - authenticating client
DEBUG:client:authenticating client
2024-08-05 07:42:39,846 - client - DEBUG - cache hit for secret 63e5956e-f276-4a25-a5c0-b11300f4075c
DEBUG:client:cache hit for secret 63e5956e-f276-4a25-a5c0-b11300f4075c
172.17.0.1 - - [05/Aug/2024 07:42:39] "GET /key/cloudflare HTTP/1.1" 200 -
INFO:werkzeug:172.17.0.1 - - [05/Aug/2024 07:42:39] "GET /key/cloudflare HTTP/1.1" 200 -
2024-08-05 07:42:39,855 - client - DEBUG - authenticating client
DEBUG:client:authenticating client
2024-08-05 07:42:39,857 - client - DEBUG - cache hit for secret 1f312b3d-a2ce-463d-a32a-b11300f48c3d
DEBUG:client:cache hit for secret 1f312b3d-a2ce-463d-a32a-b11300f48c3d
172.17.0.1 - - [05/Aug/2024 07:42:39] "GET /key/volsync-r2-template HTTP/1.1" 200 -
INFO:werkzeug:172.17.0.1 - - [05/Aug/2024 07:42:39] "GET /key/volsync-r2-template HTTP/1.1" 200 -
2024-08-05 07:42:41,710 - client - DEBUG - authenticating client
DEBUG:client:authenticating client
2024-08-05 07:42:41,712 - client - DEBUG - cache miss for secret 0637e5ef-bf85-4d7c-bdac-b1130181c863
DEBUG:client:cache miss for secret 0637e5ef-bf85-4d7c-bdac-b1130181c863
172.17.0.1 - - [05/Aug/2024 07:42:41] "GET /key/authentik HTTP/1.1" 200 -
INFO:werkzeug:172.17.0.1 - - [05/Aug/2024 07:42:41] "GET /key/authentik HTTP/1.1" 200 -
2024-08-05 07:42:43,943 - client - DEBUG - authenticating client
DEBUG:client:authenticating client
2024-08-05 07:42:43,945 - client - DEBUG - cache miss for secret 691981fb-12af-4838-be77-b113016bc6cc
DEBUG:client:cache miss for secret 691981fb-12af-4838-be77-b113016bc6cc
```
tigattack commented 3 months ago

@joryirving thanks for the issue, and for enabling more verbose logging.

Could you provide some more context to go with your update? Clearly we could do with improving the logging, but without further context we don't have much to work with.

Examples of some useful information to go along with the log output you've shown here would be:

joryirving commented 3 months ago

Sorry, @rippleFCL asked me on Discord a few days ago to open an issue. I turned on debugging at the time, and this is the first time the issue has recurred.

1) No hanging.
2) I have it running as a Docker container on my NAS, for 2 k8s clusters to access via External Secrets.
3) Eventually, External Secrets will be unable to sync due to timeouts. My ClusterSecretStore still shows valid. This happens regardless of where the container is running (Docker on my NAS, or as a k8s pod in-cluster).
4) My fix is restarting the container.
5) I cannot reliably reproduce this.

rippleFCL commented 3 months ago

I shall improve the logging on bws-cache; there aren't enough details to fully know what's going on. Can you share the timeouts/error codes from External Secrets?

We also need to figure out whether it truly is bws-cache hanging or whether it's External Secrets breaking. When it hangs, could you run the curl commands in the README and include the output (censoring any sensitive info if it returns a secret, of course)?

Mainly I want to see the query commands first, then invalidate the cache, then rerun the query commands. This should give us some important info too.
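Roughly the sequence I have in mind, sketched with placeholder values (the `/reset` invalidation path, the bearer-token header, and the hostname are assumptions here; the README documents the exact endpoints and auth):

```shell
# 1. Query a secret by key while the server appears hung; --max-time keeps curl
#    from hanging forever so we can see whether it times out or returns.
curl -v --max-time 10 \
  -H "Authorization: Bearer $BWS_ACCESS_TOKEN" \
  https://bws-cache.domain.tld/key/volsync-bucket

# 2. Invalidate the cache (endpoint path assumed; check the README for the real one).
curl -v --max-time 10 \
  -H "Authorization: Bearer $BWS_ACCESS_TOKEN" \
  https://bws-cache.domain.tld/reset

# 3. Re-run the query to see whether a fresh, uncached lookup also hangs.
curl -v --max-time 10 \
  -H "Authorization: Bearer $BWS_ACCESS_TOKEN" \
  https://bws-cache.domain.tld/key/volsync-bucket
```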

bluevulpine commented 2 months ago

I've been seeing this too - worked around it with a cronjob that restarts the BWS deployment occasionally and haven't given it much more thought. I'll disable that and attempt to gather diagnostics the next time it starts failing secret lookups.

I'm in a similar boat to joryirving - it works until it doesn't; the precise moment that "until it doesn't" occurs seems somewhat unpredictable, but usually for me takes at least a day or two of uptime; once it starts happening, it's a problem until the BWS-cache container is restarted.

My environment is a k8s environment on talos, managed by fluxcd. [helmrelease link]
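In case it's useful to anyone else, a minimal sketch of that kind of restart stopgap (the namespace, deployment name, and schedule below are assumptions; adjust for your release):

```shell
# Stopgap only: restart the bws-cache deployment on a schedule, e.g. from a host
# crontab with kubectl access (or a Kubernetes CronJob with a service account
# allowed to patch deployments):
#   0 4 * * * kubectl --namespace bws-cache rollout restart deployment/bws-cache
kubectl --namespace bws-cache rollout restart deployment/bws-cache
```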

joryirving commented 1 month ago

```
❯ k describe externalsecret external-dns-unifi-secret -n network --context utility
Name:         external-dns-unifi-secret
Namespace:    network
Labels:       app.kubernetes.io/name=external-dns-unifi
              kustomize.toolkit.fluxcd.io/name=external-dns-unifi
              kustomize.toolkit.fluxcd.io/namespace=flux-system
Annotations:  <none>
API Version:  external-secrets.io/v1beta1
Kind:         ExternalSecret
Metadata:
  Creation Timestamp:  2024-09-09T20:47:17Z
  Generation:          1
  Resource Version:    8454
  UID:                 6ea38394-fc26-4d4a-ab75-31e26b7d0a18
Spec:
  Data From:
    Extract:
      Conversion Strategy:  Default
      Decoding Strategy:    None
      Key:                  unifi
      Metadata Policy:      None
  Refresh Interval:         15m
  Secret Store Ref:
    Kind:  ClusterSecretStore
    Name:  bitwarden-secrets-manager
  Target:
    Creation Policy:  Owner
    Deletion Policy:  Retain
    Name:             external-dns-unifi-secret
    Template:
      Data:
        EXTERNAL_DNS_UNIFI_PASS:  {{ .EXTERNAL_DNS_UNIFI_PASS }}
        EXTERNAL_DNS_UNIFI_USER:  {{ .EXTERNAL_DNS_UNIFI_USER }}
      Engine Version:             v2
      Merge Policy:               Replace
Status:
  Conditions:
    Last Transition Time:  2024-09-09T20:48:47Z
    Message:               could not get secret data from provider
    Reason:                SecretSyncedError
    Status:                False
    Type:                  Ready
Events:
  Type     Reason        Age   From              Message
  ----     ------        ----  ----              -------
  Warning  UpdateFailed  106s  external-secrets  endpoint gave error 504 Gateway Time-out
```

I tried to curl, but it just hangs. Based on the pod logs, the container "froze": the last entry in the logs was a cache miss for a secret... What's interesting is that the ClusterSecretStore still appears valid, but all the ExternalSecrets are in SecretSyncedError.

```
❯ curl https://bws-cache.domain.tld/key/<key>
<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>openresty</center>
</body>
</html>
```

Once again, restarting the container resolved the issue. The last log entries were ~4 days old, but I haven't changed any values, so I didn't notice until now.
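Since that 504 is served by openresty (the ingress), it doesn't by itself show where the hang is; one way to check next time is to hit the pod directly, bypassing the ingress. A rough sketch, with the namespace, deployment name, listen port, and auth header as assumptions:

```shell
# Bypass the ingress by port-forwarding straight to the bws-cache deployment.
# Namespace, deployment name, and port are assumptions; adjust to your setup.
kubectl --namespace bws-cache port-forward deploy/bws-cache 5000:5000 &

# If this also times out, bws-cache itself is hung rather than the ingress.
curl -v --max-time 10 \
  -H "Authorization: Bearer $BWS_ACCESS_TOKEN" \
  http://localhost:5000/key/<key>
```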

rippleFCL commented 8 hours ago

Right, sorry for the delay. I've done some digging and it seems to be in the auth method; we're doing a restructure for the BWS SDK 1.0.0 release. Once I've finished reworking that and pushed a new version of bws-cache, it might be worth trying it out and seeing if the hanging problem is still there.