smartive / zitadel-node

ZITADEL SDK for Node.js
https://www.npmjs.com/package/@zitadel/node
Apache License 2.0
17 stars 8 forks source link

Intermittent failures causing node to panic #443

Open danielloader opened 3 months ago

danielloader commented 3 months ago

Bit of a strange one but figured I'd ask, I've got an issue where if a nodejs microservice is idle for some time (not known how long yet for a pattern) and then receives a request which requires a hit on the /zitadel.management.v1.ManagementService/GetMyOrg method, there's a high chance it'll crash the nodejs process with an uncaught exception.

If you need any more information I'll try to provide it for you.

I've enabled the following environment variables for information:

GRPC_TRACE=all
GRPC_VERBOSITY=DEBUG

Debug level log for the failed call:

D 2024-08-21T12:49:03.834Z | v1.10.9 28 | channel | (1) dns:login.example.io:443 createResolvingCall [4600] method="/zitadel.management.v1.ManagementService/GetMyOrg", deadline=Infinity
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | resolving_call | [4600] Created
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | resolving_call | [4600] Deadline: Infinity
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | resolving_call | [4600] start called
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | resolving_call | [4600] startRead called
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | resolving_call | [4600] write() called with message of length 0
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | resolving_call | [4600] halfClose called
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | channel | (1) dns:login.quotech.io:443 createRetryingCall [4601] method="/zitadel.management.v1.ManagementService/GetMyOrg"
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | resolving_call | [4600] Created child [4601]
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | retrying_call | [4601] start called
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | channel | (1) dns:login.example.io:443 createLoadBalancingCall [4602] method="/zitadel.management.v1.ManagementService/GetMyOrg"
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | retrying_call | [4601] Created child call [4602] for attempt 1
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | load_balancing_call | [4602] start called
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | load_balancing_call | [4602] Pick called
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | load_balancing_call | [4602] Pick result: COMPLETE subchannel: (8) 104.21.88.227:443 status: undefined undefined
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | retrying_call | [4601] startRead called
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | load_balancing_call | [4602] startRead called
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | transport_flowctrl | (3) 104.21.88.227:443 local window size: 61879 remote window size: 2147483407
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | transport_internals | (3) 104.21.88.227:443 session.closed=false session.destroyed=false session.socket.destroyed=false
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | load_balancing_call | [4602] Created child call [4603]
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | retrying_call | [4601] write() called with message of length 5
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | load_balancing_call | [4602] write() called with message of length 5
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | subchannel_call | [4603] write() called with message of length 5
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | subchannel_call | [4603] sending data chunk of length 5
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | retrying_call | [4601] halfClose called
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | load_balancing_call | [4602] halfClose called
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | subchannel_call | [4603] end() called
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | subchannel_call | [4603] calling end() on HTTP/2 stream
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | subchannel_call | [4603] HTTP/2 stream closed with code 8
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | subchannel_call | [4603] ended with status: code=1 details="Call cancelled"
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | load_balancing_call | [4602] Received status
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | load_balancing_call | [4602] ended with status: code=1 details="Call cancelled" start time=2024-08-21T12:49:03.834Z
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | retrying_call | [4601] Received status from child [4602]
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | retrying_call | [4601] state=TRANSPARENT_ONLY handling status with progress PROCESSED from child [4602] in state ACTIVE
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | retrying_call | [4601] ended with status: code=1 details="Call cancelled" start time=2024-08-21T12:49:03.834Z
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | resolving_call | [4600] Received status
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | resolving_call | [4600] ended with status: code=1 details="Call cancelled"
node:internal/process/promises:391
triggerUncaughtException(err, true /* fromPromise */);
^
ClientError: /zitadel.management.v1.ManagementService/GetMyOrg CANCELLED: Call cancelled
at wrapClientError (/app/node_modules/nice-grpc/lib/client/wrapClientError.js:9:16)
at Object.callback (/app/node_modules/nice-grpc/lib/client/createUnaryMethod.js:27:66)
at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client.js:193:36)
at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)
at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)
at /app/node_modules/@grpc/grpc-js/build/src/resolving-call.js:129:78
at process.processTicksAndRejections (node:internal/process/task_queues:77:11) {
path: '/zitadel.management.v1.ManagementService/GetMyOrg',
code: 1,
details: 'Call cancelled'
}
Node.js v22.6.0
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

Debug level log for a successful call of the same endpoint (after hitting refresh a few times after the container is live again):

D 2024-08-21T12:53:49.368Z | v1.10.9 28 | channel | (1) dns:login.example.io:443 createResolvingCall [12] method="/zitadel.management.v1.ManagementService/ListUserMemberships", deadline=Infinity
D 2024-08-21T12:53:49.368Z | v1.10.9 28 | resolving_call | [12] Created
D 2024-08-21T12:53:49.368Z | v1.10.9 28 | resolving_call | [12] Deadline: Infinity
D 2024-08-21T12:53:49.368Z | v1.10.9 28 | resolving_call | [12] start called
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | resolving_call | [12] startRead called
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | resolving_call | [12] write() called with message of length 20
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | resolving_call | [12] halfClose called
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | channel | (1) dns:login.example.io:443 createRetryingCall [13] method="/zitadel.management.v1.ManagementService/ListUserMemberships"
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | resolving_call | [12] Created child [13]
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | retrying_call | [13] start called
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | channel | (1) dns:login.example.io:443 createLoadBalancingCall [14] method="/zitadel.management.v1.ManagementService/ListUserMemberships"
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | retrying_call | [13] Created child call [14] for attempt 1
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | load_balancing_call | [14] start called
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | load_balancing_call | [14] Pick called
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | load_balancing_call | [14] Pick result: COMPLETE subchannel: (1) 104.21.88.227:443 status: undefined undefined
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | retrying_call | [13] startRead called
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | load_balancing_call | [14] startRead called
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | transport_flowctrl | (1) 104.21.88.227:443 local window size: 64976 remote window size: 2147483612
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | transport_internals | (1) 104.21.88.227:443 session.closed=false session.destroyed=false session.socket.destroyed=false
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | load_balancing_call | [14] Created child call [15]
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | retrying_call | [13] write() called with message of length 25
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | load_balancing_call | [14] write() called with message of length 25
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | subchannel_call | [15] write() called with message of length 25
D 2024-08-21T12:53:49.370Z | v1.10.9 28 | subchannel_call | [15] sending data chunk of length 25
D 2024-08-21T12:53:49.370Z | v1.10.9 28 | retrying_call | [13] halfClose called
D 2024-08-21T12:53:49.370Z | v1.10.9 28 | load_balancing_call | [14] halfClose called
D 2024-08-21T12:53:49.370Z | v1.10.9 28 | subchannel_call | [15] end() called
D 2024-08-21T12:53:49.370Z | v1.10.9 28 | subchannel_call | [15] calling end() on HTTP/2 stream
D 2024-08-21T12:53:49.416Z | v1.10.9 28 | subchannel_call | [15] Received server headers:
:status: 200
date: Wed, 21 Aug 2024 12:53:49 GMT
content-type: application/grpc
cache-control: no-store
expires: Wed, 21 Aug 2024 12:53:49 GMT
grpc-metadata-cache-control: no-store
grpc-metadata-expires: Wed, 21 Aug 2024 12:53:49 GMT
grpc-metadata-pragma: no-cache
pragma: no-cache
trailer: Grpc-Status, Grpc-Message, Grpc-Status-Details-Bin
x-robots-tag: none
traceparent: 00-18f66a253324a3a3908ec887209acdf3-83531992162624ce-00
x-cloud-trace-context: 18f66a253324a3a3908ec887209acdf3/9462935357263062222
via: 1.1 google
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-cache-hit: uncacheable
alt-svc: h3=":443"; ma=86400
cf-cache-status: DYNAMIC
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v4?s=qCHEhO4Kiu8j1%2Ff7KzH4dNSa9W2ZC3cy3Voh5Ii%2BKyYQu4irGrfEnxa9RvQ4QgrEr34Du7W8Ql3RcaPgZl4ewrl%2FzUuNbq%2FlMVWmxR7jFvTtP6Aa7CKKIED5HKGZVyf45WzgG0CiLw%3D%3D"}],"group":"cf-nel","max_age":604800}
nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
expect-ct: max-age=86400, enforce
referrer-policy: same-origin
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-xss-protection: 1; mode=block
server: cloudflare
cf-ray: 8b6ac4a78b5c6545-LHR
D 2024-08-21T12:53:49.416Z | v1.10.9 28 | load_balancing_call | [14] Received metadata
D 2024-08-21T12:53:49.416Z | v1.10.9 28 | retrying_call | [13] Received metadata from child [14]
D 2024-08-21T12:53:49.416Z | v1.10.9 28 | retrying_call | [13] Committing call [14] at index 0
D 2024-08-21T12:53:49.416Z | v1.10.9 28 | resolving_call | [12] Received metadata
D 2024-08-21T12:53:49.474Z | v1.10.9 28 | subchannel_call | [15] receive HTTP/2 data frame of length 355
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | subchannel_call | [15] parsed message of length 355
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | subchannel_call | [15] pushing to reader message of length 355
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | load_balancing_call | [14] Received message
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | retrying_call | [13] Received message from child [14]
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | resolving_call | [12] Received message
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | resolving_call | [12] Finished filtering received message
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | subchannel_call | [15] Received server trailers:
grpc-status: 0
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | subchannel_call | [15] received status code 0 from server
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | subchannel_call | [15] ended with status: code=0 details=""
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | load_balancing_call | [14] Received status
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | load_balancing_call | [14] ended with status: code=0 details="" start time=2024-08-21T12:53:49.369Z
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | retrying_call | [13] Received status from child [14]
D 2024-08-21T12:53:49.476Z | v1.10.9 28 | retrying_call | [13] state=COMMITTED handling status with progress PROCESSED from child [14] in state ACTIVE
D 2024-08-21T12:53:49.476Z | v1.10.9 28 | retrying_call | [13] ended with status: code=0 details="" start time=2024-08-21T12:53:49.369Z
D 2024-08-21T12:53:49.476Z | v1.10.9 28 | resolving_call | [12] Received status
D 2024-08-21T12:53:49.476Z | v1.10.9 28 | resolving_call | [12] ended with status: code=0 details=""
D 2024-08-21T12:53:49.476Z | v1.10.9 28 | subchannel_call | [15] HTTP/2 stream closed with code 0
{"level":30,"time":1724244829480,"pid":28,"hostname":"risks-586d7dd84b-x9lhh","req":{"id":"afbc2f33-d4ff-47ac-989b-8e1c8a4357f0","method":"GET","url":"/insurance-products"},"res":{"statusCode":200},"responseTime":4,"msg":"request completed"}
D 2024-08-21T12:53:58.862Z | v1.10.9 28 | subchannel_refcount | (2) [2606:4700:3031::6815:58e3]:443 refcount 1 -> 0
D 2024-08-21T12:53:58.862Z | v1.10.9 28 | subchannel_refcount | (3) 172.67.153.180:443 refcount 1 -> 0
D 2024-08-21T12:53:58.862Z | v1.10.9 28 | subchannel_refcount | (4) [2606:4700:3033::ac43:99b4]:443 refcount 1 -> 0
danielloader commented 3 months ago

I wondered if this is related: https://github.com/deeplay-io/nice-grpc/pull/609

Though the package.json in this repo is already utilising nice-grpc@2.1.9 already.

davidfiala commented 3 months ago

EDIT: looks like package-lock.json has grpc-js 1.10.8 at current head: https://github.com/smartive/zitadel-node/blob/cd68a02fa7661d30e4637ef7559d7b0378fdb9c6/package-lock.json#L175

My notes indicated some 1.10 versions may cause issues. https://github.com/deeplay-io/nice-grpc/pull/609#issuecomment-2151244474 though v1.10.8 was not listed at the time. I've forgotten why.

You could try bumping grpc-js to latest version and see what happens. But it may be a deeper issue, since I think I had a reason for listing only those bad 2 versions at the time and not a wider range.

danielloader commented 3 months ago

D 2024-08-21T20:36:06.867Z | v1.10.9 28 | load_balancing_call | [22] Received metadata

Unfortunately as evidenced by this log, adding my own dependency hasn't changed the fact the zitadel-node package is using or wants to use 1.10.9.

Just ended up with both in the yarn.lock:


"@grpc/grpc-js@1.11.1":
  version "1.11.1"
  resolved "https://registry.yarnpkg.com/@grpc/grpc-js/-/grpc-js-1.11.1.tgz#a92f33e98f1959feffcd1b25a33b113d2c977b70"
  integrity sha512-gyt/WayZrVPH2w/UTLansS7F9Nwld472JxxaETamrM8HNlsa+jSLNyKAZmhxI2Me4c3mQHFiS1wWHDY1g1Kthw==
  dependencies:
    "@grpc/proto-loader" "^0.7.13"
    "@js-sdsl/ordered-map" "^4.4.2"

"@grpc/grpc-js@^1.10.8":
  version "1.10.9"
  resolved "https://registry.yarnpkg.com/@grpc/grpc-js/-/grpc-js-1.10.9.tgz#468cc1549a3fe37b760a16745fb7685d91f4f10c"
  integrity sha512-5tcgUctCG0qoNyfChZifz2tJqbRbXVO9J7X6duFcOjY3HUNCxg5D0ZCK7EP9vIcZ0zRpLU9bWkyCqVCLZ46IbQ==
  dependencies:
    "@grpc/proto-loader" "^0.7.13"
    "@js-sdsl/ordered-map" "^4.4.2"

Not sure how to override this to force it to use 1.11.1.

Edit: Have since learned about resolutions, added this to package.json, trying again:

  "resolutions": {
      "@grpc/grpc-js": "1.11.1"
  }

Edit2: Come back with bad news:

D 2024-08-21T20:52:05.326Z | v1.11.1 28 | retrying_call | [17] halfClose called
D 2024-08-21T20:52:05.327Z | v1.11.1 28 | load_balancing_call | [18] halfClose called
D 2024-08-21T20:52:05.327Z | v1.11.1 28 | subchannel_call | [19] end() called
D 2024-08-21T20:52:05.327Z | v1.11.1 28 | subchannel_call | [19] calling end() on HTTP/2 stream
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | subchannel_call | [19] HTTP/2 stream closed with code 8
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | subchannel_call | [19] ended with status: code=1 details="Call cancelled"
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | load_balancing_call | [18] Received status
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | load_balancing_call | [18] ended with status: code=1 details="Call cancelled" start time=2024-08-21T20:52:05.326Z
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | retrying_call | [17] Received status from child [18]
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | retrying_call | [17] state=TRANSPARENT_ONLY handling status with progress PROCESSED from child [18] in state ACTIVE
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | retrying_call | [17] ended with status: code=1 details="Call cancelled" start time=2024-08-21T20:52:05.326Z
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | resolving_call | [16] Received status
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | resolving_call | [16] ended with status: code=1 details="Call cancelled"
node:internal/process/promises:391
    triggerUncaughtException(err, true /* fromPromise */);
    ^
ClientError: /zitadel.management.v1.ManagementService/GetMyOrg CANCELLED: Call cancelled
    at wrapClientError (/app/node_modules/nice-grpc/lib/client/wrapClientError.js:9:16)
    at Object.callback (/app/node_modules/nice-grpc/lib/client/createUnaryMethod.js:27:66)
    at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client.js:193:36)
    at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)
    at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)
    at /app/node_modules/@grpc/grpc-js/build/src/resolving-call.js:129:78
    at process.processTicksAndRejections (node:internal/process/task_queues:77:11) {
  path: '/zitadel.management.v1.ManagementService/GetMyOrg',
  code: 1,
  details: 'Call cancelled'
}
Node.js v22.6.0
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

same issue and the log is nice enough to show it's the newer version.

danielloader commented 3 months ago

We've ended up wrapping the calls this library makes in our code with a try/catch and returning a 503 if there's an exception raised so the client can decide to retry the request.

If retry friendly exceptions were bubbled up from nice grpc it would be nice if this library had its own retry mechanism to attempt to smooth out transient issues but as this mitigates our customer affecting issue well enough I'm happy to just leave this issue up for future comments or closure.