microsoft / etcd3

:bookmark: Node.js client for etcd3
https://microsoft.github.io/etcd3/classes/etcd3.html

Persistent GRPC Internal Failure after 2-6 hours - How to enhance auto recovery #153

Closed pdykes closed 3 years ago

pdykes commented 3 years ago

After 2-6 hours, every CRUD call starts failing with the following error bubbling up from GRPC, and the failures persist until the instance is restarted:

"GRPCInternalError: 13 INTERNAL: Received RST_STREAM with code 2 triggered by internal client error: read ETIMEDOUT"

After experimenting and fully validating that networking is NOT the issue, I'm looking for advice on how best to configure the Node etcd3 client for long-running etcd3 transactional usage. I have 3 instances and see the issue in all of them (they are configured identically and run the same code levels, so that is not surprising).

Based on last week's testing of the latest version of this module, it appears that after 2-8 hours the client code hits some GRPC timeout, and from that point on the CRUD methods encounter this error. The watcher function also becomes unstable. A second question: if the answer is to configure grpc rather than etcd3, that would be good to know (I was wondering whether etcd3 self-tunes grpc, in which case configuring grpc separately may be a bad thing to pursue).

If I restart my Kubernetes pods, it works again. I have looked at the docs on recovery, but I am looking for a keep-alive at the grpc level, set via the etcd3 API/config, that would make the client code more resilient.

Thanks
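For the keep-alive question above, here is a minimal sketch of what grpc-level keep-alive settings could look like. The three keys are standard gRPC channel arguments. The helper name `buildKeepaliveArgs`, the interval values, and the host URL are illustrative assumptions, and passing the result through the client's `grpcOptions` option is shown only as a commented-out, unverified usage.

```typescript
// Sketch only: builds gRPC channel arguments that enable HTTP/2 keepalive pings.
// Key names are standard gRPC channel arguments; the values are illustrative.
type ChannelArgs = Record<string, number>;

// Hypothetical helper, not part of etcd3 or grpc-js.
function buildKeepaliveArgs(pingIntervalMs: number, pingTimeoutMs: number): ChannelArgs {
  return {
    // Send a keepalive ping after this much time without activity.
    "grpc.keepalive_time_ms": pingIntervalMs,
    // Treat the connection as dead if the ping is not acknowledged in time.
    "grpc.keepalive_timeout_ms": pingTimeoutMs,
    // Keep pinging even when no calls are in flight (1 = enabled).
    "grpc.keepalive_permit_without_calls": 1,
  };
}

const keepaliveArgs = buildKeepaliveArgs(30000, 10000);

// Assumed usage (host URL is a placeholder):
// const client = new Etcd3({ hosts: "http://127.0.0.1:2379", grpcOptions: keepaliveArgs });
```

The server side may reject pings that arrive too frequently, so the interval should stay comfortably above whatever minimum the etcd server is configured to accept.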

pdykes commented 3 years ago

Follow up:

I put the lease creation and lease.put in a try/catch, so I can catch the error and try again. However, the second put attempt against the lease always gets a circuit breaker exception. Any advice?
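One plausible reading of the circuit breaker exception is that the immediate second attempt arrives while the breaker is still open, so it fails fast without reaching the server. That reading is an assumption, not something confirmed in this thread. Under that assumption, a retry that waits with exponential backoff between attempts might behave better than an immediate retry. The sketch below is generic: `backoffDelayMs` and `retryWithBackoff` are hypothetical helpers, and the delays are placeholders.

```typescript
// Sketch only: retry an async operation with capped exponential backoff.
// Both helpers are hypothetical and not part of etcd3.

// Delay before the next try: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

async function retryWithBackoff<T>(
  op: () => Promise<T>,
  maxAttempts: number,
  baseMs: number,
  maxMs: number,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt; skip the wait after the final failure.
      if (attempt < maxAttempts - 1) {
        const delay = backoffDelayMs(attempt, baseMs, maxMs);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}

// Assumed usage, where `lease` is an existing etcd3 lease object:
// await retryWithBackoff(() => lease.put("key").value("value"), 5, 1000, 10000);
```

If the lease itself has been lost by the time the retry runs, retrying the put on the same lease object would presumably keep failing, so recreating the lease inside `op` may be needed.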

pdykes commented 3 years ago

FYI, this issue continues with the latest build, pulling in clean dependencies. Tracing assistance or any suggestions for the Node library would be great. Thanks.

pdykes commented 3 years ago

I wanted to update folks. I looked over the @grpc/grpc-js changes for the pure JavaScript client: they had some issues in the pre-1.35 versions and recently released a 1.36 version. I took all my builds, ensured they were on 1.36 rather than an earlier version, and restarted all testing. So far, so good: the ETIMEDOUT error described above has pretty much disappeared. I noticed they released a 1.37 over the weekend. I am behind, so I'm going to stick with 1.36 for now, but this is a heads-up for anyone following this issue. It took a painful 6 weeks to find that this seemed to fix the problem.