googleapis / nodejs-pubsub

Node.js client for Google Cloud Pub/Sub: Ingest event streams from anywhere, at any scale, for simple, reliable, real-time stream analytics.
https://cloud.google.com/pubsub/
Apache License 2.0
518 stars 230 forks source link

Getting ENODATA and EADDRNOTAVAIL from PubSub after sending events #1843

Closed LukasKnuth closed 10 months ago

LukasKnuth commented 10 months ago

Environment details

Steps to reproduce

  1. Send an Event through Topic.publishMessage({json: event})
  2. The message is sent as expected
  3. We get the errors above in DataDog traces

Additional information

The Trace

  1. The trace shows that the code path to publish the event sends the event.
  2. We see pubsub.request for the correct/expected topic
  3. Next is grpc.client to the /google.pubsub.v1.Publisher/Publish path
  4. Within this call comes the first error in dns.resolve
Error: queryTxt ENODATA pubsub.googleapis.com
    at QueryReqWrap.onresolve [as oncomplete] (node:internal/dns/callback_resolver:47:19)
    at QueryReqWrap.callbackTrampoline (node:internal/async_hooks:130:17)
  1. On that same code path we get the next error in tcp.connect
Error: connect EADDRNOTAVAIL 2a00:1450:4001:80f::200a:443 - Local (:::0)
    at internalConnect (node:net:1100:16)
    at defaultTriggerAsyncIdScope (node:internal/async_hooks:463:18)
    at node:net:1306:9
    at process.processTicksAndRejections (node:internal/process/task_queues:77:11)
  1. Right after the connection attempt to the IPv6 address, I see a successful attempt to 142.250.74.202:443
  2. I presume that path is asynchronous because on the same path we get repeated tcp.connect attempts around every four minutes. They all try to connect to an IPv6 address I don't recognize (like in the above error).

Here are some screenshots of the flame graph:

image

Zoomed out to max to see the repeated connection attempts

image

This is spamming our logs and causing false positives in the error reporting. The sending of events works as expected though. Would love to sort this out.

John98Zakaria commented 10 months ago

We are having the same issue as well (but with subscriptions) we had http2 debugging and grpc tracing on at the time of the incident.

In our case no more messages are being pulled. Acks and nacks aren't delivered (we are not sure about that) :/


ERROR 2023-10-20T20:05:48.449431624Z D 2023-10-20T20:05:48.449Z | subchannel | (85) 2a00:1450:4001:80e::200a:443 TRANSIENT_FAILURE -> CONNECTING
ERROR 2023-10-20T20:05:48.450167540Z D 2023-10-20T20:05:48.450Z | subchannel | (85) 2a00:1450:4001:80e::200a:443 creating HTTP/2 session
ERROR 2023-10-20T20:05:48.451373185Z HTTP2 1: Http2Session client: created
ERROR 2023-10-20T20:05:48.452870711Z HTTP2 1: Http2Session <invalid>: socket error [connect EADDRNOTAVAIL 2a00:1450:4001:80e::200a:443 - Local (:::0)]
ERROR 2023-10-20T20:05:48.453065069Z HTTP2 1: Http2Session client: destroying
ERROR 2023-10-20T20:05:48.455429453Z HTTP2 1: Http2Session client: start closing/destroying Error: connect EADDRNOTAVAIL 2a00:1450:4001:80e::200a:443 - Local (:::0) at __node_internal_captureLargerStackTrace (node:internal/errors:496:5) at __node_internal_exceptionWithHostPort (node:internal/errors:671:12) at internalConnect (node:net:1100:16) at defaultTriggerAsyncIdScope (node:internal/async_hooks:463:18) at node:net:1306:9 at processTicksAndRejections (node:internal/process/task_queues:77:11) at runNextTicks (node:internal/process/task_queues:64:3) at listOnTimeout (node:internal/timers:538:9) at process.processTimers (node:internal/timers:512:7) {
ERROR 2023-10-20T20:05:48.455649536Z errno: -99,
ERROR 2023-10-20T20:05:48.455655139Z code: 'EADDRNOTAVAIL',
ERROR 2023-10-20T20:05:48.455660692Z syscall: 'connect',
ERROR 2023-10-20T20:05:48.455666785Z address: '2a00:1450:4001:80e::200a',
ERROR 2023-10-20T20:05:48.455671971Z port: 443
ERROR 2023-10-20T20:05:48.455685973Z }
ERROR 2023-10-20T20:05:48.455985840Z HTTP2 1: Http2Session client: finishSessionClose
ERROR 2023-10-20T20:05:48.456237292Z D 2023-10-20T20:05:48.456Z | subchannel | (85) 2a00:1450:4001:80e::200a:443 connection closed with error connect EADDRNOTAVAIL 2a00:1450:4001:80e::200a:443 - Local (:::0)
ERROR 2023-10-20T20:05:48.456285701Z D 2023-10-20T20:05:48.456Z | subchannel | (85) 2a00:1450:4001:80e::200a:443 connection closed
ERROR 2023-10-20T20:05:48.456303572Z D 2023-10-20T20:05:48.456Z | subchannel | (85) 2a00:1450:4001:80e::200a:443 CONNECTING -> TRANSIENT_FAILURE
ERROR 2023-10-20T20:05:48.456534614Z D 2023-10-20T20:05:48.456Z | subchannel | (85) 2a00:1450:4001:80e::200a:443 TRANSIENT_FAILURE -> IDLE
ERROR 2023-10-20T20:05:48.456552437Z D 2023-10-20T20:05:48.456Z | subchannel | (86) 2a00:1450:4001:80f::200a:443 TRANSIENT_FAILURE -> CONNECTING
ERROR 2023-10-20T20:05:48.456703562Z D 2023-10-20T20:05:48.456Z | subchannel | (86) 2a00:1450:4001:80f::200a:443 creating HTTP/2 session
ERROR 2023-10-20T20:05:48.457291793Z HTTP2 1: Http2Session client: created
ERROR 2023-10-20T20:05:48.457883349Z HTTP2 1: Http2Session <invalid>: socket error [connect EADDRNOTAVAIL 2a00:1450:4001:80f::200a:443 - Local (:::0)]
ERROR 2023-10-20T20:05:48.458050683Z HTTP2 1: Http2Session client: destroying
ERROR 2023-10-20T20:05:48.458296839Z HTTP2 1: Http2Session client: start closing/destroying Error: connect EADDRNOTAVAIL 2a00:1450:4001:80f::200a:443 - Local (:::0) at __node_internal_captureLargerStackTrace (node:internal/errors:496:5) at __node_internal_exceptionWithHostPort (node:internal/errors:671:12) at internalConnect (node:net:1100:16) at defaultTriggerAsyncIdScope (node:internal/async_hooks:463:18) at node:net:1306:9 at processTicksAndRejections (node:internal/process/task_queues:77:11) at runNextTicks (node:internal/process/task_queues:64:3) at listOnTimeout (node:internal/timers:538:9) at process.processTimers (node:internal/timers:512:7) {
ERROR 2023-10-20T20:05:48.458370179Z errno: -99,
ERROR 2023-10-20T20:05:48.458375751Z code: 'EADDRNOTAVAIL',
ERROR 2023-10-20T20:05:48.458381411Z syscall: 'connect',
ERROR 2023-10-20T20:05:48.458386972Z address: '2a00:1450:4001:80f::200a',
ERROR 2023-10-20T20:05:48.458392521Z port: 443
ERROR 2023-10-20T20:05:48.458397787Z }
ERROR 2023-10-20T20:05:48.458540773Z HTTP2 1: Http2Session client: finishSessionClose
ERROR 2023-10-20T20:05:48.458732527Z D 2023-10-20T20:05:48.458Z | subchannel | (86) 2a00:1450:4001:80f::200a:443 connection closed with error connect EADDRNOTAVAIL 2a00:1450:4001:80f::200a:443 - Local (:::0)
ERROR 2023-10-20T20:05:48.458858364Z D 2023-10-20T20:05:48.458Z | subchannel | (86) 2a00:1450:4001:80f::200a:443 connection closed
ERROR 2023-10-20T20:05:48.459002652Z D 2023-10-20T20:05:48.458Z | subchannel | (86) 2a00:1450:4001:80f::200a:443 CONNECTING -> TRANSIENT_FAILURE
ERROR 2023-10-20T20:05:48.459227436Z D 2023-10-20T20:05:48.459Z | subchannel | (86) 2a00:1450:4001:80f::200a:443 TRANSIENT_FAILURE -> IDLE
ERROR 2023-10-20T20:05:48.459422298Z D 2023-10-20T20:05:48.459Z | subchannel | (87) 2a00:1450:4001:810::200a:443 TRANSIENT_FAILURE -> CONNECTING
ERROR 2023-10-20T20:05:48.459660232Z D 2023-10-20T20:05:48.459Z | subchannel | (87) 2a00:1450:4001:810::200a:443 creating HTTP/2 session
ERROR 2023-10-20T20:05:48.460359637Z HTTP2 1: Http2Session client: created
ERROR 2023-10-20T20:05:48.461176918Z HTTP2 1: Http2Session <invalid>: socket error [connect EADDRNOTAVAIL 2a00:1450:4001:810::200a:443 - Local (:::0)]
ERROR 2023-10-20T20:05:48.461356881Z HTTP2 1: Http2Session client: destroying
ERROR 2023-10-20T20:05:48.461647123Z HTTP2 1: Http2Session client: start closing/destroying Error: connect EADDRNOTAVAIL 2a00:1450:4001:810::200a:443 - Local (:::0) at __node_internal_captureLargerStackTrace (node:internal/errors:496:5) at __node_internal_exceptionWithHostPort (node:internal/errors:671:12) at internalConnect (node:net:1100:16) at defaultTriggerAsyncIdScope (node:internal/async_hooks:463:18) at node:net:1306:9 at processTicksAndRejections (node:internal/process/task_queues:77:11) at runNextTicks (node:internal/process/task_queues:64:3) at listOnTimeout (node:internal/timers:538:9) at process.processTimers (node:internal/timers:512:7) {
ERROR 2023-10-20T20:05:48.461748549Z errno: -99,
ERROR 2023-10-20T20:05:48.461754307Z code: 'EADDRNOTAVAIL',
ERROR 2023-10-20T20:05:48.461791167Z syscall: 'connect',
ERROR 2023-10-20T20:05:48.461797891Z address: '2a00:1450:4001:810::200a',
ERROR 2023-10-20T20:05:48.461803923Z port: 443
ERROR 2023-10-20T20:05:48.461809972Z }
ERROR 2023-10-20T20:05:48.461952928Z HTTP2 1: Http2Session client: finishSessionClose
ERROR 2023-10-20T20:05:48.462147565Z D 2023-10-20T20:05:48.462Z | subchannel | (87) 2a00:1450:4001:810::200a:443 connection closed with error connect EADDRNOTAVAIL 2a00:1450:4001:810::200a:443 - Local (:::0)
ERROR 2023-10-20T20:05:48.462308185Z D 2023-10-20T20:05:48.462Z | subchannel | (87) 2a00:1450:4001:810::200a:443 connection closed
ERROR 2023-10-20T20:05:48.462505629Z D 2023-10-20T20:05:48.462Z | subchannel | (87) 2a00:1450:4001:810::200a:443 CONNECTING -> TRANSIENT_FAILURE
ERROR 2023-10-20T20:05:48.462759040Z D 2023-10-20T20:05:48.462Z | subchannel | (87) 2a00:1450:4001:810::200a:443 TRANSIENT_FAILURE -> IDLE
ERROR 2023-10-20T20:05:48.462986925Z D 2023-10-20T20:05:48.462Z | subchannel | (88) 2a00:1450:4001:811::200a:443 TRANSIENT_FAILURE -> CONNECTING
ERROR 2023-10-20T20:05:48.463202698Z D 2023-10-20T20:05:48.463Z | subchannel | (88) 2a00:1450:4001:811::200a:443 creating HTTP/2 session
ERROR 2023-10-20T20:05:48.463943090Z HTTP2 1: Http2Session client: created
ERROR 2023-10-20T20:05:48.464716175Z HTTP2 1: Http2Session <invalid>: socket error [connect EADDRNOTAVAIL 2a00:1450:4001:811::200a:443 - Local (:::0)]
ERROR 2023-10-20T20:05:48.464895203Z HTTP2 1: Http2Session client: destroying
ERROR 2023-10-20T20:05:48.465142845Z HTTP2 1: Http2Session client: start closing/destroying Error: connect EADDRNOTAVAIL 2a00:1450:4001:811::200a:443 - Local (:::0) at __node_internal_captureLargerStackTrace (node:internal/errors:496:5) at __node_internal_exceptionWithHostPort (node:internal/errors:671:12) at internalConnect (node:net:1100:16) at defaultTriggerAsyncIdScope (node:internal/async_hooks:463:18) at node:net:1306:9 at processTicksAndRejections (node:internal/process/task_queues:77:11) at runNextTicks (node:internal/process/task_queues:64:3) at listOnTimeout (node:internal/timers:538:9) at process.processTimers (node:internal/timers:512:7) {
ERROR 2023-10-20T20:05:48.465231833Z errno: -99,
ERROR 2023-10-20T20:05:48.465238476Z code: 'EADDRNOTAVAIL',
ERROR 2023-10-20T20:05:48.465244753Z syscall: 'connect',
ERROR 2023-10-20T20:05:48.465250884Z address: '2a00:1450:4001:811::200a',
ERROR 2023-10-20T20:05:48.465256932Z port: 443
ERROR 2023-10-20T20:05:48.465262473Z }
ERROR 2023-10-20T20:05:48.465483218Z HTTP2 1: Http2Session client: finishSessionClose
ERROR 2023-10-20T20:05:48.465690270Z D 2023-10-20T20:05:48.465Z | subchannel | (88) 2a00:1450:4001:811::200a:443 connection closed with error connect EADDRNOTAVAIL 2a00:1450:4001:811::200a:443 - Local (:::0)
ERROR 2023-10-20T20:05:48.465884563Z D 2023-10-20T20:05:48.465Z | subchannel | (88) 2a00:1450:4001:811::200a:443 connection closed
ERROR 2023-10-20T20:05:48.465998695Z D 2023-10-20T20:05:48.465Z | subchannel | (88) 2a00:1450:4001:811::200a:443 CONNECTING -> TRANSIENT_FAILURE
ERROR 2023-10-20T20:05:48.466187245Z D 2023-10-20T20:05:48.466Z | subchannel | (88) 2a00:1450:4001:811::200a:443 TRANSIENT_FAILURE -> IDLE
feywind commented 10 months ago

I'd advise both of you to open up a support ticket, because this looks less like a library bug and more like an issue with the network path between you and the CPS services.

John98Zakaria commented 10 months ago

Can you point us in the right direction where to look for the problem? We started to get the disruptions at 20:30 last Tuesday without changing our code. We downgraded to node 16 and the problem was gone. So we belive something changed in the underlying nodejs http2 library (nghttp2) which now causes the reconnect not to work.

feywind commented 9 months ago

@John98Zakaria There was apparently a service-side change that activated a bug in Node 18's HTTP/2 stack, and that has since been rolled back. Node 18.18.2 is reported to be in the clear if you need 18.