nats-io / nats.ws

WebSocket NATS
Apache License 2.0

Various errors are thrown that are uncatchable at the call site of nats.ws/connect #78

Closed cuddlecake closed 3 years ago

cuddlecake commented 3 years ago

Hi,

I noticed that nats.ws sometimes throws an error that cannot be caught at the call site.

For example, when I call connect with parameters reconnect: false and timeout: 5 to force a timeout error, I get an error log despite wrapping connect in try-catch:

Unhandled Promise rejection: TIMEOUT ; Zone: <root> ; Task: Promise.then ; Value: NatsError: TIMEOUT

Notably, if I wrap the connect call like this:

catchConnect = async () => {
    try {
        await nats.connect({ reconnect: false, timeout: 5 })
    } catch (err) {
        console.log('caught an error')
    }
}

I get both the Unhandled Promise rejection and the caught an error log in the console.
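As a hedged illustration of the mechanics involved (fakeConnect is a hypothetical stand-in, not nats.ws internals): a rejection is only routed into a try/catch when the promise is awaited, so a second, un-awaited promise created inside the library can still surface as an unhandled rejection even though the outer catch fires.

```typescript
// Minimal sketch: fakeConnect is a hypothetical stand-in that rejects
// after a short delay, like a connect timeout would.
function fakeConnect(): Promise<void> {
  return new Promise((_resolve, reject) => {
    setTimeout(() => reject(new Error("TIMEOUT")), 10);
  });
}

// With `await`, the rejection is routed into the catch block.
async function catchConnect(): Promise<string> {
  try {
    await fakeConnect();
    return "connected";
  } catch {
    return "caught an error";
  }
}

// Without `await`, the try/catch cannot see the rejection; it surfaces
// later as an unhandled promise rejection instead.
```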

This kind of unhandled promise rejection occurs, for example, in:

aricart commented 3 years ago

@cuddlecake what version are you using? Can you please try with 119? npm install nats.ws@latest.

The issue with the error being thrown in an unhandled context was solved here: https://github.com/nats-io/nats.deno/pull/91

Also, the timeout of 5ms is really not right, as the client will never finish the connection that fast.

Finally, yes there's another issue lurking here: https://github.com/nats-io/nats.ws/issues/80

cuddlecake commented 3 years ago

We are using 1.0.0-119 (recently upgraded from 1.0.0-17). Ever since the update, these errors have occurred.

We're not actually using a 5ms timeout in production; I omitted other config parameters for brevity. Here are the full connect parameters we use:

nats.connect({
    servers: endpoints.updateUri,
    name: this.connectionName(this.viewerId),
    authenticator: nats.jwtAuthenticator(token),
    pingInterval: 30_000, // default: 120_000
    maxPingOut: 5, // default: 2
    reconnect: false,
    maxReconnectAttempts: 0, // default: 10
})

Might it be possible that those errors also occur during NatsConnection.subscribe? I can't really tell from the stack trace whether it's from connect or subscribe (we only use connect and subscribe) because neither is mentioned in it.

aricart commented 3 years ago

@cuddlecake the error is always a timeout?

aricart commented 3 years ago

can you await nc.closed() and see what error you get there?

aricart commented 3 years ago

Is the scenario that you connect once, and then on reconnect the timeout surfaces unhandled? You can track what the client is doing with:

(async () => {
  console.info(`connected ${nc.getServer()}`);
  for await (const s of nc.status()) {
    console.info(`${s.type}: ${JSON.stringify(s.data)}`);
  }
})().then();
aricart commented 3 years ago

Not seeing the unhandled rejection - perhaps I am missing something, but:

dialLoop is called from connect (https://github.com/nats-io/nats.deno/blob/main/nats-base-client/protocol.ts#L342), which rejects the connection if it fails.

And from disconnected (https://github.com/nats-io/nats.deno/blob/main/nats-base-client/protocol.ts#L229), which also handles it if it fails.

dial could also throw, but that is awaited: https://github.com/nats-io/nats.deno/blob/main/nats-base-client/protocol.ts#L314.

I am going to try adding some of the options you are using...

cuddlecake commented 3 years ago

@cuddlecake the error is always a timeout?

The types of errors we find in Sentry that are related to nats.ws are:

  1. Unhandled Promise rejection: An invalid or illegal string was specified, this one occurs only on Firefox
  2. Unhandled Promise rejection: Failed to construct 'WebSocket': The URL '<redacted>:4222' is invalid., this is the Chromium variant of the first error. I have redacted the string, but it's basically just ip-address:port, without the wss:// scheme. I'm not sure why this error occurs because the URI we insert is shaped like this: ${protocol}://${window.location.host}/api/update, where protocol is "wss:" or "ws:". (This might also be a completely different issue, since we're using a reverse proxy)
  3. Unhandled Promise rejection: TIMEOUT, I am not sure where this originates, because I don't receive a good stack trace here
  4. Unhandled Promise rejection: undefined ; Zone: <root> ; Task: Promise.then ; Value: NatsError NatsError, this usually occurs after a couple of unhandled TIMEOUT errors. This originates in the socket error callback.

I assume the timeout errors occur in low-connectivity environments or when the device is going offline (it's used extensively with mobile devices).

can you await nc.closed() and see what error you get there?

Not sure where I should await this, because the NatsConnection handle is only returned when the connection succeeds, and the timeout error occurs during the connection attempt. Same for tracking nc.status().

Some other things about how we use nats.ws:

  • We use nats.ws with Angular, so we also use zone.js which patches the global Promise. I will try to implement the connect outside of the Angular zone, that might just be the fix.

  • We do not listen to .closed() and do not listen on error events. Instead, we check every 4 seconds if the connection isClosed() and open a new one (this was the more reliable approach for 1.0.0-17 on iOS because the 'closed' event was not handled when the app was in the background).

If running the connect outside of ngZone doesn't work, I'll try to create a reproduction repository, but I won't be able to reproduce our entire stack (including distributed nats, websocket proxy, reverse proxy)

aricart commented 3 years ago

@cuddlecake the error is always a timeout?

The types of errors we find in Sentry that are related to nats.ws are:

  1. Unhandled Promise rejection: An invalid or illegal string was specified, this one occurs only on Firefox

  2. Unhandled Promise rejection: Failed to construct 'WebSocket': The URL '<redacted>:4222' is invalid., this is the Chromium variant of the first error. I have redacted the string, but it's basically just ip-address:port, without the wss:// scheme. I'm not sure why this error occurs because the URI we insert is shaped like this: ${protocol}://${window.location.host}/api/update, where protocol is "wss:" or "ws:". (This might also be a completely different issue, since we're using a reverse proxy)

If this issue happened before a connect, the client didn't have a URL and defaulted - this is the issue fixed in #80. I believe you have a client that is trying to connect without specifying a server; the base library was adding a default host without going through the motions of converting it to a URL. This is fixed in nats.ws@latest. Unless you are actually running your ws server on 4222, that is the issue that was fixed.
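A small sketch of the URL shaping involved (the helper name and host below are illustrative, not library code): a bare host:port is not a parseable URL, because a URL scheme must start with a letter, which is exactly why the WebSocket constructor rejects it; prefixing a ws:// or wss:// scheme makes it valid.

```typescript
// Hypothetical helper: turn a bare "host:port" into a WebSocket URL.
// "127.0.0.1:4222" on its own fails new URL()/new WebSocket() because
// "127.0.0.1" cannot be parsed as a scheme.
function toWsUrl(hostPort: string, secure = true): string {
  return `${secure ? "wss" : "ws"}://${hostPort}`;
}
```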

  3. Unhandled Promise rejection: TIMEOUT, I am not sure where this originates, because I don't receive a good stack trace here
  4. Unhandled Promise rejection: undefined ; Zone: <root> ; Task: Promise.then ; Value: NatsError NatsError, this usually occurs after a couple of unhandled TIMEOUT errors. This originates in the socket error callback.

I assume the timeout errors occur in low-connectivity environments or when the device is going offline (it's used extensively with mobile devices).

Wondering if the device is not properly populating the error event - or is simply calling the handler without an argument? https://github.com/nats-io/nats.ws/blob/master/src/ws_transport.ts#L133

can you await nc.closed() and see what error you get there?

Not sure where I should await this, because the NatsConnection handle is only returned when the connection succeeds, and the timeout error occurs during the connection attempt. Same for tracking nc.status().

Some other things about how we use nats.ws:

  • We use nats.ws with Angular, so we also use zone.js which patches the global Promise. I will try to implement the connect outside of the Angular zone, that might just be the fix.

Interesting - isn't the point of zone.js to give you a stack trace? Surprised that it is not providing more info.

So one interesting bit, not sure if this is related - the connection process has several promises that must resolve. Some of them are negative (such as the client took too long to handshake), the other is positive, where the PONG protocol message from the server was received and processed. These are used in a Promise.race() which results in either the timeout or the pong resolving. Wondering if the timeout you are logging is simply the timeout failing to resolve AFTER the connection resolved. Since the connection code takes the race, it works correctly, but there are two microtasks being queued, and it is possible that one resolves as the other fails...
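A hedged sketch of that race (the names are illustrative, not the nats-base-client internals): if the losing timeout promise is neither cancelled nor given a rejection handler, it can still reject later as an unhandled microtask even though the race already resolved.

```typescript
// Illustrative only: a cancellable timeout raced against a "PONG" promise.
function timeout(ms: number): { p: Promise<never>; cancel: () => void } {
  let id: ReturnType<typeof setTimeout> | undefined;
  const p = new Promise<never>((_resolve, reject) => {
    id = setTimeout(() => reject(new Error("TIMEOUT")), ms);
  });
  return { p, cancel: () => clearTimeout(id) };
}

async function connectLike(pong: Promise<string>, ms: number): Promise<string> {
  const t = timeout(ms);
  t.p.catch(() => {}); // attach a handler so a late rejection is never unhandled
  try {
    return await Promise.race([pong, t.p]); // whichever settles first wins
  } finally {
    t.cancel(); // stop the losing timer once the race is decided
  }
}
```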

  • We do not listen to .closed() and do not listen on error events. Instead, we check every 4 seconds if the connection isClosed() and open a new one (this was the more reliable approach for 1.0.0-17 on iOS because the 'closed' event was not handled when the app was in the background).

OK - was just clarifying that this is just a first connect issue.

If running the connect outside of ngZone doesn't work, I'll try to create a reproduction repository, but I won't be able to reproduce our entire stack (including distributed nats, websocket proxy, reverse proxy)

aricart commented 3 years ago

One more thing: if you are running via a reverse proxy, is your client ignoring server gossip?

Since you didn't specify whether you were setting one or more servers on your connect, I am wondering if you are also trying to connect to servers that are gossiped but not reachable.

On connect, the server will send information about the cluster. If the client is behind a proxy, the information that is gossiped is NOT reachable (it is however parsed correctly). Likely you'll want to set the option ignoreClusterUpdates https://github.com/nats-io/nats.ws#connection-options.
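As a hedged sketch of those options (the proxy URL below is a placeholder, not a real endpoint): pin the client to the reverse-proxy URL and set ignoreClusterUpdates so it never dials gossiped cluster members that are unreachable from behind the proxy.

```typescript
// Sketch only: connection options for a nats.ws client behind a reverse proxy.
const opts = {
  servers: "wss://example.com/api/update", // placeholder reverse-proxy URL
  ignoreClusterUpdates: true, // don't dial gossiped cluster URLs
};
// const nc = await connect(opts); // the nats.ws connect call would go here
```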

cuddlecake commented 3 years ago

Thank you for your advice and help, by the way.

Likely you'll want to set the option ignoreClusterUpdates

I set ignoreClusterUpdates to true now, so I think the problem with bad WebSocket URLs will go away. Will observe whether the issue is still getting logged in Sentry. FYI, for the connect parameter servers I only supplied the reverse-proxy URL like this: connect({ servers: 'wss://my-url/path' , /* not the actual URL, and omitted other params for brevity */ })

Interesting - isn't the point of zone.js to give you a stack trace? Surprised that it is not providing more info.

It does provide a stack trace; the formatting is just bad and source maps are not applied properly, so it's difficult to inspect. Off-topic: I rue the day I ever decided to use Angular for anything.

In any case, when I run the connect code outside of zone.js using NgZone.runOutsideAngular, I get fewer of those uncatchable errors, so for now I assume zone is the culprit. I still get an error for the very first connection attempt when I set timeout to 5 milliseconds, but the others are fine. It might also have to do with the fact that I'm using RxJS to manage reconnects (since I'm not using the nats.ws reconnect options), and RxJS is not entirely compatible with zone.js.

So one interesting bit, not sure if this is related - the connection process has several promises that must resolve. Some of them are negative (such as the client took too long to handshake), the other is positive, where the PONG protocol message from the server was received and processed.

It's hard to debug these issues on my end due to them mainly appearing on end-user's devices in production, so for now, I think this issue can be closed, since it seems to be Angular specific. I will create new issues when I find a way to isolate and reproduce problems.

aricart commented 3 years ago

That the 5ms timeout rejects the connection should be a given, and it should reject properly. If it isn't doing so for you, let me know and I will take a peek.

I would like you to verify whether the devices are generating some possibly unsafe conditions in the error handler for the clients. If that is the case, I can protect it further (possibly just add additional assertions on the error event handling).

I do think that you should rely on the promises for closed() and the reconnect behavior of the client - but you know your environment better than me :)

cuddlecake commented 3 years ago

I do think that you should rely on the promises for closed() and the reconnect behavior of the client

I would prefer to use the reconnect behaviour of the client, but my problem is that we use a JWT for authorization (an accessToken with a limited lifespan) that is provided by a Promise, ensuring that it always resolves to a valid accessToken (either from local storage, which is the usual case, or from remote if a refresh is required). But the authenticator option does not allow async (https://github.com/nats-io/nats.ws/issues/69), so I can't use it.

I could keep a local accessToken in my socket code, which I try to keep fresh by requesting a token periodically, but if the connect attempt is made with an invalid token even once, I think reconnect will also stop, and I'd have to restart reconnects manually anyway.

aricart commented 3 years ago

Yes, the authenticator cannot really be async for that - the authenticator is supposed to return quickly (before the server closes the connection). But that doesn't prevent your client from pulling a JWT on initialization and then creating a timer that refreshes the JWT before it expires. If you get a rejection more than once, the client will close, and at that point you go through your startup sequence. Your JWTs shouldn't expire that frequently.
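That suggestion might look like the following sketch (TokenCache and fetchToken are hypothetical names, not nats.ws API): keep the current token in a synchronous cache that a timer refreshes before expiry, so a synchronous authenticator callback always reads a fresh value.

```typescript
// Hypothetical pattern: cache a JWT and refresh it on a timer, so the
// synchronous authenticator always has a current token to hand back.
class TokenCache {
  private token = "";
  private timer?: ReturnType<typeof setInterval>;

  // fetchToken is a stand-in for your real async token endpoint.
  constructor(private fetchToken: () => Promise<string>) {}

  async start(refreshMs: number): Promise<void> {
    this.token = await this.fetchToken(); // initial pull before connecting
    this.timer = setInterval(async () => {
      try {
        this.token = await this.fetchToken(); // refresh before expiry
      } catch {
        // keep the old token; the next tick will retry
      }
    }, refreshMs);
  }

  current(): string {
    return this.token; // synchronous, as the authenticator requires
  }

  stop(): void {
    if (this.timer) clearInterval(this.timer);
  }
}
```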