nknorg / nkn-sdk-js

JavaScript Implementation of NKN Client and Wallet SDK

Client only connects 50% of the time using TLS and doesn't reconnect #90

Closed: basedwon closed this issue 1 year ago

basedwon commented 2 years ago

Client only connects 50% of the time using TLS and doesn't reconnect. I can't seem to catch the error to force a reconnect. Looking through the source code, it looks like you might not be catching errors from isomorphic-ws to initiate the reconnection process. This is a major problem: most sites will be using TLS, and this makes connecting to the network inconsistent and unreliable.

basedwon commented 2 years ago

Here's my not-so-good workaround; it sometimes connects twice:

let timer
const connect = () => {
  clearTimeout(timer)
  const client = new nkn.Client()
  // if onConnect hasn't fired within 2s, start over with a fresh client
  timer = setTimeout(connect, 2000)
  return new Promise((resolve, reject) => {
    client.onConnect((data) => {
      clearTimeout(timer)
      resolve(client)
    })
  })
}
// note: a retried attempt resolves its own promise, not the one being
// awaited, which is why this sometimes ends up connecting twice
const client = await connect()
yilunzhang commented 2 years ago

The issue you're seeing is because the node does not support https, so reconnecting doesn't help.

Not every node supports https, because it requires a certificate issued by a CA, can only use port 443 (due to Let's Encrypt restrictions), and needs root privilege (to open port 443). We are trying our best to get as many nodes as possible to support https, but in the end it's up to the node providers.

Supporting https is generally a pain for decentralized nodes because the whole logic of https is centralized: it needs a (sub)domain, which relies on a centralized name server to resolve to an IP address; it needs a certificate, which is issued by a centralized CA (and often not free); and it relies on the browser to verify the certificate, and the browser can block the domain at any time.

There are a few solutions to this:

basedwon commented 2 years ago

Ok, that makes sense. I was trying to make it work with both single and multi-client mode, but I guess I can just use multi if that's what you'd recommend. Would there be a way to catch the failed connection as an error? It seems like even with multi-client mode you'd still want to ensure a certain minimum number of connected nodes. Otherwise, I could use the timeout to achieve this.

And I do agree with you regarding TLS and its centralized nature. I'm not a fan, actually, and that's part of what I like about NKN: it gives you a truly e2ee wire/transport. But then all the users like to see that green lock in their URL bar, so it's a double-edged sword.

Thanks for your prompt and thorough response @yilunzhang

yilunzhang commented 2 years ago

Currently there is only an event for successful connection, not for failure, because the client will always try to reconnect, so a "failed" client just means it's not connected yet and might become connected later.

For multiclient, a connect event will be triggered when at least one client is connected. If you want to see the state of each client, you can use multiclient.clients to get the underlying clients and check their isReady state. You can find more information about what attributes/methods are available at https://docs.nkn.org/nkn-sdk-js/
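For example, something like this (rough sketch) would print the state of each sub-client once the multiclient gets its first connection:

const multiclient = new nkn.MultiClient()
multiclient.onConnect(() => {
  // multiclient.clients maps clientID -> underlying client
  for (const [id, c] of Object.entries(multiclient.clients)) {
    console.log(`client ${id}: isReady = ${c.isReady}`)
  }
})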

basedwon commented 2 years ago

I guess that's the problem I'm having, though: if the connection fails, it never tries again, which is why I'm setting that timeout to retry. It's not really a blocker because I can just use the multi-client (although I haven't tested that yet). But I'd think you could put the connection inside a try/catch block and emit a "failed" event if the connection fails. Then you could set the multi-client to 4 concurrent nodes and it would eventually achieve that, even if it has to go through many nodes that aren't https compatible.

The main reason I'm doing this with TLS is that I'm working with service workers, which require TLS. I would also like to achieve as much parallelism as possible, so if I set the multi-client to 8 I'd like it to connect to 8 nodes. And what happens if all 8 fail? Then it won't connect, and how would I ensure it connects at all? I'd have to use timeouts to ensure connection, which seems like a less-than-desirable solution.

yilunzhang commented 2 years ago

The problem here is that the reconnect logic is only triggered after the client has connected to the node for the first time. In your case, when the node doesn't support wss, reconnecting won't help because the client will try the same node during reconnect (because of the DHT topology requirement) and it will fail again.

I think what can be improved is that, when the client fails to connect for the first time (so reconnect will not be triggered), it can emit a fail event. But if the client has connected to the node and is somehow disconnected later (it will try to reconnect in this case), it won't emit a fail event.

For multiclient it would be trickier. I would prefer that a multiclient only emit a fail event when ALL of the underlying clients have failed. This way a fail event consistently means the client/multiclient is no longer usable.
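In code, the multiclient behavior would look something like this (sketch of the proposal):

multiclient.onConnectFailed(() => {
  // fires only after ALL underlying clients have failed to connect
})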

What do you think of it?

basedwon commented 2 years ago

Yes, that's exactly what I'm thinking. My biggest concern is that a client wouldn't connect at all; getting the full number of parallel nodes isn't a necessity, it would just be nice. I can appreciate that doing this with the multi-client could be complex, however. But emitting an event for a failed connection, whether for the single client or all of the multi-client nodes, would give me/the developer the ability to retry until the onConnect event is emitted.

I'm not set up for TypeScript, but I'd be happy to help implement this if you want. I can continue working without this feature for now, but I'd really like to have a connection guarantee for each user. I am building a whole platform using NKN as the infrastructure and I want it to be just as reliable as the normal internet. I believe NKN will be a game-changer if it can compete in this way. Thanks for your work @yilunzhang

yilunzhang commented 2 years ago

But emitting an event for a failed connection, whether for the single client or all of the multi-client nodes, would give me/the developer the ability to retry until the onConnect event is emitted.

It should be pretty straightforward to listen for each single client's connect failed event as well: just add a handler to each client the multiclient creates.
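Something like this (rough sketch, assuming the per-client connect failed handler described above):

const multiclient = new nkn.MultiClient()
for (const [id, c] of Object.entries(multiclient.clients)) {
  c.onConnectFailed(() => console.log(`client ${id} failed to connect`))
}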

I'm not set up for TypeScript, but I'd be happy to help implement this if you want.

Really appreciate that! Sadly I didn't check my email in time, and by the time I saw this I already had it implemented... Will create a PR shortly so you can review it and see how it looks to you.

yilunzhang commented 2 years ago

Just created PR #91 for the connect failed handler. The added example code also shows how to listen for the connect failed event on each client that a multiclient creates, so you can have more control. Please feel free to let me know how it looks to you.

basedwon commented 2 years ago

Wow @yilunzhang! That was amazingly fast! I'll check it out right now and let you know how it goes. It'll probably take me longer to test it than it took you to implement it : )

basedwon commented 2 years ago

It looks like it's working perfectly with the single client. Next I'll test it with the multi-client.

One note, and this is perhaps just a personal preference: I prefer it when libs don't console.log unless I want them to, via a debug option, for example. So, in this new code it looks like you log "Client connect failed", and you're also logging an error even when there isn't one.

Here's a screengrab of the culprits:

[screenshots of the console output omitted]

I don't think anything can be done about the WebSocket failed error, but maybe you could pull those other logs out?

The common pattern is to just emit an error event and let the dev decide what to do with it, but a debug flag could work also.

basedwon commented 2 years ago

But it seems to work great! It's connecting on the first try more often than not, which is a good sign. Here's a little snippet I made that retries the connection up to a given number of tries, using a promise:

// retries with a fresh client until connected or `tries` is exhausted
const connect = (opts, tries = 2, num = 0, cb) => {
  const client = new nkn.Client(opts)
  return new Promise((resolve, reject) => {
    client.onConnectFailed(() => {
      num++
      if (num < tries)
        resolve(connect(opts, tries, num, cb)) // recurse with a new client
      else
        reject(new Error(`failed to connect after ${tries} tries`))
    })
    client.onConnect(data => {
      if (typeof cb === 'function')
        cb(data)
      resolve(client)
    })
  })
}
const client = await connect()
console.log(client)

I'll test the multi-client and then I could put this in the examples somewhere, what do you think?

yilunzhang commented 2 years ago

The common pattern is to just emit an error event and let the dev decide what to do with it, but a debug flag could work also.

That's a good point. I changed the log behavior so it only prints when no onConnectFailed handler is added, and pushed it to the PR.

I'll test the multi-client and then I could put this in the examples somewhere, what do you think?

I was thinking maybe you can create a new file in the examples folder, with the existing client.js serving as the basic example. In the future I think it's better to have separate examples: basic send/receive, how to use session, how to detect and handle connect failure, etc. It should scale better as we add more complete and complex examples.

basedwon commented 2 years ago

I was thinking maybe you can create a new file in the examples folder, with the existing client.js serving as the basic example. In the future I think it's better to have separate examples: basic send/receive, how to use session, how to detect and handle connect failure, etc. It should scale better as we add more complete and complex examples.

Yeah, that makes sense, I can definitely contribute to that. I've been working with session and doing some more advanced things with it.

basedwon commented 2 years ago

@yilunzhang -- a question about the multi-client: since you have to put the onConnectFailed handler on each client, what would be the best way to use that clientID to try to connect again? Would I use clientID as the identifier and just access the clients object manually? Or is there a better built-in/internal way to do that?

yilunzhang commented 2 years ago

You can use multiclient.clients[clientID] to manually access the client. Then you can call internal methods like _connect to manually trigger reconnect.
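For example (sketch only; _connect is an internal method and may change):

for (const clientID of Object.keys(multiclient.clients)) {
  const c = multiclient.clients[clientID]
  // manually retry this sub-client when its first connect fails
  c.onConnectFailed(() => c._connect())
}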

But on the other hand, we are trying our best to make the client connect/reconnect to a node whenever possible. In the case of a failed connect, 99% of the time a manual reconnect won't help either (e.g. the node doesn't support wss). If it does help, we might want to improve the SDK. So eventually we hope users won't need to call those internal methods at all.

basedwon commented 2 years ago

Ok, so it looks like it only calls _reconnect if shouldReconnect is true, and that doesn't happen until ws.onopen is triggered. So, I could just fire _connect on each client in the onConnectFailed handler. And it looks like it will find a new address each time I call _connect.

I'll play around with it; this is definitely something that would be good to bake into the SDK. Perhaps a connect method that returns a promise and accepts a max number of tries. We could also add an auto-connect option to the constructor so you wouldn't have to call the connect method.

I'll play around...

yilunzhang commented 2 years ago

Ok, so it looks like it only calls _reconnect if shouldReconnect is true, and that doesn't happen until ws.onopen is triggered. So, I could just fire _connect on each client in the onConnectFailed handler. And it looks like it will find a new address each time I call _connect.

Yes, you can do that, but it will NOT find a new address each time unless the network topology has changed (e.g. that particular node went offline), which is very rare. Most likely it will return the same node, and the client will try to connect to it again.

The reason it returns the same node is DHT topology restrictions. A client has its own unique and permanent address on the DHT overlay, and it will connect to the node that is closest to that client address. This way packets can be routed to their destination easily in a purely decentralized way. Think of it this way: if it gave you a new node each time, how would the packet routing algorithm know where to route the packet, given there is no centralized system storing all this information? That's why calling the _connect method again (almost always) makes no difference.

basedwon commented 2 years ago

Ok, I see, I'm pretty familiar with DHTs -- but why does it connect to a new node every time in single client mode?

yilunzhang commented 2 years ago

Do you mean each time you call it on the same client, or each time you create a new client (without passing a seed, so it generates a random one)?

If you call it on the same client, it shouldn't change the node it connects to unless the previous node is offline or a closer one comes online.

If you are comparing different clients (creating a new client each time with new nkn.Client()), then each one generates a random key pair and thus has a different DHT address. A client's DHT address is the hash of the client identifier + public key, so a different key pair definitely changes which node it connects to.

basedwon commented 2 years ago

Ah, yes, of course. So if you want to multiplex with 4 nodes, those will always be the 4 closest nodes to your address?

yilunzhang commented 2 years ago

Not really. Multiclient multiplexes by creating clients with different identifier prefixes. So if you create a multiclient with identifier id and public key pk, you will use id.pk as the multiclient address, while the multiclient creates 4 clients:

__0__.id.pk
__1__.id.pk
__2__.id.pk
__3__.id.pk

and the DHT addresses of these 4 clients are

sha256(__0__.id.pk)
sha256(__1__.id.pk)
sha256(__2__.id.pk)
sha256(__3__.id.pk)

This way the multiclient will connect to 4 completely unrelated nodes, for best availability and the least exposure to any DHT-topology-related issue.
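For illustration, a rough sketch of that derivation in Node (assuming hex digests; the SDK's actual encoding may differ):

const crypto = require('crypto')
const id = 'id', pk = 'pk' // placeholder identifier and public key
for (let i = 0; i < 4; i++) {
  const addr = `__${i}__.${id}.${pk}`
  // DHT address = sha256 of the full client address
  console.log(addr, '->', crypto.createHash('sha256').update(addr).digest('hex'))
}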

basedwon commented 2 years ago

Here's a class for the single client; I'll do something similar with the multi-client and then try to combine them, which was my original goal.

class Client {
  constructor(opts = {}) {
    Object.defineProperty(this, '_opts', { value: opts })
    Object.defineProperty(this, '_events', { value: { connect: [], error: [] }})
    Object.defineProperty(this, '_connecting', { value: false, writable: true })
    this.connect()
  }
  onConnect(fn) {
    this._events.connect.push(fn)
  }
  onError(fn) {
    this._events.error.push(fn)
  }
  connect(tries) {
    // already connected: resolve immediately with the existing client
    if (this.client)
      return Promise.resolve(this.client)
    // a connect attempt is in flight: piggyback on its events
    if (this._connecting) {
      return new Promise((resolve, reject) => {
        this.onConnect(resolve)
        this.onError(reject)
      })
    } else
      this._connecting = true
    tries = tries || this._opts.tries || 3
    return this._connect(this._opts, tries)
      .then(client => {
        this._connecting = false
        Object.defineProperty(this, 'client', { value: client })
        this._events.connect.forEach(fn => fn(client))
        return client
      })
      .catch(err => {
        this._connecting = false
        return this._events.error.length 
          ? this._events.error.forEach(fn => fn(err))
          : console.error(err)
      })
  }
  // creates a fresh nkn.Client per attempt, up to `tries` attempts
  _connect(opts, tries, num = 0) {
    const client = new nkn.Client(opts)
    return new Promise((resolve, reject) => {
      client.onConnectFailed(() => {
        num++
        if (num < tries)
          resolve(this._connect(opts, tries, num))
        else
          reject(new Error(`failed to connect after ${tries} tries`))
      })
      client.onConnect(data => { resolve(client) })
    })
  }
}

And so I guess if I were to include a seed, it would derive the same key on every attempt and perhaps never connect. So there would be "TLS-friendly keys". But I do see what you're saying about the multiplex addresses being unique.

basedwon commented 2 years ago

So, I've been digging through the code and playing around, and here are my thoughts: you don't want to do anything too complicated here, so you could almost just recommend using the multi-client, since only 1 of its 4 sub-clients needs to connect.

If you did want to get complicated, you could compare the number of successful connections to numSubClients and, if say only 2 of 4 actually connected, try adding 2 more clients, and so on until it reaches 4 connections. I'm not sure if that would interfere with how session works. And it would be pretty hairy: looking through the code, most of the sub-client setup happens in the constructor, which you'd probably want to encapsulate in its own method.

etc, etc... or just recommend a multi-client of at least 4 or maybe more to ensure the TLS client connects.

I'm working on a package that I should be "finishing" within a week or so that wraps up a lot of the NKN functionality into a "super RPC". I'll let you know when I publish.

yilunzhang commented 2 years ago

Adding clients to a multiclient with a customized identifier prefix will cause some trouble, mostly because other people don't know what your identifier prefix is; that's why __i__ is part of the protocol.

I would say just using a multiclient with the default (4) or more sub-clients should work. As the number of clients increases, the chance of having no client connected decreases exponentially. If a client connects 50% of the time (and we definitely want more than that), then a multiclient with 10 sub-clients will have a 99.9% chance to connect.
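A quick sanity check of that math (not SDK code):

// P(at least one of n sub-clients connects), given per-client success rate p
const atLeastOne = (p, n) => 1 - (1 - p) ** n
console.log(atLeastOne(0.5, 4))  // 0.9375
console.log(atLeastOne(0.5, 10)) // 0.9990234375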

basedwon commented 2 years ago

Yeah, I thought that might be the case. The simplest solution is usually the best. What limitations have you seen with the number of sub-clients? Is 10 a good number? How many is too many? That probably depends on the device, but what have you observed working well?

yilunzhang commented 2 years ago

We chose 4 as the default because, based on our tests, performance (latency, session throughput) does not increase significantly beyond that point. But we didn't consider TLS success rate (which kind of changes over time) in our tests. So if that's your main concern, you might want to choose something higher than 4. I don't have any data indicating what number yet; 10 is just an arbitrary number I used in the previous post...

basedwon commented 2 years ago

When I've worked with WebRTC, I think a normal laptop would max out at about 25 concurrent connections. I would imagine ws is similar. But 10 is probably a safe bet -- maybe I'll turn it up to 11 ; )

basedwon commented 2 years ago

@yilunzhang this PR looks good to me. One thing I noticed is that you're still logging the WS error, which is actually an Event; you probably don't even need that onerror listener in the _newWsAddr method in client.js.


It just logs undefined whenever a WS connection fails.

yilunzhang commented 2 years ago

This onerror is very useful when an existing ws connection has errors (e.g. network lost, tcp timeout, etc.), as you will see when you keep a connection open for a long time. I will create another PR that adds an onWsError method to the client so users can choose how to deal with those ws errors instead of having them logged to the console.

yilunzhang commented 2 years ago

The wsError handler is added in #92.
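Usage is along these lines (sketch; the handler receives the underlying ws error event):

client.onWsError((event) => {
  // decide what to do with raw ws errors instead of the default console logging
})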

basedwon commented 2 years ago

Yes, I did see it. I had tried commenting out the onerror and it crashed the process, so it's good to catch it then.

basedwon commented 2 years ago

So, just to follow up on this, here's a class that leverages a few tricks to guarantee a connection to some nodes, plus a send method that retries indefinitely. Basically, making NKN universally and consistently reliable (99.9%):

class Client {
  constructor(opts = {}) {
    opts.numSubClients = opts.numSubClients || 4
    const setClient = async (opts = {}) => {
      // retry until connected, adding 4 more sub-clients on each failure
      const connect = () => new Promise((resolve, reject) => {
        const client = new nkn.MultiClient(opts)
        client.onConnectFailed(() => {
          opts.numSubClients += 4
          resolve(connect())
        })
        client.onConnect(data => resolve(client))
        client.onWsError(() => {}) // silence raw ws error logging
      })
      Object.defineProperty(this, 'client', { value: await connect() })
      if (opts.onMessage)
        Object.defineProperty(this, '_onMessage', { value: opts.onMessage.bind(this) })
      this.client.onMessage(this._onMessage.bind(this))
      await this.start()
      if (this._onClient)
        await this._onClient()
    }
    setClient(opts) // pass full opts so onMessage etc. are preserved
    this.start()
  }
  static start(opts) {
    const client = new this(opts)
    return client.start().then(() => client)
  }
  // idempotent: resolves once the underlying MultiClient is connected
  start() {
    return new Promise((resolve, reject) => {
      if (this.client) {
        if (this.client.isReady)
          return resolve()
        this.client.onConnectFailed(reject) // <= shouldn't ever happen
        return this.client.onConnect(({ node, sigChainBlockHash }) => resolve())
      } else if (!this._onClient) {
        // client not created yet: stash a resolver for setClient to call
        this._onClient = () => {
          delete this._onClient
          resolve()
        }
      } else {
        // chain onto an already-stashed resolver
        const orig = this._onClient
        this._onClient = () => {
          orig()
          resolve()
        }
      }
    })
  }
  _onMessage({ src, payload }) {
    console.log(`got message from ${src}`, payload)
  }
  // waits for readiness, then retries send on message timeout
  async send(addr, data) {
    if (!this.client || !this.client.isReady)
      await this.start()
    return this.client.send(addr, data).catch(err => {
      if (err.toString() === 'Error: failed to send with any client: Error: Message timeout')
        return this.send(addr, data)
      throw err
    })
  }
}

Use it like this:

const client = await Client.start(opts)
// -- or --
const client = new Client(opts)
await client.start()
// -- or even just new (as the send method will wait for the client to be ready)
const client = new Client(opts)

// send will attempt to send until successful
let res = await client.send('nknaddress', 'foo bar')

The start() method is idempotent: it's called in the constructor, but it can be called multiple times and will resolve once the MultiClient has connected.

yilunzhang commented 1 year ago

Closing this issue now as it seems resolved. Feel free to continue the discussion or reopen it. (I really love the discussions here though)