libp2p / go-libp2p-pubsub

The PubSub implementation for go-libp2p
https://github.com/libp2p/specs/tree/master/pubsub

On-demand pubsub #332

Open · Stebalien opened this issue 4 years ago

Stebalien commented 4 years ago

Currently, pubsub starts up as soon as it is initialized. Unfortunately, this means it immediately registers its protocols, opens per-peer streams and goroutines, and starts generating traffic.

This is true even if pubsub is enabled but not in use.

In order to enable pubsub by default in go-ipfs, we need to find some way for pubsub to not take up a bunch of resources when it's not in use.

The MVP solution is on-demand startup. We can start pubsub on demand and stop it some idle timeout after the last subscription closes. This should be fairly simple to implement and will make it possible for us to turn pubsub on by default without significantly increasing resource usage for nodes that aren't even using pubsub.
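A minimal sketch of that lifecycle in Go, with all names and the timeout value as placeholders (none of this is existing go-libp2p-pubsub API):

// Reference-count live subscriptions; tear pubsub down only after the
// last one has been closed for a full idle timeout.
// Assumes: import "sync" and "time".
const idleTimeout = time.Minute // placeholder value

type lifecycle struct {
	mu     sync.Mutex
	active int         // currently open subscriptions
	idle   *time.Timer // pending shutdown, if any
}

func (l *lifecycle) subOpened() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.idle != nil {
		l.idle.Stop() // a subscription came back: cancel the shutdown
		l.idle = nil
	}
	l.active++
	if l.active == 1 {
		// start pubsub: register protocols, open streams, ...
	}
}

func (l *lifecycle) subClosed() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.active--
	if l.active == 0 {
		l.idle = time.AfterFunc(idleTimeout, func() {
			// stop pubsub: unregister protocols, close streams, ...
		})
	}
}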

The ideal solution is idle peer detection. We don't really need to keep a stream/goroutine open per peer; we could instead close streams to peers we haven't spoken to in a while. At the moment, closing the stream will make the peer think we're dead, so we may need a protocol change to implement this correctly.
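To make that concrete, a hedged sketch of what the detection side could look like (every name here is made up; it assumes the send/receive loops record a timestamp per peer):

// Assumes: import "sync", "time", and
// "github.com/libp2p/go-libp2p/core/peer".
var (
	activityMu   sync.Mutex
	peerActivity = map[peer.ID]time.Time{} // last send/receive per peer
)

// reapIdlePeers closes streams to peers we haven't spoken with recently.
// With today's protocol the remote side would interpret this as us
// dying, which is exactly the problem described above.
func reapIdlePeers(maxIdle time.Duration, closeStreams func(peer.ID)) {
	activityMu.Lock()
	defer activityMu.Unlock()
	for p, last := range peerActivity {
		if time.Since(last) > maxIdle {
			closeStreams(p)
			delete(peerActivity, p)
		}
	}
}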

Geo25rey commented 3 years ago

@Stebalien What do you think of each node starting and stopping pubsub outside of libp2p? For example, there could be an ipfs pubsub [start|stop] command. Pubsub seems like an application-specific feature, so the applications that need it would start and stop the feature.

With this method, there could be a problem if more than one application uses pubsub at the same time. I think a good solution would be to use a semaphore in the following way.

var (
	mu        sync.Mutex // guards semaphore (needs import "sync")
	semaphore int        // how many apps are currently using pubsub
)

func start_pubsub() {
	mu.Lock()
	defer mu.Unlock()
	semaphore++
	if semaphore > 1 {
		return // pubsub is already running
	}
	// start pubsub...
}

func stop_pubsub() {
	mu.Lock()
	defer mu.Unlock()
	if semaphore <= 0 {
		return // pubsub isn't running...
	}
	semaphore--
	if semaphore > 0 {
		return // another app is still using pubsub
	}
	// stop pubsub only once the last user is gone
}

I like this method over doing a timeout since it requires less average CPU time.

Stebalien commented 3 years ago

Unfortunately, I don't trust apps to properly manage a semaphore (e.g., they can crash and never reduce it).

The primary reason for a timeout is to reduce the cost of re-starting pubsub (e.g., application restart and/or configuration change). The connections held open by ipfs pubsub subscribe would effectively act as a semaphore/reference count.

Geo25rey commented 3 years ago

Unfortunately, I don't trust apps to properly manage a semaphore (e.g., they can crash and never reduce it).

I meant the semaphore to be managed by libp2p or go-ipfs, not the application using the pubsub service. I do understand your concern, though; I forgot to account for the case where an app doesn't properly run stop_pubsub().

The primary reason for a timeout is to reduce the cost of re-starting pubsub (e.g., application restart and/or configuration change). The connections held open by ipfs pubsub subscribe would effectively act as a semaphore/reference count.

So, the timeout would be relatively short (~10 seconds)? Also, I didn't realize starting pubsub was so taxing. Would adding a "suspend" state be useful?

Stebalien commented 3 years ago

That's what I was thinking (or maybe a minute to be safe?).

When we start pubsub, we'd need to:

  1. Register the protocol with libp2p. Libp2p would then need to tell our peers that we speak the pubsub protocol.
  2. Open streams (in both directions) to all peers that speak pubsub. Unfortunately, the current architecture requires leaving these streams open (which is why I'd like to be able to suspend it).
  3. Send/receive subscriptions from all connected peers.

This isn't terribly expensive, but it's network traffic.
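For step 1, the necessary host API already exists; here is a rough sketch (the wrapper names are invented, and the protocol ID shown is GossipSub v1.1's):

// Assumes: import "github.com/libp2p/go-libp2p/core/host" and
// "github.com/libp2p/go-libp2p/core/network".
const gossipSubID = "/meshsub/1.1.0"

// Registering a handler is what lets libp2p (via identify) advertise
// the protocol to connected peers.
func registerPubsub(h host.Host, handler network.StreamHandler) {
	h.SetStreamHandler(gossipSubID, handler)
}

// Removing the handler would be the suspend path; any streams already
// open (step 2) would still have to be closed separately.
func unregisterPubsub(h host.Host) {
	h.RemoveStreamHandler(gossipSubID)
}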

Geo25rey commented 3 years ago

I could see always keeping pubsub started. To suspend after the idle timeout, a request could be sent to all peers asking them not to send any more data on the opened streams, and any data received after the suspend state starts could be ignored. To ignore incoming data, have a relatively small buffer continuously accept data from a blocking read syscall and do nothing with it.
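The "do nothing with the data" part is a one-liner in Go; for example, assuming s is the already-open network.Stream:

// Sketch: swallow anything the remote sends while suspended.
// io.Copy reuses a small internal buffer and io.Discard drops every
// byte, so this just blocks on reads until the stream is closed.
// Assumes: import "io" and "github.com/libp2p/go-libp2p/core/network".
go func(s network.Stream) {
	_, _ = io.Copy(io.Discard, s)
}(s)

The bytes still arrive over the wire, though, so this presumably doesn't reduce the traffic cost.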

Stebalien commented 3 years ago

Unfortunately, that's not really going to help.

RubenKelevra commented 3 years ago

@Stebalien wrote:

At the moment this will make the peer think we're dead so we may need a protocol change to implement this correctly.

Maybe just remove this expectation? Peers go online and offline all the time, so "dead" should maybe never be assumed, but instead checked on demand.

We'd just keep track of peers that closed the connection, with an upper limit of, say, 100 peers in the cache and an age limit of around 24 hours, after which they disappear.

This would let us reconnect without wasting much traffic and just continue from the last state they had.
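A sketch of such a cache, using the limits above as placeholders (nothing here is existing API):

// Assumes: import "sync", "time", and
// "github.com/libp2p/go-libp2p/core/peer".
const (
	maxRemembered = 100            // upper limit suggested above
	rememberFor   = 24 * time.Hour // cache age suggested above
)

type closedPeer struct {
	closedAt time.Time
	// the peer's last known state (subscriptions, etc.) would live here
}

type peerCache struct {
	mu      sync.Mutex
	entries map[peer.ID]closedPeer
}

func newPeerCache() *peerCache {
	return &peerCache{entries: make(map[peer.ID]closedPeer)}
}

// remember records a peer that closed its connection, expiring stale
// entries and evicting the oldest one if the cache is full.
func (c *peerCache) remember(p peer.ID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	var oldest peer.ID
	var oldestAt time.Time
	for id, e := range c.entries {
		if time.Since(e.closedAt) > rememberFor {
			delete(c.entries, id)
			continue
		}
		if oldestAt.IsZero() || e.closedAt.Before(oldestAt) {
			oldest, oldestAt = id, e.closedAt
		}
	}
	if len(c.entries) >= maxRemembered {
		delete(c.entries, oldest)
	}
	c.entries[p] = closedPeer{closedAt: time.Now()}
}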