libp2p / go-libp2p

libp2p implementation in Go

eventbus: log a warning if an event channel is full (and continue to block) #2361

Open MarcoPolo opened 1 year ago

MarcoPolo commented 1 year ago

We had another case pop up where it wasn't obvious that an event channel from the eventbus was full and things were getting blocked. We could be nicer here and log something for the user, with the tradeoff of paying for an extra timer and select. I'm thinking of something like this:

Before

sink.ch <- evt

After

timer.Reset(time.Second)
select {
case <-timer.C:
    log.Errorf("Event to %s is being delayed because the event channel is full (len: %d). Are you doing blocking work? If so, spawn a goroutine.", sink.name, len(sink.ch))
    sink.ch <- evt // still block
case sink.ch <- evt:
}

(plus some extra logic around cleaning up and resetting the timer; ideally the timer would be kept as part of the *node)
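
For illustration, a minimal sketch of that cleanup/reset logic; sink.name, sink.ch, and a reusable timer stored on the node are assumptions about the eventbus internals, not the actual implementation:

// Sketch only: sink.name, sink.ch and the reusable timer are assumptions.
timer.Reset(time.Second)
select {
case sink.ch <- evt:
    // Sent promptly: stop the timer, and drain a fire that may have raced
    // in, so the next Reset starts from a clean state.
    if !timer.Stop() {
        select {
        case <-timer.C:
        default:
        }
    }
case <-timer.C:
    log.Errorf("Event to %s is being delayed because the event channel is full (len: %d). Are you doing blocking work? If so, spawn a goroutine.", sink.name, len(sink.ch))
    sink.ch <- evt // still block
}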

marten-seemann commented 1 year ago

Setting a timer is quite expensive. We discussed this a few months ago when debugging a similar case, and decided that event bus metrics are the cleaner solution.

Maybe the event bus shouldn’t be async? That would be a pretty big design change, but it would allow us to measure the time it takes to consume an event (subtracting two timestamps is cheap compared to arming a timer).
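
For illustration only, a rough sketch of what a synchronous delivery path with timestamp-based measurement could look like; the types, field names, and one-second threshold are assumptions, not the current eventbus API:

// Illustrative sketch of synchronous delivery: invoke each subscriber inline
// and compare two timestamps to spot slow consumers.
type handler struct {
    name string
    fn   func(evt interface{})
}

type syncEmitter struct {
    handlers []handler
}

const slowConsumerThreshold = time.Second

func (e *syncEmitter) Emit(evt interface{}) {
    for _, h := range e.handlers {
        start := time.Now()
        h.fn(evt) // blocks until the subscriber has consumed the event
        if d := time.Since(start); d > slowConsumerThreshold {
            log.Warnf("subscriber %s took %s to consume an event", h.name, d)
        }
    }
}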

Wondertan commented 1 year ago

https://github.com/libp2p/go-libp2p/pull/2026

Wondertan commented 1 year ago

Btw, with the solution above, the timer is only allocated once the subscription is overflowed, which won't happen in an app that reads notifications correctly, so I don't think there is a performance penalty for the timer when there are no bugs on the app/protocol side.

marten-seemann commented 1 year ago

How so? You always need to set the timer, because you don’t know when the sending on the channel would block.

Really, this is a problem that should be solved by monitoring. We already have a Grafana dashboard that immediately shows you the source of the problem, including line number. This sounds like an excellent use case for Grafana alerting.

Wondertan commented 1 year ago

The blocking send on the channel is detected with a default case, so the timer is only armed once the channel is already full.

sukunrt commented 1 year ago

@MarcoPolo What more info would the log line provide over the libp2p_eventbus_subscriber_queue_length metric?

Wondertan commented 1 year ago

Not all nodes run with metrics enabled, while every node runs with logs on. And there can be a case where the Kubo node hosted by PL with metrics enabled does not see any issues, while some random node runner experiences the problem and has no data to include in a bug report. The point is that these two are not exclusive but complementary.

MarcoPolo commented 1 year ago

These are not mutually exclusive. I think we can do something nicer here besides asking every user to set up a Prometheus and Grafana instance. Especially since this has bitten multiple people who are familiar with the stack.

We can avoid the timer by tracking how long it took to place an event on the channel and logging an error if that's over a threshold. That won't protect against a consumer that never returns, but at least it would highlight a slow consumer.
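
A hedged sketch of that idea; no timer is armed, only two timestamps are compared after the send completes (the field names and one-second threshold are illustrative):

// Illustrative sketch: time the (possibly blocking) send and log afterwards
// if it exceeded a threshold.
start := time.Now()
sink.ch <- evt
if d := time.Since(start); d > time.Second {
    log.Errorf("Event to %s was delayed %s because the event channel was full. Are you doing blocking work? If so, spawn a goroutine.", sink.name, d)
}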

sukunrt commented 1 year ago

@marten-seemann when we say resetting a timer is expensive, do you mean anything other than the CPU time taken? These benchmarks take similar time for me:

package bench

import (
    "testing"
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

func BenchmarkTimerReset(b *testing.B) {
    var durations []time.Duration
    for i := 0; i < 100; i++ {
        durations = append(durations, time.Hour+time.Duration(i)*time.Minute)
    }
    t := time.NewTimer(100 * time.Millisecond)
    for i := 0; i < b.N; i++ {
        t.Reset(durations[i%len(durations)])
    }
}

func BenchmarkCounterVecIncrement(b *testing.B) {
    counter := prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: "lib",
        Name:      "counter",
    },
        []string{"label1"},
    )
    label := []string{"hello"}
    for i := 0; i < b.N; i++ {
        counter.WithLabelValues(label...).Inc()
    }
}

Both of them take about 30 ns on an M1 Mac and 40-50 ns on an old Core i7 (i7-8550U @ 1.80GHz).

Looking at these numbers, I'm fine with a pattern like this, though it is rather ugly.

select {
case sendch <- evt:
default:
    t.Reset(time.Second)
    select {
    case <-t.C:
        log.Warnf("queue full for %s", name)
        sendch <- evt
    case sendch <- evt:
    }
}

marten-seemann commented 1 year ago

There's a bigger difference when you reset the timer to an earlier instant instead of a later one, since that will result in a syscall.

The thing I'm worried about is the underlying API question. If we built an API that's so easy to misuse that it requires metrics AND logging, maybe there's something wrong with the API. I suggested switching to a sync API earlier, but of course you can always block a sync callback as well. That applies to any API that uses callbacks.

marten-seemann commented 1 year ago

https://github.com/libp2p/go-libp2p/pull/2383 will make it possible to spin up a local Grafana instance (with all our dashboards pre-installed!) by just running docker-compose up in the dashboards directory. Can't get any simpler than that!