radicle-dev / radicle-link

The second iteration of the Radicle code collaboration protocol.

Rate limits #693

Closed · kim closed this 3 years ago

kim commented 3 years ago

Apply some rate limiting to membership + gossip ingress traffic. Quota defaults are fictional.

kim commented 3 years ago

https://en.wikipedia.org/wiki/Generic_cell_rate_algorithm
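Roughly, GCRA tracks a "theoretical arrival time" (TAT) per sender: a message arriving earlier than TAT minus a burst tolerance is non-conforming; otherwise TAT is pushed forward by one emission interval. A minimal std-only sketch of that idea (illustrative only, not the implementation in this series, and the parameter values are made up):

```rust
use std::time::{Duration, Instant};

/// Minimal sketch of the Generic Cell Rate Algorithm (GCRA).
/// Names and defaults are illustrative, not what this PR uses.
struct Gcra {
    /// Emission interval: on average, one message per this duration.
    emission_interval: Duration,
    /// Tolerance: how far ahead of schedule a burst may run.
    tolerance: Duration,
    /// Theoretical arrival time of the next conforming message.
    tat: Instant,
}

impl Gcra {
    fn new(emission_interval: Duration, tolerance: Duration) -> Self {
        Self {
            emission_interval,
            tolerance,
            tat: Instant::now(),
        }
    }

    /// Returns `true` if a message arriving at `now` conforms to the limit.
    fn check(&mut self, now: Instant) -> bool {
        if now + self.tolerance < self.tat {
            // Arrived too early: non-conforming, state is left untouched.
            false
        } else {
            // Conforming: push the theoretical arrival time forward.
            self.tat = self.tat.max(now) + self.emission_interval;
            true
        }
    }
}

fn main() {
    // ~10 messages/sec on average, with a burst tolerance of 5 messages.
    let mut limiter = Gcra::new(Duration::from_millis(100), Duration::from_millis(500));
    let now = Instant::now();
    for i in 0..8 {
        println!("message {}: conforming = {}", i, limiter.check(now));
    }
}
```

With these example values, a burst of about six simultaneous messages gets through before the limiter starts rejecting.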

kim commented 3 years ago

The semantics are that you'll be disconnected immediately if you send too many membership messages (which is fair, because you're not well-behaved). For gossip, it drops messages, which is not ideal because we're not looking at the payload. It would probably be better to cache origin || payload, which is inevitably a Stale result when received more than once. But then I'm not sure how to make this cache bounded...
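For illustration only, one way to bound such a cache is to key it on a hash of origin || payload and evict in FIFO order once a fixed capacity is reached. The hasher, key layout, and capacity below are assumptions, not what radicle-link does:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashSet, VecDeque};
use std::hash::{Hash, Hasher};

/// Sketch of a bounded "seen gossip" cache keyed by hash(origin || payload).
/// FIFO eviction keeps the memory bound explicit; all names are illustrative.
struct SeenCache {
    capacity: usize,
    order: VecDeque<u64>,
    seen: HashSet<u64>,
}

impl SeenCache {
    fn new(capacity: usize) -> Self {
        Self {
            capacity,
            order: VecDeque::with_capacity(capacity),
            seen: HashSet::with_capacity(capacity),
        }
    }

    fn key(origin: &[u8], payload: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        origin.hash(&mut h);
        payload.hash(&mut h);
        h.finish()
    }

    /// Returns `true` if this (origin, payload) pair was already seen,
    /// i.e. the gossip is stale and need not be forwarded.
    fn is_stale(&mut self, origin: &[u8], payload: &[u8]) -> bool {
        let k = Self::key(origin, payload);
        if self.seen.contains(&k) {
            return true;
        }
        if self.order.len() == self.capacity {
            // Evict the oldest entry to stay within the bound.
            if let Some(old) = self.order.pop_front() {
                self.seen.remove(&old);
            }
        }
        self.order.push_back(k);
        self.seen.insert(k);
        false
    }
}

fn main() {
    let mut cache = SeenCache::new(2);
    assert!(!cache.is_stale(b"peer-a", b"hello"));
    assert!(cache.is_stale(b"peer-a", b"hello")); // repeat: stale
    assert!(!cache.is_stale(b"peer-b", b"hello")); // different origin: fresh
}
```

An LRU or a TTL tied to gossip expiry would keep hot entries around longer; plain FIFO is just the simplest way to make the bound explicit.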

FintanH commented 3 years ago

The semantics are that you'll be disconnected immediately if you send too many membership messages (which is fair, because you're not well-behaved)

And then you can re-join if there's capacity for you again (à la GCRA)?

which is not ideal because we're not looking at the payload

Are we guarding against stale messages being sent over and over, or just too much gossip from one peer in general? If it's the former, could we do a git_has pre-emptively as well as check the limit?

kim commented 3 years ago

And then you can re-join

Depends on what the other end does when it gets disconnected. It could also immediately re-connect, flood again, get limited, and get disconnected.

Are we guarding against stale messages being sent over and over or just too much gossip from one peer in general?

I don't know, because I haven't looked at what those folks are sending.

do a git_has pre-emptively

We do that. It's expensive.

FintanH commented 3 years ago

I don't know, because I haven't looked at what those folks are sending.

Sorry, I meant if the goal was to stop general gossip flooding or just stale messages.

Thanks for the explanations and links :)

kim commented 3 years ago

I meant if the goal was to stop general gossip flooding or just stale messages

Well, it doesn't matter if you think about it: if you receive the same message again, it's stale (i.e. you don't forward it). If you receive a lot of different (but valid) messages, then you need to somehow compensate for the load. Unless we add PoW to messages, the best we can do is to avoid I/O by discarding already-seen messages earlier.

kim commented 3 years ago

Pending some history hygiene, this is now ready for another round.

Summary of changes:

kim commented 3 years ago

FYI: the tests suddenly hanging again seems to have to do with some unfortunate constellation of cosmic radiation or something. I can reproduce on master, so it is not introduced by this series.

kim commented 3 years ago

Pushed to prod, no peers were harmed in the process.

I doubt that. There is a failing test test::integration::daemon::replication::track_peer which apparently relies on issuing a Want having the side-effect of cloning a project. That obviously doesn't work, because the tracking peer will not recognise the Have.

kim commented 3 years ago

Or well, no, scrap that. But the test fails nevertheless.

kim commented 3 years ago

The test passes when waiting for a replicate event twice. I have no clue how that can be, perhaps something introduced in #689?

The remaining test failures are very fun: apparently libgit2 doesn't clear its ref cache fast enough.