waku-org / pm

Project management, admin, misc

[Deliverable] DOS protection for req-res protocols and metrics #66

Open chair28980 opened 1 year ago

chair28980 commented 1 year ago

Project: https://github.com/orgs/waku-org/projects/11/views/1

Summary

Current minimum scope to implement:

- Bandwidth measurement and metrics for per-shard traffic (see the sketch below).
- DoS protection of service nodes by applying request rate limits on non-relay protocols.
  - This also places a limited form of bandwidth limitation on those protocols.
- Provide a failsafe mechanism for third-party apps / client-side help for request rejection mechanisms.

**Descoped from this milestone:** As the autosharded public network grows and traffic increases per shard, we want to provide bandwidth management mechanisms for relay nodes to dynamically choose the number of shards they support based on bandwidth availability. For example, when the network launches, it is reasonable for relay nodes to support all shards and gradually unsubscribe from shards as bandwidth usage increases. The minimum number of shards to support would be 1, so the network design and DoS mechanisms (see Track 3) would have to provide predictable limits on max bandwidth per shard. We could also envision a bandwidth protection mechanism that drops messages over a threshold, but this would affect the node's scoring and so should be carefully planned.

# Epics

- [x] #118
- [x] #117
- [ ] https://github.com/waku-org/pm/issues/166
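As a rough illustration of the first scope item (per-shard bandwidth metrics), here is a minimal sketch in Go using the Prometheus client library. The metric and label names are hypothetical and are not the ones actually exposed by nwaku or go-waku.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// bandwidthPerShard counts bytes of relay traffic per shard. The metric name
// and label are illustrative only, not the ones exposed by nwaku or go-waku.
var bandwidthPerShard = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "waku_shard_bandwidth_bytes_total",
		Help: "Total bytes of relay traffic observed, labelled by shard.",
	},
	[]string{"shard"},
)

func init() {
	prometheus.MustRegister(bandwidthPerShard)
}

// RecordMessage adds a message's size to the counter for its shard, so that
// per-shard bandwidth can be derived with a rate() query on the scrape side.
func RecordMessage(shard string, sizeBytes int) {
	bandwidthPerShard.WithLabelValues(shard).Add(float64(sizeBytes))
}
```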
fryorcraken commented 10 months ago

I can see that all issues tagged on this are "not critical for launch" except for *feat: Validation mechanism to limit "free traffic" on the network*, for which I understand there is some uncertainty:

Would be good to review this epic and see if we should postpone it. Or even include it in Gen 0? Or at least, focus on docs for operators with https://github.com/waku-org/nwaku/issues/1946.

cc @jm-clius @alrevuelta @vpavlin

chair28980 commented 9 months ago

Issues de-scoped from Gen 0 milestone:

I propose that we descope the effort to provide a "free tier" of bandwidth in the network for now (part of Epic 1.3: Node bandwidth mechanism). This would have allowed up to ~1 Mbps of messages without RLN proofs (i.e. publishers don't require RLN memberships), theoretically making it easier for early adopters to trial the tech. However, based on discussions we've had in the meantime and the fundamental unreliability of such a mechanism, I propose we descope/deprioritise work related to this and continue designing the network around mandatory RLN memberships. Let me know if you have strong objections or ideas.

cc @jm-clius

https://github.com/waku-org/nwaku/issues/1938 https://github.com/waku-org/js-waku/issues/1503 https://github.com/waku-org/go-waku/issues/677

Ivansete-status commented 8 months ago

Weekly Update

chair28980 commented 6 months ago

Scope signed off during the EU-NA PM meeting of 2024-02-19.

fryorcraken commented 1 month ago

This took longer than expected. Any specific reasons @Ivansete-status? What is the status? Has dogfooding started, or will it be done with 0.31.0?

From a client PoV, meaning that a service may reject a request due to reaching the rate limit, is this handled? @richard-ramos @chaitanyaprem for go-waku, @weboko for js-waku.

NagyZoltanPeter commented 1 month ago

> This took longer than expected. Any specific reasons @Ivansete-status? What is the status? Has dogfooding started, or will it be done with 0.31.0?
>
> From a client PoV, meaning that a service may reject a request due to reaching the rate limit, is this handled? @richard-ramos @chaitanyaprem for go-waku, @weboko for js-waku.

@fryorcraken, @Ivansete-status: Yes, it was dependent on me; there is no specific reason related to the feature itself, only other tasks caused a bit of distraction. Several phases and redesigns were completed in the meantime. Yes, the full feature is part of the 0.31.0 release.

weboko commented 1 month ago

Initial work on the js-waku side was done by handling more error codes and upgrading the API.

> a service may reject a request due to reaching the rate limit, is this handled

We intend to address it as part of req-res reliability with https://github.com/waku-org/js-waku/issues/2054.
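To make this concrete, here is a minimal sketch of what such client-side handling could look like (hypothetical names, written in Go for illustration; this is not the actual js-waku or go-waku API): try another service peer, backing off when one reports that its rate limit was reached.

```go
package client

import (
	"context"
	"errors"
	"time"
)

// ErrRateLimited stands in for the error a service node returns when its
// request rate limit has been reached. The name is hypothetical.
var ErrRateLimited = errors.New("request rejected: rate limit reached")

// publishFn represents a single lightpush request against one service peer.
type publishFn func(ctx context.Context, peer string, payload []byte) error

// PublishWithFallback tries each candidate service peer in turn, backing off
// briefly when a peer rejects the request because of its rate limit.
func PublishWithFallback(ctx context.Context, publish publishFn, peers []string, payload []byte) error {
	var lastErr error
	for _, peer := range peers {
		err := publish(ctx, peer, payload)
		if err == nil {
			return nil
		}
		lastErr = err
		if errors.Is(err, ErrRateLimited) {
			// Give the rate-limited service node some room before moving on.
			select {
			case <-time.After(500 * time.Millisecond):
			case <-ctx.Done():
				return ctx.Err()
			}
		}
		// For other errors, try the next peer immediately.
	}
	return lastErr
}
```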

@NagyZoltanPeter, is there a task for upgrading lightPush on the nwaku side?

NagyZoltanPeter commented 1 month ago

> Initial work on the js-waku side was done by handling more error codes and upgrading the API.
>
> > a service may reject a request due to reaching the rate limit, is this handled
>
> We intend to address it as part of req-res reliability with waku-org/js-waku#2054.
>
> @NagyZoltanPeter, is there a task for upgrading lightPush on the nwaku side?

Regarding what exactly? Do you mean the new protocol definition? This one: https://github.com/waku-org/nwaku/issues/2722

fryorcraken commented 4 days ago

Talking with @NagyZoltanPeter, can we please clarify the roll-out strategy?

I understand that filter and light push rate limits are already deployed on status.prod because they are set by default.

Store is not set up and is actually the more difficult one, as light clients can use Status Desktop nodes for light push and filter services.

For store, there are a few things to take into consideration:

NagyZoltanPeter commented 4 days ago

> Talking with @NagyZoltanPeter, can we please clarify the roll-out strategy?
>
> I understand that filter and light push rate limits are already deployed on status.prod because they are set by default.
>
> Store is not set up and is actually the more difficult one, as light clients can use Status Desktop nodes for light push and filter services.
>
> For store, there are a few things to take into consideration:
>
> • I believe we noticed that one store node is more used than others. Not sure we understand why. cc @richard-ramos
> • There is less traffic on status.staging, so enabling the store rate limit there may not allow us to learn much about the impact.
> • There are several store nodes in status.prod, so we could enable the rate limit for one node, see the impact, and then enable it for the other nodes.
> • Waku Store performance is not yet resolved. If there is a specific resource that is getting starved, it would be interesting to use this rate limit to help offload the starved resource and improve the overall experience for users.

Sorry, I might not have been clear. Exactly: the nwaku filter protocol has a hard-coded rate limit applied (without any configuration). It is 30 requests per 1 minute for each subscriber peer.

For lightpush and store the default is no rate limit. We can currently apply one CLI config (it applies to both): `--request-rate-limit` and `--request-rate-period`.

If it turns out that we need different rate-limit settings for different protocols, we will need separate configuration options or to derive final values from the shared one.
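For illustration, here is a minimal sketch of the kind of per-peer, fixed-window limit described above, e.g. the filter protocol's 30 requests per 1 minute per subscriber peer. It is written in Go for readability; nwaku itself is in Nim, so this is not the actual implementation.

```go
package ratelimit

import (
	"sync"
	"time"
)

// PeerRateLimiter enforces a simple fixed-window limit per peer, mirroring
// the "N requests per period" semantics of --request-rate-limit and
// --request-rate-period.
type PeerRateLimiter struct {
	mu      sync.Mutex
	limit   int           // e.g. 30 requests
	period  time.Duration // e.g. 1 minute
	windows map[string]*window
}

type window struct {
	start time.Time
	count int
}

func NewPeerRateLimiter(limit int, period time.Duration) *PeerRateLimiter {
	return &PeerRateLimiter{limit: limit, period: period, windows: make(map[string]*window)}
}

// Allow reports whether a request from the given peer should be served. When
// it returns false the service node rejects the request, which is what the
// client then sees as a rate-limit rejection.
func (r *PeerRateLimiter) Allow(peerID string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()

	now := time.Now()
	w, ok := r.windows[peerID]
	if !ok || now.Sub(w.start) >= r.period {
		// Start a new window for this peer.
		r.windows[peerID] = &window{start: now, count: 1}
		return true
	}
	if w.count >= r.limit {
		return false
	}
	w.count++
	return true
}
```

An actual implementation may well use a token bucket instead of a fixed window; the point is only that the limit and period map onto something like `--request-rate-limit` and `--request-rate-period`.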

fryorcraken commented 3 days ago

Back to the roll-out strategy: we also want to monitor before/after and extract a good value to set from the current data.

richard-ramos commented 3 days ago

> I believe we noticed that one store node is more used than others. Not sure we understand why. cc @richard-ramos

It's probably due to the storenode selection heuristic used by the storenode cycle:

  1. We ping all storenodes
  2. Order them by reply time, lowest to highest
  3. Randomly choose a storenode from the fastest 25% (the first quartile of all storenode replies, ordered by RTT ascending).

Since we only have 6 storenodes in the fleet, the fastest 25% (rounded up) will always be the 2 geographically closest storenodes, so those tend to be chosen. Since most core contributors are located in Europe, this shows up in the data as a preference for the Amsterdam storenodes.
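For reference, a minimal sketch of this heuristic (hypothetical types, written in Go; not the actual status-go storenode cycle code):

```go
package storecycle

import (
	"math/rand"
	"sort"
	"time"
)

// PingResult pairs a storenode with its measured round-trip time.
type PingResult struct {
	Storenode string
	RTT       time.Duration
}

// SelectStorenode implements the heuristic described above: sort all ping
// results by RTT ascending, keep the fastest 25% (rounded up), and pick one
// of those at random.
func SelectStorenode(results []PingResult) (string, bool) {
	if len(results) == 0 {
		return "", false
	}
	sort.Slice(results, func(i, j int) bool {
		return results[i].RTT < results[j].RTT
	})
	// Fastest quartile, rounded up; with 6 storenodes this is always 2 nodes.
	k := (len(results) + 3) / 4
	return results[rand.Intn(k)].Storenode, true
}
```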

In my particular case, status-go will tend to prefer those in US Central.