waku-org / pm

Project management, admin, misc
[Milestone] Waku Network Can Support 10K Users #12

Closed jm-clius closed 8 months ago

jm-clius commented 1 year ago

Priority Tracks: Secure Scalability
Due date: 31 May 2023
Milestone: https://github.com/waku-org/pm/milestone/5

Note: this deadline assumes that the target of 1 million users by end-June 2023 can rely, for the most part, on the solutions designed for the problem space defined below.

Summary

Tasks / Epics


Extracted questions

Network requirements

Note: this gathers the minimal set of requirements the Waku network must meet to support Status Communities scaling to 10K users. It does not propose a design.

1. Message Delivery and Sharding

Note: this section, especially, depends on the app-defined minimum requirements for user experience. E.g. the app knows what (sub)set of messages is necessary "for a consistent experience", and this will feed into a pubsub topic, content topic and sharding design that does not compromise on UX. This process should also define when messages should be received "live" (relay) or opportunistically via history queries (store).

  1. Nodes should be able to receive (via relay or store) all community messages of the community they're part of.
  2. Nodes should receive live (via relay) all chat messages that are necessary for a consistent experience. A chat message is content sent by users in a community channel, a 1:1 chat or a private group.
  3. Nodes should receive live (via relay) all control messages that are necessary for a consistent experience. Control messages are mostly used for community purposes, with some for 1:1 chats and private groups (e.g. online presence and X3DH bundle).
  4. Each community can utilize a single or multiple shards for control and community messages, as long as requirements (1) - (3) still hold.
  5. Nodes should participate in shards in such a way that resource usage (especially bandwidth) is minimized, while requirements (1) - (3) still hold.
  6. Peer and connection management should be sufficient to allow nodes to maintain a healthy set of connections within each shard they participate in.

Assumptions:

2. Discovery

  1. Nodes should be able to discover peers within each shard they're interested in.
  2. Discovery method(s) can operate within a single or multiple shards, as long as:
    • requirement (1) still holds
    • nodes can bootstrap the chosen discovery method(s) for shards they're interested in
    • the chosen discovery method(s) does not add an unreasonable resource burden on nodes, especially if this method is shared between shards

Assumptions:

3. Bootstrapping

  1. Nodes should be able to initiate connection to bootstrap nodes within the shards they're interested in.
  2. Bootstrap nodes can serve a single or multiple shards, as long as they can handle the added resource burden.

Assumptions:

4. Store nodes (Waku Archive)

  1. Nodes should be able to find capable store nodes and query history within the shards they're interested in.
  2. Store nodes can serve a single or multiple shards, as long as:
    • they can handle the query rate and resource burden
    • they are discoverable as stated in requirement (1)

Assumptions:

5. Security:

  1. Community members should not be vulnerable to simple DoS/spam attacks as defined in (3) and (4) below.
  2. Each community should be unaffected by failures and DoS/spam attacks in other communities. This implies some isolation/sharding in the messaging domain.
  3. Store/Archive:
    • store nodes for a community should only archive messages actually originating from the community
    • store nodes for a community should not be vulnerable to being taken down by a high rate of history queries
  4. Relay:
    • community relay nodes should only relay messages actually originating from the community.

Assumptions:

Other requirements

Note: this gathers the minimal set of requirements outside the Waku network (e.g. operational, testing, etc.) to support Status Communities scaling to 10K users.

1. Kurtosis network testing

A simulation framework and initial set of tests that can approximate:

2. Community Protocol hardening

The Community Chat Protocols specifications are moved to Vac RFC repo.

3. Nwaku integration testing

Nwaku requires integration testing and automated regression testing for releases to improve trust in stability of each release.

4. Fleet ownership

Ownership of the infrastructure provided to Status communities should be established. This may require training and a transfer of responsibilities, which currently lie mostly de facto with the nwaku team. Fleet ownership comprises the responsibility for:

Initial work

The requirements above will lead to a design and task breakdown. Roughly the order of work:

Ownership for all three items below is shared between Vac, Waku and Status teams:

(1) Agree on requirements above as the complete and minimal set to achieve the 10K scaling goal.
(2) A viable, KISS network design adhering to "Network requirements".
(3) Task breakdown of each item and ownership assignment.

corpetty commented 1 year ago

tagging Testing team: @AlbertoSoutullo, @0xFugue, @Daimakaimura

jm-clius commented 1 year ago

Achieving network requirements: tasks and ownership

NB: requirements and tasks may change as we encounter unknowns. The task breakdown below assumes that nothing more has to be done for Discovery and Bootstrapping other than proper configuration (to be described in the Scaling Strategy BCP).

1. Verify scaling target requirements

Understand the expected:

for 10K community users.

Note: this is not necessarily an analytical exercise, but rather ballpark figures and a sanity check of current Status Community message rates. @Menduist has analyzed message rates in large Discord servers to get a rough estimate of what we would expect to see for Status Communities. However, analysis of an existing Status Community shows a significantly higher message rate and bandwidth usage. See conversation.

Tracked in: ?? Owners:

2. Community sharding plan

Sharding strategy for Waku relay in general and Status Communities specifically. This plan will consider short term and longer term strategies. This item is set out in more detail in @kaiserd's Secure Scaling Roadmap.

Tracked in: https://github.com/vacp2p/research/issues/154 Owners:

3. Simple Waku Relay DoS mitigation

Strategy and implementation to protect relay and store against simple DoS attack vectors. This item is set out in more detail in @kaiserd's Secure Scaling Roadmap.

Tracked in: https://github.com/vacp2p/research/issues/164 Owners:

4. Scalable storage: nwaku archive PostgreSQL implementation

Already part of https://github.com/waku-org/pm/issues/8 but repeated here for completeness. Note that this includes work to allow concurrent queries.

Tracked in: https://github.com/waku-org/pm/issues/4 Owners:

5. Scalable storage: deterministic message ID

Tracked in: https://github.com/vacp2p/rfc/issues/563 Owners:
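
A minimal sketch of what a deterministic message ID makes possible on the store side: the same message always maps to the same ID, so an archive can drop duplicates on insert. The field selection, the length-prefixed encoding and the `Archive` class are assumptions made for illustration, not the definition tracked in the RFC issue above.

```python
import hashlib


def message_id(pubsub_topic: str, content_topic: str,
               payload: bytes, timestamp_ns: int) -> str:
    """Derive an ID from message fields only (assumed field set)."""
    digest = hashlib.sha256()
    fields = (
        pubsub_topic.encode("utf-8"),
        content_topic.encode("utf-8"),
        payload,
        timestamp_ns.to_bytes(8, "big"),
    )
    for field in fields:
        # Length prefix keeps ("ab", "c") distinct from ("a", "bc").
        digest.update(len(field).to_bytes(4, "big"))
        digest.update(field)
    return digest.hexdigest()


class Archive:
    """Hypothetical in-memory archive keyed by message ID."""

    def __init__(self):
        self._messages = {}

    def insert(self, pubsub_topic: str, content_topic: str,
               payload: bytes, timestamp_ns: int) -> bool:
        """Store the message; return False if it was already archived."""
        mid = message_id(pubsub_topic, content_topic, payload, timestamp_ns)
        if mid in self._messages:
            return False
        self._messages[mid] = payload
        return True
```

Because the ID depends only on message content, two store nodes that receive the same message independently would compute the same key without coordinating.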

6. Scalable storage: testing store at scale

Basic testing to see that PostgreSQL implementation works at expected message and query rates. (Note this is in addition to simulation with Kurtosis).

Tracked in: ?? Owner: @LNSD

7. Filter and lightpush improvements

Revising the RFCs and implementations in nwaku and go-waku. Already part of https://github.com/waku-org/pm/issues/8 but repeated here for completeness.

Tracked in: https://github.com/waku-org/pm/issues/5 Owners:

8. Peer management strategy

RFC for basic peer management strategy and implementations in nwaku and go-waku.

Tracked in: https://github.com/waku-org/nwaku/issues/1353 Owners:

9. Combine into comprehensive scaling strategy

This can be seen as the final goal for all the moving parts and separate tasks listed above. Output will likely take the form of one or more Best Current Practices RFCs that focus on the Status 10K use case. It will bring together the short term strategies for sharding, DoS mitigation, bootstrapping, discovery and store configuration. It may include suggestions on when to use lightpush and filter rather than relay.

Tracked in: https://github.com/vacp2p/research/issues/165 Owners:

10. Targeted dogfooding

This is in addition to simulation with Kurtosis. Individual owners of each task will be responsible for testing and dogfooding their strategies/features. This task ensures that we have considered each item for targeted network testing, including:

Tracked in: ?? Owner: @jm-clius

11. New multiaddrs discovery: libp2p rendezvous

Although it is possible to encode multiaddrs in ENRs, which are currently exchanged by all existing discovery methods, ENRs are limited in size and consequently cannot contain more than one or two multiaddrs. We need a discovery method more suitable for multiaddrs. We have chosen libp2p rendezvous as the solution here.

Tracked in: https://github.com/vacp2p/research/issues/176 Owners:
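
The size constraint can be illustrated with some rough arithmetic. EIP-778 caps an encoded ENR at 300 bytes. The base record size, the per-entry overhead and the multiaddr sizes below are assumed values chosen only to show the shape of the problem; they are not measurements of real Waku records.

```python
ENR_MAX_BYTES = 300  # maximum encoded record size per EIP-778

# Assumed for illustration: signature, sequence number, public key,
# IP and port fields already present in the record.
BASE_RECORD_BYTES = 150
# Assumed per-entry encoding overhead (e.g. a length prefix).
ENTRY_OVERHEAD_BYTES = 2


def multiaddrs_that_fit(multiaddr_sizes, base=BASE_RECORD_BYTES):
    """Count how many multiaddrs fit, in order, before the cap is hit."""
    used = base
    count = 0
    for size in multiaddr_sizes:
        cost = size + ENTRY_OVERHEAD_BYTES
        if used + cost > ENR_MAX_BYTES:
            break
        used += cost
        count += 1
    return count


# Three hypothetical 70-byte multiaddrs (e.g. DNS name + port + peer ID).
fitting = multiaddrs_that_fit([70, 70, 70])
```

Under these assumptions only part of a node's address list fits in the record, which is the motivation for a discovery method that exchanges multiaddrs directly.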

12. Waku static sharding implementation

This is an outflow of the Community sharding plan as specified by @kaiserd and covers the implementation portion, including configuration and enabling shard discovery via ENRs.

Owners:
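
As a rough illustration of the configuration side of static sharding, the sketch below maps communities to a fixed set of shards and derives the pubsub topics a node would subscribe to. It assumes the `/waku/2/rs/<cluster_id>/<shard_number>` topic naming from the Waku relay sharding RFC; the cluster ID, shard numbers and community names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    cluster_id: int
    index: int

    def pubsub_topic(self) -> str:
        # Assumed static sharding topic format.
        return f"/waku/2/rs/{self.cluster_id}/{self.index}"


# Hypothetical static assignment: a community may use one or more shards.
COMMUNITY_SHARDS = {
    "community-a": [Shard(99, 1)],
    "community-b": [Shard(99, 2), Shard(99, 3)],
}


def topics_to_subscribe(communities):
    """Return the sorted pubsub topics covering the given communities."""
    topics = set()
    for community in communities:
        for shard in COMMUNITY_SHARDS[community]:
            topics.add(shard.pubsub_topic())
    return sorted(topics)
```

A node that is a member of only `community-a` would subscribe to a single topic, which keeps its relay bandwidth limited to that community's shard.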

jm-clius commented 1 year ago

Achieving other requirements: tasks and breakdown

1. Wakurtosis: first network test

This is described in https://github.com/logos-co/wakurtosis/issues/7

It covers testing the scalability of the relay protocol, specifically measuring:

Owner:

2. Wakurtosis: analyze first test results

This step will either confirm our (positive) assumptions about relay scalability or highlight bottlenecks/bugs in the protocol or implementations, which must be addressed and considered in the overall network roadmap.

Owners:

4. Wakurtosis: plan next tests

This is a collaborative task flowing from the results of the first test to refine the simulation(s) and plan the next, most useful tests.

Owners:

5. Community Protocol: move to Vac RFC repo

This is an administrative step. It may require updating the RFC to match the latest implementation, moving sections around, etc.

Owner:

6. Community Protocol: review protocols

Grasping the content of each protocol and how it maps to real-world Waku network traffic. This is potentially an involved task, so the scope should be minimized for this MVP. This relates to Verify scaling target requirements under the Network Requirements.

Owner:

7. Nwaku hardening: Wakurtosis sandbox machine

Provisioning one or more performant machines that the dev team can use for sandbox-testing features with ad-hoc Wakurtosis deployments.

Owners:

8. Nwaku hardening: Wakurtosis integration testing

Integration test environment for nwaku. Most likely it will take the form of a pipeline that deploys a Wakurtosis network topology and runs a series of scripted integration tests for nwaku.

Owners:

9. Nwaku hardening: release automation

Automated release pipeline for nwaku that builds a release, compiles release notes and publishes release binaries and a tagged Docker image for the most common OSs/architectures.

Tracked in: https://github.com/waku-org/nwaku/issues/611 Owners:

10. Fleet ownership: set requirements

Create a document that summarizes all the common tasks that a fleet owner generally has to do, including deployment, monitoring and debugging. This will also allow us to communicate to other platforms planning on deploying their own Waku fleets what they need to consider. The document should include a section on what Status fleet ownership specifically entails, including a procedure to log and escalate bugs/network anomalies.

Owner: @jm-clius

11. Fleet ownership: training

Based on the requirements determined above, determine who will take ownership of the Status fleets and schedule training sessions.

Owner:

LNSD commented 1 year ago

5. Scalable storage: deterministic message ID

Tracked in: https://github.com/vacp2p/rfc/issues/563 Owners:

waku (protocol): @LNSD
nwaku (implementation): @LNSD
go-waku (implementation): @richard-ramos

Progress on this initiative, also known as Message Unique ID (MUID), is tracked in the following issue: waku-org/nwaku#1914

fryorcraken commented 1 year ago

Thoughts on current status:

  • [ ] 1. Verify scaling target requirements

Several discussions have happened. The outputs I am aware of are:

@jm-clius @richard-ramos did we have more to this?

This can be closed as static sharding was delivered. The quoted issue also tracks for 1mil.

https://github.com/vacp2p/research/issues/164#issuecomment-1672531792

https://github.com/waku-org/nwaku/issues/1888#issuecomment-1672537221

This needs clean-up. Implementation of MUID to avoid duplicates in store is done, which was the main reason to do it for 10k. Moving forward, we could use MUID for gossipsub seen-message logic; is that something we need for 1mil?

Then, MUID is possibly going to be used for Distributed store.

@jm-clius please confirm

  • [ ] 6. Scalable Storage: testing store at scale

https://github.com/vacp2p/research/issues/191#issuecomment-1672542165

@jm-clius were we thinking DST simulation for this?

https://github.com/waku-org/pm/issues/5#issuecomment-1672547298

https://github.com/waku-org/nwaku/issues/1353#issuecomment-1672547801

@jm-clius this seems done. Not sure if we tracked an output somewhere?

  • [ ] 10. Targeted dogfooding

I suggest descoping this from Waku work. By delivering this milestone we enable Status to integrate Waku tech and start dogfooding. We are tracking hardening of Waku protocols as part of https://github.com/waku-org/research/issues/3 with 2.1

https://github.com/vacp2p/research/issues/176#issuecomment-1672550555

  • [ ] 12. Waku static sharding implementation

Done. What issue tracked the work/output? @jm-clius

  • [ ] Setup staging fleet with static sharding for Status dogfooding

Last remaining task. Are we tracking somewhere @jm-clius ? edit: is this it? https://github.com/status-im/status-go/issues/3528

  • [ ] Specify fleet ownership requirements to enable Status team to maintain own fleet

The other last remaining task. Are we tracking somewhere @jm-clius ?

jm-clius commented 1 year ago

Thanks for revising, @fryorcraken. See my comments below.

Several discussions have happened. The outputs I am aware of are: https://github.com/vacp2p/research/issues/177 @jm-clius @richard-ramos did we have more to this?

Afaik many of the suggestions have been implemented or are in the process of being implemented, also in status-go. @richard-ramos may have a better idea of the current status. Perhaps the work that's being done in status-go should be tracked there, which would mean the Waku side can be closed?

This can be closed as static sharding was delivered. The quoted issue also tracks for 1mil.

I agree.

This needs clean-up. Implementation of MUID to avoid duplicates in store is done, which was the main reason to do it for 10k. Moving forward, we could use MUID for gossipsub seen-message logic; is that something we need for 1mil? Then, MUID is possibly going to be used for Distributed store.

Yes, I would close https://github.com/vacp2p/rfc/issues/563 as the only issue really needed for the 10K milestone. We also don't need to do anything else for the 1 million milestone, but we can keep https://github.com/waku-org/pm/issues/9 open to track the work that would be necessary for the distributed store.

https://github.com/vacp2p/research/issues/191#issuecomment-1672542165

@jm-clius were we thinking DST simulation for this?

Initially, yes. But I think a reasonable step for the 10K epic would be (a) dogfooding and (b) local stress-testing of PostgreSQL.

  9. Combine into comprehensive scaling strategy https://github.com/vacp2p/research/issues/165 @jm-clius this seems done. Not sure if we tracked an output somewhere?

Yes, I've gone ahead and closed the issue. The output here was just moving the RFCs to vac repo and revising them.

  12. Waku static sharding implementation Done. What issue tracked the work/output? @jm-clius

Main tracking issue was: https://github.com/waku-org/pm/issues/15 which I think can just be closed. There were also tracking issues in nwaku (and probably go-waku/js-waku).

Setup staging fleet with static sharding for Status dogfooding Last remaining task. Are we tracking somewhere @jm-clius ? edit: is this it? https://github.com/status-im/status-go/issues/3528

No, the first fleet that can be used for initial tests/dogfooding is tracked here: https://github.com/status-im/infra-waku/issues/1 Since this fleet has been deployed, this issue can probably be closed. This is not quite a staging fleet for Status yet; I'll link that to the issue I create for the Status fleet requirements below.

Specify fleet ownership requirements to enable Status team to maintain own fleet The other last remaining task. Are we tracking somewhere @jm-clius ?

It is now: https://github.com/waku-org/pm/issues/61 Not a very detailed issue, but should do the trick. :)

richard-ramos commented 1 year ago

I think the suggestions from https://github.com/vacp2p/research/issues/177 have not been implemented, or at least I could not find them in the status-go code.

fryorcraken commented 1 year ago

@jm-clius https://github.com/status-im/infra-waku/issues/1 tracks "auto-sharding". I assume you mean it can also be used for static sharding dogfooding.

fryorcraken commented 1 year ago

Weekly Update

All software has been delivered. Pending items are:

fryorcraken commented 1 year ago

Monthly Update

Staging fleet for Status (static sharding + Postgres) has been defined and handed over to infra: https://github.com/waku-org/nwaku/issues/1914 Stress testing of PostgreSQL is in progress: INSERT is done, SELECT is in progress.

fryorcraken commented 11 months ago

1k nodes simulation blogpost: https://github.com/vacp2p/vac.dev/pull/123/

fryorcraken commented 10 months ago

Weekly Update

fryorcraken commented 10 months ago

Weekly Update

jm-clius commented 10 months ago

Weekly Update

jm-clius commented 10 months ago

Weekly Update

jm-clius commented 10 months ago

Weekly Update

jm-clius commented 9 months ago

Weekly Update

fryorcraken commented 9 months ago

We will run one more week of internal dogfooding of static sharding + PostgreSQL in Status Communities. Once done, and if no new issues are found, we will close this issue.

The go-waku and waku chat sdk teams will continue to support Status with their integration of Waku v2, but no major effort is scheduled in terms of software development and testing.

jm-clius commented 9 months ago

Weekly Update

fryorcraken commented 8 months ago

https://github.com/waku-org/pm/issues/97 is now done. Status QA is proceeding with testing. Most changes are now focused on status-go with ad hoc bug/issue investigation from Waku team. This Milestone can now be closed :tada: