status-im / swarms

Swarm Home. New, completed and in-progress features for Status

Testing cluster #51

divan closed this issue 6 years ago

divan commented 6 years ago

Preamble

Idea: 51-test-cluster
Title: Testing cluster
Status: Draft
Created: 2017-11-29

Summary

Provision a test cluster of Status nodes running a simulation of real user behavior. Set up high-level metrics monitoring and track changes between releases.

Vision

The idea stems from https://github.com/status-im/ideas/issues/22 (tools for diagnosing performance regressions). One of the main challenges there is simulating real-world load, and currently we have no way to do this. Analyzing performance on a single device is also prone to inaccurate results due to the high variability of hardware, software running in the background and other conditions. We also have no easy way to gather the metrics we want from devices.

This leads to the idea of provisioning a cluster consisting of nodes (status-go, real devices or both), including boot nodes. The cluster may run on its own test network or on an existing test network (Ropsten). Each node in the cluster shall be instrumented and configured for metrics collection. Infrastructure for metrics gathering, storage and display should be set up.

Using graph visualization tools (like Grafana) it'd be possible to see statistically sound performance measurements, tie performance changes to specific releases/versions and easily identify regressions.


Think of this cluster as a Status network playground, where you can deploy, say, 30% of nodes with a new change and easily see the difference in performance metrics against the stable version. It also enables further possibilities for data gathering and exploration. Example: by collecting stats about each incoming and outgoing Whisper message, we can visualize Whisper protocol behavior, which may help build intuition around it and help debug/develop future versions of the protocol.

Swarm Participants

Requirements

Goals & Implementation Plan

Implementation of this idea has three roughly independent parts that need to be researched, designed and implemented:

Cluster infrastructure

This part should start by evaluating the viable size of the cluster we want to have: 50 nodes, 100, 1000, dynamic? Then, which nodes the cluster should consist of: only status-go nodes, real devices/simulators, or both.

Then find the best software solution for that. This part requires an understanding of the Ethereum discovery process. Solutions like Docker Swarm might be enough, but we may want to simulate real network topology, for which we'd need specialized simulators like Mininet. Each node should probably be isolated using containers, though alternatives can of course be evaluated. It's unlikely that the cluster can run on a modern laptop (it would be awesome though), so a cloud provider should be chosen, whichever is easiest to work with (AWS/GCP/DO, I guess).

Once the vision of what the cluster should look like is clear, provisioning scripts and tools should be designed and implemented to be developer-friendly, with a high level of automation (again, Terraform is probably the right way to go). Ideally, we should be able to deploy as many identical clusters as we wish without any hassle.

If the cluster runs on a private network, it should set up its own bootnodes as well.

Metrics

As the main purpose of having a test cluster is to gather data and observe behavior at scale, the code needs to be instrumented to provide those metrics to the metrics collection infrastructure. Here we have two connected parts: code instrumentation and setting up the metrics collection infrastructure.

Metrics instrumentation

Developers might want to add custom metrics apart from the obvious things to measure (CPU, memory, I/O stats, etc.). Go code would probably want to report the number of goroutines, garbage collection stats, etc., plus many custom things like the number of Jail cells, incoming and outgoing RPC requests, etc.

The task here is to make code instrumentation as friendly to the developer as possible: it should be easy to add and test new metrics with a minimal learning curve. One example of such an easy approach is the expvar Go stdlib package, which might work perfectly for the pull model of metrics. Which model to use (pull/push) is a subject for investigation.
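For illustration, a minimal sketch of what expvar-based instrumentation could look like; the metric names and port below are made up for the example, not existing status-go metrics:

```go
// Sketch of pull-model metrics using the expvar stdlib package.
package main

import (
	"expvar"
	"log"
	"net/http"
)

// Hypothetical counters a developer might register.
var (
	jailCells     = expvar.NewInt("jail_cells")
	rpcRequestsIn = expvar.NewInt("rpc_requests_in")
)

func handleRPCRequest() {
	rpcRequestsIn.Add(1) // a metric update is a single line
	// ... actual request handling ...
}

func main() {
	// Importing expvar registers a /debug/vars handler on the
	// default mux; a collector can poll it for a JSON snapshot
	// of all registered variables.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```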

Finally, the instrumented code should not go into production. This can be implemented via build tags, or simply by substituting a dummy NoOp metrics sender, which doesn't change the resulting binary code.
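A sketch of the build-tag approach; the file layout and function names here are hypothetical, not an existing status-go convention. The default build compiles an empty body the compiler can inline away:

```go
// metrics_noop.go: compiled by default, instrumentation costs nothing.
//go:build !metrics

package metrics

func IncCounter(name string) {}
```

```go
// metrics_real.go: compiled only with `go build -tags metrics`.
//go:build metrics

package metrics

import (
	"expvar"
	"sync"
)

var (
	mu       sync.Mutex
	counters = map[string]*expvar.Int{}
)

// IncCounter lazily registers an expvar counter and increments it.
func IncCounter(name string) {
	mu.Lock()
	defer mu.Unlock()
	c, ok := counters[name]
	if !ok {
		c = expvar.NewInt(name)
		counters[name] = c
	}
	c.Add(1)
}
```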

Metrics infrastructure

This infrastructure should be a part of the cluster deployment, so if there are many clusters, each has its own metrics dashboard and tooling. Essentially it involves metrics collection code, storage (for some period of time) and visualization software. There is currently a lot of software to choose from, including Prometheus and Grafana, so the best tools should be chosen here.

Then deployment scripts and code should be implemented. Ideally, it should require (almost) zero configuration on the nodes.
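If Prometheus were chosen, the node side could indeed be close to zero configuration: each node only exposes a /metrics endpoint and the cluster-level Prometheus server scrapes it. A sketch using the official Go client (the counter name is illustrative, not an existing metric):

```go
// Sketch of exposing Prometheus metrics from a node.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var whisperEnvelopes = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "whisper_envelopes_total",
	Help: "Whisper envelopes seen by this node.",
})

func main() {
	prometheus.MustRegister(whisperEnvelopes)
	// The cluster-level Prometheus server scrapes this endpoint;
	// the node needs no knowledge of the collector at all.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```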

Usage simulation

This part consists of developing ways to automate user interaction with a Status node and researching real-world user behavior. The first is more or less simple: provide an API to talk to the node and make it do things (send messages, create chats, use dApps, send money, etc.). The second is trickier, because it effectively means simulating a whole economy and human behavior: the simulation code has to decide who sends messages to whom, how often, how much money to send, how to use dApps, etc.

Obviously, a perfect real-world simulation is unlikely to be achieved; we just need the simulation to have two properties:

Each simulation agent could be independent or controlled by a single node in the cluster; which approach is better is subject to investigation. A rough sketch of an agent's driver loop follows.
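In this sketch, NodeClient is a hypothetical placeholder for whatever API the Status node ends up exposing; the behavior model (random peers, exponentially distributed think time) is a crude stand-in for a real one:

```go
// Hypothetical simulation agent driving a Status node.
package simulation

import (
	"log"
	"math/rand"
	"time"
)

// NodeClient is a placeholder for the eventual node API.
type NodeClient interface {
	SendMessage(to, body string) error
}

// RunAgent messages random peers at exponentially distributed
// intervals, crudely mimicking bursty human activity.
func RunAgent(c NodeClient, peers []string) {
	for {
		to := peers[rand.Intn(len(peers))]
		if err := c.SendMessage(to, "hi from the simulator"); err != nil {
			log.Printf("send to %s failed: %v", to, err)
		}
		think := time.Duration(rand.ExpFloat64() * float64(10*time.Second))
		time.Sleep(think)
	}
}
```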

Minimum Viable Product

MVP should consist of:

Iteration N.1

Iteration N.2

Supporting Role Communication

Copyright

Copyright and related rights waived via CC0.

themue commented 6 years ago

Good approach for near-real-world testing. I would start with outlining the requirements for metrics instrumentation and infrastructure. This way we can focus on the right set for reaching the MVP. Scaling later shouldn't be a problem.

themue commented 6 years ago

Puppeth looks like a nice toolset for quickly setting up testing clusters. Currently digging deeper into it, but there's a lot of inspiration for the tasks here.

divan commented 6 years ago

One of the questions to explore is how many nodes we want in a cluster, so that it satisfies the following properties:

At least an order of magnitude: 10, 100, 1000, more?

I thought that if a number below 50 is sufficient, it would be possible to buy that many old Android phones (I saw a Galaxy S4 for less than $60 on Amazon) and set up a cluster of real devices running Status, collecting metrics directly from them.

themue commented 6 years ago

I've seen these kinds of racks used for app ratings by bots. :grin: Neat idea, as it is more realistic than running on AWS. Sadly, I have no clue yet how complex this would be to control or how to collect metrics there; I've worked with clouds for several years instead.

antdanchenko commented 6 years ago

User behavior simulation:

Infrastructure for users behavior simulation:

Minimum Viable Product:

Requirements:

adambabik commented 6 years ago

I am ready to pledge 40h/week for this idea.

adambabik commented 6 years ago

Work is tracked in this project: https://github.com/orgs/status-im/projects/6

oskarth commented 6 years ago

Seems like work has already started on this, great! It looks like it is still in draft mode. It would be good if we could keep this issue up to date, in line with https://wiki.status.im/Status_Organisational_Design

Some questions:

  1. Who is the tester and evaluator?
  2. Who is the other contributor?
  3. Any other roles needed for the swarm?
  4. When is the MVP due (says Christmas but no update here, and still draft, so assuming this didn't happen)?

@adambabik @divan

naghdy commented 6 years ago

Is this swarm still active? Does it have a specific goal to ship something? Feel free to re-open if I am mistakenly closing it.