status-im / status-console-client

Status messaging console user interface
Mozilla Public License 2.0

Basic simulation testing of data sync #61

Open oskarth opened 5 years ago

oskarth commented 5 years ago

(From https://github.com/status-im/bigbrother-specs/blob/master/data_sync/p2p-data-sync-mobile.md#simulation-1-1-1-chat-basic)

Simulation 1: 1-1 chat (basic)

Two nodes talking to each other, with 10% churn each. That is: a 10% probability of being online at any given time. When online, a node stays online for X time. For simplicity, let's assume X is 5 minutes. This can be a parameter and be varied down to connection windows as short as 30s.
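A minimal sketch of one way to realize this churn model (Go; parameter names and structure are illustrative, not taken from any existing Status tool):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// churnSchedule divides time into fixed windows of windowTicks ticks.
// In each window the node is online with probability pOnline, so the
// long-run uptime fraction is pOnline and each online stretch lasts
// one window (5 minutes with 30s ticks and windowTicks = 10).
func churnSchedule(totalTicks, windowTicks int, pOnline float64, rng *rand.Rand) []bool {
	online := make([]bool, totalTicks)
	for start := 0; start < totalTicks; start += windowTicks {
		if rng.Float64() >= pOnline {
			continue
		}
		for t := start; t < start+windowTicks && t < totalTicks; t++ {
			online[t] = true
		}
	}
	return online
}

func main() {
	rng := rand.New(rand.NewSource(time.Now().UnixNano()))
	// 24h of 30s ticks, 5-minute windows, 10% uptime.
	sched := churnSchedule(2880, 10, 0.1, rng)
	up := 0
	for _, on := range sched {
		if on {
			up++
		}
	}
	fmt.Printf("online %.1f%% of ticks\n", 100*float64(up)/float64(len(sched)))
}
```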

Send 5 messages each.

Answer the following questions:

What's the bandwidth overhead? Expressed as a multiplier of the 10 messages. E.g. if you try to send one message 3 times and get 1 ack, that's a x4 multiplier.

What's the latency? Expressed as ticks or absolute time until a node has received all messages sent by the other node. Alternatively: expressed as a distribution of average or median latency, along with P90 latency, i.e. what's the latency for the 90th-percentile message?
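For reference, a minimal sketch of how both metrics could be computed from recorded simulation events (Go; all names are hypothetical):

```go
package main

import (
	"fmt"
	"sort"
)

// overheadMultiplier: total packets put on the wire (sends + acks)
// divided by the number of useful messages, e.g. 3 sends + 1 ack
// for 1 message = x4.
func overheadMultiplier(packetsOnWire, usefulMessages int) float64 {
	return float64(packetsOnWire) / float64(usefulMessages)
}

// p90 returns the 90th-percentile value of per-message delivery
// latencies, measured in ticks.
func p90(latencies []int) int {
	sorted := append([]int(nil), latencies...)
	sort.Ints(sorted)
	rank := (len(sorted)*90 + 99) / 100 // ceil(0.9 * n), 1-based rank
	return sorted[rank-1]
}

func main() {
	fmt.Println(overheadMultiplier(4, 1))                   // 4
	fmt.Println(p90([]int{1, 2, 2, 3, 3, 4, 5, 6, 9, 30})) // 9
}
```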

Acceptance criteria

Future work

  1. Simulation 2-7 in https://github.com/status-im/bigbrother-specs/blob/master/data_sync/p2p-data-sync-mobile.md#proof-evaluation-simulation

  2. Pending specific numbers for changes/more stress testing

  3. Simulate over multiple nodes, or introduce general network latency/drops

  4. Ensure path for many more separate processes (~100?) => cluster solution? TBD

oskarth commented 5 years ago

^ @jakubgs @decanus @adambabik

jakubgs commented 5 years ago

So what is already available to make this happen? Where is the implementation? Do we have some kind of CLI tool for sending messages? @adambabik you would probably know where we stand on this. I'm happy to automate the tests and measure the results, but I'm not sure where to start on sending messages.

decanus commented 5 years ago

@jakubgs As of today the CLI implementation works, so starting CLI clients and having them communicate with each other is the way to go.

jakubgs commented 5 years ago

Oh, cool, I'll play with it then.

decanus commented 5 years ago

https://github.com/status-im/status-console-client/pull/34 check out this PR @jakubgs

StatusWrike commented 5 years ago

➤ Oskar Thoren commented:

Jakub Sokołowski is it doable for the basic test to be up and running this week? how much effort do you think it requires? do you need anything else?

jakubgs commented 5 years ago

Yeah, I'd like to know how I can send messages with the CLI agent without using the GOCUI thing. I can see there's a -no-ui flag, but it seems to just take away all interactivity. I can also see it's listening on 30303 and that's it, so I see no control port for RPC or anything like that. How can I send a message with this without having to use the CUI?

jakubgs commented 5 years ago

I was kinda expecting -no-ui to allow me to send messages by providing them through just stdin, but that doesn't seem to do anything. And this doesn't seem to have any commands for selecting a channel or contact to talk to.

jakubgs commented 5 years ago

BTW, this client is really cool actually. If I could somehow use my usual account with this (how can I extract my own private key hex?) and it had a few extra commands like /join, /leave, and /msg, I could totally switch from the abysmal desktop client. Though what would really seal the deal is a fuzzy search of channels in the left pane for easy switching.

StatusWrike commented 5 years ago

➤ Jakub Sokołowski commented:

And as to your question Oskar, it's definitely doable if I have a way to script it using the command line tool, but as it stands it seems to require human-like interaction.

oskarth commented 5 years ago

Agree that'd be useful and probably a requirement. @adambabik @dshulyak can we do this?

dshulyak commented 5 years ago

not sure if a command line api is a convenient way to implement such testing. to simulate configured churn, higher-level orchestration will be required. things like bandwidth overhead require in-app instrumentation and tooling to get meaningful data from such a simulation.

i had this tool for whisper and discovery testing. basically it had the following modules:

note that it is slightly outdated due to changes in status-go and tests are written in golang. simple example is here https://github.com/status-im/status-scale/blob/master/tests/example_test.go

jakubgs commented 5 years ago

That's a fair point, and I'd agree: if we want more in-depth data, then writing a tool with the explicit purpose of doing simulations like this is the way to go, kind of like https://github.com/status-im/simulation that @divan did some time ago.

But if all we want is some simple measurements, then this client could be enough. All you'd need to orchestrate it sending messages is a way to send commands via stdin in the -no-ui mode; this way we could have a separate tool that just pushes commands like /join and /msg and generates the load we want at the rates we want.
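For illustration, such a driver could be as small as this sketch (Go; it assumes /join and /msg get wired up to stdin in -no-ui mode, which is exactly the part that doesn't exist yet):

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Assumes status-console-client learns to read commands from
	// stdin when run with -no-ui; /join and /msg are the proposed,
	// not yet existing, commands.
	cmd := exec.Command("./status-console-client", "-no-ui")
	stdin, err := cmd.StdinPipe()
	if err != nil {
		panic(err)
	}
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	fmt.Fprintln(stdin, "/join test-channel")
	for i := 0; i < 5; i++ {
		fmt.Fprintf(stdin, "/msg test-channel message-%d\n", i)
		time.Sleep(time.Second) // crude rate control
	}
	stdin.Close()
	cmd.Wait()
}
```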

Based on the description of this issue, the questions we want to answer are:

Those metrics sound like they would require some code changes in this client to expose things like latency.

dshulyak commented 5 years ago

> That's a fair point, and I'd agree: if we want more in-depth data, then writing a tool with the explicit purpose of doing simulations like this is the way to go, kind of like https://github.com/status-im/simulation that @divan did some time ago.

that tool has a different use case. at least in my understanding, this test requires cluster setup, measurement, and simulation of various conditions (like configurable churn, or increased packet loss). the latter makes all the difference. it is very easy to implement them when you are simply writing a test in a programming language. otherwise, the tool itself will stand in your way when you test what you want.

also, it runs a simulation using only homogeneous clients; in real tests different nodes may perform different functions, and it is beneficial to control configuration from the test itself.

anyway, what I suggested is to treat this simulation as regular tests written in a programming language with tooling to help in the various areas.

for instance, consider how req/rep latency can be measured on the client. a client has to send and then poll, either on the same side for the rep or on the other for the req, and track the time. it is trivial in any programming language and the resulting code will be easy to read and maintain. what if i want to track how fast a message replicates to 10 clients? easy change.

the same thing for churn: trivial to implement when a test is written in a programming language and the test can control the behaviour of the clients.
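To make that argument concrete, a sketch of what such a test could look like (Go test style, against a hypothetical minimal client interface, not the actual status-scale API):

```go
package tests

import (
	"testing"
	"time"
)

// Client is a hypothetical minimal interface a test would drive.
type Client interface {
	Send(text string) (msgID string, err error)
	WaitFor(msgID string, timeout time.Duration) error
}

// measureReplication sends one message from sender and returns how
// long it takes every receiver to see it; growing receivers from 1
// to 10 clients is a one-line change at the call site.
func measureReplication(t *testing.T, sender Client, receivers []Client) time.Duration {
	start := time.Now()
	id, err := sender.Send("ping")
	if err != nil {
		t.Fatal(err)
	}
	for _, r := range receivers {
		if err := r.WaitFor(id, 10*time.Minute); err != nil {
			t.Fatal(err)
		}
	}
	return time.Since(start)
}
```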

> Those metrics sound like they would require some code changes in this client to expose things like latency.

it depends on what kind of latency was meant here. if it is for receiving a message, there is no way to measure it without knowing the id of the message, hence it has to be measured by the receiver itself.

if rtt has to be measured, then maybe the sender can wait for an ack; in this case the client can measure it and expose it

bandwidth has to be exposed, but i think both devp2p ingress/egress bytes are already exposed with metrics.

jakubgs commented 5 years ago

I guess the question is how in-depth we want to go. If this simulation is a one-off thing and we just want to answer only the questions @oskarth posed, then it might be fine to just make a few small changes to this client and use it for the simulation; but if we want more metrics, more control, and an easier ability to change the parameters of the simulation, then developing a dedicated test suite as you described would make sense.

As it stands I can do neither. I can't use status-console-client for the simple simulation right now because I have no simple way of scripting its behavior, nor can I do what you suggest, because I lack both the context of how the protocol and its implementation work and the skill to write something like that in Go. I could try, but it would take a considerable amount of time.

decanus commented 5 years ago

@jakubgs We should definitely opt for more extensive simulations with more metrics, control etc.

oskarth commented 5 years ago

Good discussion. I think both basic simple testing and more extensive testing with scaffolding etc are useful.

@dshulyak do you want to try to resurrect that framework and integrate into status-console-client so we can use it?

I did some super basic testing just to get ballpark answers re BW multiplier and latency. Essentially just using the mvds standard simulation and grep. https://notes.status.im/7QYa4b6bTH2wMk3HfAaU0w#

tldr: x100 BW overhead and 5m latency to receive message.

This shows the most naive results, and it is obviously not great. To make it more realistic, I suggest seeing what the results look like with mailservers (or Swarm, but we don't have that ready yet) etc. @adambabik is this integrated into status-console-client right now?

@decanus Can we expose the basic simulation code in mvds into status-console-client so we can run it over Whisper+Mailserver and get similar simulation output logs? This way we can get similar naive numbers for a more realistic setup. In parallel, we can work on more extensive test scaffolding etc.

Another thing: the logs seem to be on a per-message basis as opposed to per payload, so perhaps it'd make sense to reflect payloads being sent in the logs? I.e. 10 ACKs to a node isn't 10 messages but one payload.

adambabik commented 5 years ago

@oskarth yes, mailservers work with status-console-client.

Regarding the simulation tooling, I think it's a good idea but it will take some time. Also, it would be great if we can make it generic so that no changes in status-console-client are required to put it into this simulation framework. We can base it on a docker container as was mentioned.

I still plan to add two things:

StatusWrike commented 5 years ago

➤ Oskar Thoren commented:

Dean Eigenmann, Jakub Sokołowski: do you want to fork this issue in some way? Right now it's not very clear to me who will do what (you are both assigned here).

Adam Babik, Dmitry Shulyak: likewise, if you could fork the things that you personally want to do into separate issues from your POV and link them here, that'd be great.

If some task feels like it should be done but it isn't currently a priority, or isn't something you are able or interested in doing, please note so explicitly so we know where we have gaps.

corpetty commented 5 years ago

I have recently spoken with Trail of Bits (CCing @lojikil) about V1 audits, and Whisper came up. They have been investigating it by happenstance and lack a good client to do tests with.

It behooves us to work together on this, and either push this repo as that client (with a few QoL modifications) or build the test framework (how much work is this???)

Stefan can you give us some requirements you would need for security testing?

dshulyak commented 5 years ago

> @dshulyak do you want to try to resurrect that framework and integrate into status-console-client so we can use it?

i was planning to use this framework for testing the new chat API that will be used by status-react. but i can spin up some temporary api for tests.

> tldr: x100 BW overhead and 5m latency to receive message.

what does mvds do when it can't get an ack for a sent message? will it retransmit until an ack is received, or will it stop at some point?

dshulyak commented 5 years ago

> Regarding the simulation tooling, I think it's a good idea but it will take some time. Also, it would be great if we can make it generic so that no changes in status-console-client are required to put it into this simulation framework. We can base it on a docker container as was mentioned.

some changes will be required though, like exposing the metrics that we want to collect, or adding an api that will be used for tests.

jakubgs commented 5 years ago

@oskarth

> do you want to fork this issue in some way? Right now it's not very clear to me who will do what (you are both assigned here).

Agreed, and based on what Adam said here:

> it would be great if we can make it generic so that no changes in status-console-client are required to put it into this simulation framework

Here's a proposal for splitting this into 4 steps/tasks:

Does this make sense? These do sound like distinct tasks, but they don't seem easily parallelizable.

decanus commented 5 years ago

@jakubgs it seems like once the first step is complete, some of the other ones are parallelizable?

jakubgs commented 5 years ago

Possibly; it depends on how much work there would be to adjust the library to allow Trail of Bits to do their audit, and how much that would change how status-console-client interacts with it. At the very least the last two should be able to run in parallel.

dshulyak commented 5 years ago

there are api methods to interact with the messenger here https://github.com/status-im/status-console-client/pull/92, and you can also use curl (see API.md). it will be merged soon, probably tmrw, but if you want you can use that branch for tests the way you see them

and https://github.com/status-im/status-scale/pull/29 is a test example with this api and exported metrics. if needed you can introduce additional latency between peers and/or packet loss. will share an example later

StatusWrike commented 5 years ago

➤ Nabil Naghdy commented:

Is there an updated ETA on this? Not much activity in the GH issue in the last 2 weeks.

dshulyak commented 5 years ago

I can make measurements. I tried to use -ds but I can't make it deliver messages. Can someone confirm that all is good with ds in current master?

I am running the client binary in an isolated cluster, and the only difference from the "naive" mvds client is the absence of the -ds flag. Everything else is identical.

oskarth commented 5 years ago

Thanks @dshulyak. @decanus @adambabik could you please comment?

Also, I suggest we split this issue up into simple and more advanced. @adambabik did you do this somewhere else? If so, let's link it here. Currently this issue is too big in scope. As discussed, what we need right now is just a https://notes.status.im/7QYa4b6bTH2wMk3HfAaU0w# naive-style 5m test but with mailservers in production (and probably batch mode).

adambabik commented 5 years ago

@oskarth can we define first what we want to achieve? I seriously don't quite get the goal of this new simulation and also the old one.

We have https://notes.status.im/7QYa4b6bTH2wMk3HfAaU0w# which simulates communication between 3 nodes while 90% of the packets each node tries to send are dropped. It turned out that the latency is 5 minutes. What do these 5 minutes mean: that it took 5 minutes for two nodes to be online at the same time to exchange the necessary packets? Is this result random, or is it always around 5 minutes? How does this change when I increase the number of nodes or reduce the offline time?

I am not sure what the mailserver changes here. With the mailserver (or whatever storage technology we use), you just need to assume you get all packets (unsorted) at once from a given period of time, process them with MVDS, and examine the result. This can be done without actual mailserver integration, which in my opinion is totally unnecessary at this stage. You can perfectly well simulate this scenario in https://github.com/status-im/mvds/ first.
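In code, that assumption is just a buffer flushed in random order on reconnect; a sketch (Go, with hypothetical types, not the real mvds API):

```go
package mailsim

import "math/rand"

type packet []byte

// mailserverStub buffers packets while the recipient is offline and
// hands them all back, unsorted, when it comes online again,
// approximating a mailserver without running one.
type mailserverStub struct {
	buffered []packet
}

func (m *mailserverStub) Store(p packet) {
	m.buffered = append(m.buffered, p)
}

// Drain returns everything from the offline period in random order.
func (m *mailserverStub) Drain(rng *rand.Rand) []packet {
	out := m.buffered
	m.buffered = nil
	rng.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out
}
```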

If we want to have a simulation with Whisper transport and mailserver support, sure, we can do this. But it will be a different simulation which will include Whisper and mailserver latencies. It's not comparable with the current simulation you have.

However, I also have a problem with -ds. I don't get the messages so we need to dig into this problem first.

Also, I am sorry I did not respond earlier; I could have raised my concerns sooner and we could have moved faster with this :/

dshulyak commented 5 years ago

> We have https://notes.status.im/7QYa4b6bTH2wMk3HfAaU0w# which simulates communication between 3 nodes while 90% of the packets each node tries to send are dropped. It turned out that the latency is 5 minutes. What do these 5 minutes mean: that it took 5 minutes for two nodes to be online at the same time to exchange the necessary packets? Is this result random, or is it always around 5 minutes? How does this change when I increase the number of nodes or reduce the offline time?

i thought that it took 5m to deliver a message from the moment the sender posted it. e.g. there are at least 2 actors; one of them sends the message, that time is saved, and then we check the difference when the other side receives the message from the network. at least this is what i did in status-scale.

that number will grow with the offline period, e.g. if a node is offline for 24h and online for 1m then latency will be ~24h, if there is a mail server in the network. if there is no mail server then i doubt that anything will ever be received with such a low online window. this is not really a great metric in itself, for sure. i wanted to see what the bw overhead is with ds; i assume it has some logic to re-transmit messages. the original bramble will stop retransmissions to a peer if there is no direct connection with that peer, so how does mvds decide when it has re-transmitted enough? and how does it resume?

oskarth commented 5 years ago

@adambabik it means in one naive test run with nodes 10% online (2h a day) we got 5m latency and x100 BW increase for 1:1 chat. While this provided reliability over time, clearly this isn’t good enough.

Realistically, we are still going to have mailservers for v1. So we want to see what the naive run looks like in terms of bandwidth and latency, with similarly minimal naive assumptions (one or a few test runs or so, 10% end-node uptime, etc). Running it over an actual network allows us to tease out other things that we might not think of in advance. In this sense, having mailserver latency as an unknown is actually a plus.

As to whether it is 5m always I don’t know, but it seems roughly right. It’d be easy enough to check - just run five times and remove two outliers. Let’s not worry about more nodes for the moment, before we have basics right.

@dshulyak you can see the retransmission logic in the spec here: https://github.com/status-im/bigbrother-specs/pull/17#pullrequestreview-259860819. Essentially it is exponentially decaying, but easy enough to tweak. After the minimally viable part we can see if it makes sense to use some other scheme for retransmission, such as using online indicators or what not. This is decoupled and easy enough to tweak as we deploy it, IMO.
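For readers who don't follow the spec link, exponentially decaying retransmission in sketch form (Go; constants and names are illustrative, not mvds's actual parameters):

```go
package mvdssim

import "time"

// nextRetransmit doubles the delay after every unacked send, capped
// at max, so retransmission frequency decays exponentially until an
// ack arrives.
func nextRetransmit(attempts int, base, max time.Duration) time.Duration {
	d := base << uint(attempts) // base * 2^attempts
	if d <= 0 || d > max {      // d <= 0 guards against shift overflow
		return max
	}
	return d
}
```

With base = 30s, successive retries would wait 30s, 1m, 2m, 4m, and so on until the cap.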

cc @decanus re -ds flag issue

dshulyak commented 5 years ago

> As to whether it is 5m always I don’t know, but it seems roughly right. It’d be easy enough to check - just run five times and remove two outliers. Let’s not worry about more nodes for the moment, before we have basics right.

@oskarth how can it be 5m always? if a peer is offline for 24h, how can delivery latency be less than that? it sounds like you are measuring this latency as if every peer was online with a chance of 10%, e.g. a message is sent and has a 10% chance to be delivered. i don't think this approach reflects how clients are used.

> @dshulyak you can see the retransmission logic in the spec here status-im/bigbrother-specs#17 (review); essentially it is exponentially decaying, but easy enough to tweak. After the minimally viable part we can see if it makes sense to use some other scheme for retransmission, such as using online indicators or what not. This is decoupled and easy enough to tweak as we deploy it, IMO.

thanks, i understand that it can be exponentially decaying. the difference with bramble is that you know for sure that a peer is offline (direct connection) and thus you have a clear heuristic to stop/resume retransmission. with whisper you effectively never know if another peer is online or not, unless you are exchanging periodic heartbeats. just curious how it is possible to tweak that?

StatusWrike commented 5 years ago

5m seems roughly right given the 10% random uptime assumption. Of course it’d look different if you are offline for 24h. We can do more extended testing with other parameters, but this seems like the most basic KISS one to get done first. One useful extension would be longer offline periods as well as more “sticky” behavior (on 23h, off 1h), but these seem like enhancements.
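A back-of-envelope check of why ~5m drops out of those assumptions (editorial arithmetic, not from the thread): if the receiver is online with independent probability p = 0.1 in each 30s window and the sender effectively retries every window, the number of windows until delivery is roughly geometric with mean 1/p = 10 windows, i.e. about 5 minutes.

```go
package uptimesim

import "time"

// expectedDelay approximates delivery latency when the receiver is
// online with probability p in each window of length w and delivery
// succeeds in the first window where it is online: the geometric
// mean of 1/p windows. With p = 0.1 and w = 30s this gives 5m.
func expectedDelay(p float64, w time.Duration) time.Duration {
	return time.Duration(float64(w) / p)
}
```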

That’s true and fair. That’s why we want remote logs eventually to make it less sensitive to offline devices. Also if you have a different network topology you might have something resembling direct connections. Or pings. Or other heuristics.


StatusWrike commented 5 years ago

➤ Oskar Thoren commented:

To be done here, more precise task splits incoming https://github.com/status-im/mvds-simulations

dshulyak commented 5 years ago

@oskarth can you clarify why this measurement makes sense to you? just want to understand the thought process. is it something commonly used?

i see that you highlighted random uptime, i'm just not sure what to make of this measurement. does it mean that for any random participant in the network the average latency will be 5m and the bw overhead x100 over time?

when i think about delivery latency and potential bw overhead, i want to know the usage pattern (online/offline ratio and how often peers go online); based on this usage pattern i can do measurements using a small period of time (e.g. it is not practical to run a simulation with a period of 24h) and then extrapolate to longer, more realistic periods. but the longer the period, the longer the latency will be; the bw overhead ratio may decay because of the exponentially increasing retransmission timer, but how can latency stay constant? isn't that just intuitively wrong?

oskarth commented 5 years ago

It doesn't really matter. It's just the simplest thing that isn't completely wrong and doesn't add unnecessary assumptions. The point is you can test it in a few minutes, unlike the two months this issue has been open with little progress.

If there's another method that can be done and we can use this to gain confidence in our trade-offs that's fine. The point is to ship something sooner rather than later and then iterate with sophistication.

dshulyak commented 4 years ago

> If there's another method that can be done and we can use this to gain confidence in our trade-offs that's fine. The point is to ship something sooner rather than later and then iterate with sophistication.

I mentioned earlier that I could run this test (https://github.com/status-im/status-scale/blob/master/tests/clients_test.go) to understand bw overhead. The only issue was that the console client api couldn't get messages from data sync. It outputs something like this: https://github.com/status-im/status-scale/blob/master/RESULT.md

oskarth commented 4 years ago

@adambabik @decanus did you guys figure out why the -ds flag stopped working?

decanus commented 4 years ago

@oskarth I am not sure; it may be due to the larger architectural overhaul. I will try to figure it out on Monday, however.