ipfs-inactive / bitswap-ml

[ARCHIVED] bitswap + ml

State of the Bssim #8

Open kbala444 opened 9 years ago

kbala444 commented 9 years ago

@whyrusleeping, @jbenet

As requested, here is a guide to using bssim and some questions I still have about it.

Example Use

Let's say I want to see how samples/star workload is affected by different latencies.

The first line of a workload file is the config line. In the star workload, there are 10 nodes (node_count:10). By default, all the nodes are linked to every other node, and the latency: 5 option sets every link in the network to have a latency of 5ms.

You can also set the options of each link yourself.
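For concreteness, the config line of a workload like star might look something along these lines (only node_count, latency, and a bandwidth field are described above, so the exact spelling and separators here are a guess — check the samples in the repo for the real syntax):

```
node_count:10, latency: 5, bandwidth: 100
```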

The first step is to edit data/config.ini. I want to run the workload with latencies of 0, 10, 40, and 70ms and a bandwidth of 1000mbps, so I edit the config file to look like this:

[screenshot: config]

The latency and bandwidth lists will overwrite the latency and bandwidth fields I specified in the config line of the star workload. I also configured it to show graphs after the script finishes and save them to graphs.pdf.
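So for this example, data/config.ini ends up with something like the following. The key names here are illustrative, not the real ones — check data/config.ini in the repo for the actual spelling:

```ini
; illustrative sketch only -- real key names are in data/config.ini
latencies  = 0, 10, 40, 70   ; ms, one run per value
bandwidths = 1000            ; mbps
graph      = true            ; show graphs when the script finishes
graph_file = graphs.pdf      ; where to save them
```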

I then run ./scripts/latbw.sh samples/star. This will run the star workload with the latencies and bandwidths from data/config.ini.

At the end of each run, some basic information about it is printed:

[screenshot: starresults]

(sorry, I forgot to disable f.lux when taking these screenshots)

These were the graphs generated at the end of the script. Here is the one I wanted: it takes the mean block time across all recorded samples/star runs and graphs it against the latency of each run:

[screenshot: latmean]

This is a graph of block times over time for the latest run:

[screenshot: bttime]

And one for file completions over time for the latest run:

[screenshot: filegraph]

A full list of commands and options is in the readme.

The script that makes the graphs is data/grapher.py, so if you want to review the graphs from a workload you ran earlier, you can just run python grapher.py by itself.

Manual links

It's not very realistic for all nodes in the network to have the same latencies and bandwidths, so you can also manually set the link options between two nodes with the [connecting_nodes]->dest_node syntax. This gets super tedious with more than about 5 nodes, however, so there's also a tool in the repo called latgen.py. latgen.py generates nodes with random locations in the U.S. and England and outputs these node->node lines with realistic latencies and (not yet very realistic) bandwidths.

If I wanted to make a star workload using latgen, I could:

cp star starwithconns
python latgen/latgen.py -i samples/starwithconns -b 50 -t star -l

This will create a new workload, samples/starwithconns, that does the same stuff as star, except with a bunch of realistic connections.

Here is the graph generated from latgen:
[screenshot: networkgraph]

And the new starwithconns file:

[screenshot: starwithconns]

If you only want to run the workload once with the latency and bandwidth specified in its config line, you can run ./scripts/singlerun.sh samples/starwithconns. If you don't care about graphs and just want the end-of-run stats, you can run ./bssim -wl samples/starwithconns and then later run python data/grapher.py if you change your mind.

Questions/limitations

Right now, bandwidth is specified per link in mocknet. Each link has its own bandwidth cap, and a node has a link to each node it's connected to. Shouldn't the total outgoing bandwidth from a node be capped, rather than the bandwidth per link?
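To make the concern concrete, here's a back-of-the-envelope sketch (the helper is hypothetical, not bssim code; numbers are illustrative): with per-link caps, a fully connected node's aggregate outgoing rate scales with its degree.

```python
# Per-link capping: in a fully connected network of `node_count` nodes,
# each node has node_count - 1 outgoing links. If each link is capped
# at `link_cap_mbps`, the node can in principle send the sum of all caps.
def aggregate_outgoing_mbps(node_count: int, link_cap_mbps: int) -> int:
    return (node_count - 1) * link_cap_mbps

# 10-node network (as in the star workload), 1000 mbps per link:
print(aggregate_outgoing_mbps(10, 1000))  # 9000 -- 9x any single link's cap
```

A per-node cap would instead bound the sum across all of a node's links at once, regardless of how many peers it is connected to.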

Also, a general mocknet question: if I want two nodes to be able to send/receive messages to each other, do I need to call both LinkPeers(n1, n2) and LinkPeers(n2, n1), or just one of them?

Any requests/simplifications/questions?

whyrusleeping commented 9 years ago

@heems could we get an 'observed bandwidth' per file transfer field in the output?

Other than that, this is all looking really nice! Are the results reproducible? If so, what is the margin of error between runs? (if not, uh oh :( )

kbala444 commented 9 years ago

could we get an 'observed bandwidth' per file transfer field in the output?

Sure, outgoing bandwidth should be easy, but it would require changes to mocknet as well (maybe adding an outgoing bandwidth field to conn in mock_conn.go). Would the total bytes a peer sent, divided by the file transfer time, be an accurate outgoing bandwidth? And would this be an outgoing bandwidth estimate for every active peer on each file transfer?
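The estimate I have in mind is just this (hypothetical helper, only to pin down the arithmetic and units):

```python
def observed_bandwidth_mbps(bytes_sent: int, transfer_seconds: float) -> float:
    """Total bytes a peer sent during a file transfer, divided by the
    transfer's wall-clock time, expressed in megabits per second."""
    if transfer_seconds <= 0:
        raise ValueError("transfer time must be positive")
    return (bytes_sent * 8) / (transfer_seconds * 1_000_000)

# e.g. 25 MB sent over a 2 s transfer:
print(observed_bandwidth_mbps(25_000_000, 2.0))  # 100.0
```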

Other than that, this is all looking really nice! Are the results reproducible? If so, what is the margin of error between runs? (if not, uh oh :( )

Thanks! I'm not completely sure, but I think that if latency is 0 and bandwidth is uncapped, the block time speeds would depend on the speed of your computer. However, if the CPU isn't the bottleneck, the results are reproducible, and even if the CPU is the bottleneck, running the same workloads on the same CPU will give similar results.

I recorded some statistics about bssim (xzibit.jpg), where each workload was run 10ish times. The samples/star workload was run with no latency or bandwidth cap, and samples/starwithconns was run with its existing link settings.

[screenshot: errorstats1]

I'm not sure what the acceptable variance for duplicate blocks is in a less trivial workload like samples/viral, but here are the stats for that (run with latency: 3 and bandwidth: 100).

[screenshot: error2]
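The spread numbers are just mean/stdev across repeated runs, along these lines (the sample values below are made up, only to show the computation):

```python
import statistics

# Hypothetical mean-block-time results (ms) from repeated runs of one workload.
runs_ms = [41.2, 39.8, 40.5, 42.1, 40.0, 41.7, 39.5, 40.9, 41.3, 40.4]

mean = statistics.mean(runs_ms)
stdev = statistics.stdev(runs_ms)   # sample standard deviation
cv_percent = 100 * stdev / mean     # relative spread between runs

print(f"mean={mean:.2f} ms, stdev={stdev:.2f} ms, cv={cv_percent:.1f}%")
```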

Sorry for the late reply btw, been sick the last couple of days.

whyrusleeping commented 9 years ago

@heems this is good stuff! The high variance on duplicate blocks is pretty much what I expected; the lowish variance on times is nice though!

How are you measuring the variance on block duplicates? Is it across all nodes on all runs, or across the same node over a series of runs? The number of duplicate blocks received is going to depend on how many other nodes in the network receive blocks first, so I'm not sure how best to measure that.

At any rate, I like what I see, we're getting really close to the point where we can just start tweaking bitswap and seeing how it reacts :)

next steps:

- make sure numbers being reported look right (and that they really mean what we think they mean)
- automate bssim stuff, maybe in CI?

kbala444 commented 9 years ago

How are you measuring the variance on block duplicates? is this across all nodes on all runs? or is it across the same node over a series of runs?

It's across all nodes on all runs, but it should be easy to get it across the same node too if you want that statistic.
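To be concrete about the two groupings (the counts here are made up, just to show the difference):

```python
import statistics

# Hypothetical duplicate-block counts: runs[i][node] is the number of
# duplicate blocks `node` received in run i.
runs = [
    {"n1": 3, "n2": 7, "n3": 5},
    {"n1": 4, "n2": 9, "n3": 4},
    {"n1": 2, "n2": 8, "n3": 6},
]

# Across all nodes on all runs (what the stats above use): pool everything.
pooled = [count for run in runs for count in run.values()]
print("pooled stdev:", statistics.pstdev(pooled))

# Across the same node over a series of runs: one series per node.
for node in runs[0]:
    series = [run[node] for run in runs]
    print(node, "stdev:", statistics.pstdev(series))
```

The pooled number mixes node-to-node differences into the run-to-run variance, which is why the per-node grouping might be the more useful statistic here.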

make sure numbers being reported look right (and that they really mean what we think they mean)

I'll add more bssim unit tests for sure, but how else would you suggest verifying the numbers?

automate bssim stuff, maybe in CI?

CI == continuous integration? Could you elaborate on this? Sorry, I'm not really sure what you mean.

Yay can't wait to work on bitswap