ipfs / iptb

InterPlanetary TestBed 🌌🛌

Collecting statistics #77

Open travisperson opened 6 years ago

travisperson commented 6 years ago

See #65 for complete history


Original comment by @davinci26

Hey y'all,

Thanks for the project it helped me a lot!

As discussed in the IPFS issue board (IPFS Performance #5226), I made some changes to the IPTB framework to measure the performance of IPFS and generate performance graphs. In detail, I added the following functions to the framework:

  • iptb make-topology: This creates a connection graph between the nodes (e.g. star topology, barbell topology). In the topology files, empty lines and lines starting with # are disregarded. Each non-empty line has the syntax origin: connection 1, connection 2, ..., where origin and the connections are specified by node ID (see the illustrative example after this list).
  • iptb dist -hash: The simulation distributes a single file from node 0 to every other node in the network. It then calculates the average time required to download the file, the standard deviation of that time, the maximum time, the minimum time, and the number of duplicate blocks. The results are saved in a generated file called results.json.
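For illustration only, a topology file for a four-node star centred on node 0 could look like the following (a hypothetical example following the syntax described above):

# Star topology: node 0 connects to every other node
0: 1, 2, 3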

I also added a Python 3 script to plot the results, which adds an optional dependency to the project, Matplotlib.

Finally, I created a readme file, simulation.md, that explains the logic of the simulation. I also included there the response of @whyrusleeping to the issue Simulate bad network #50, so people know that bad-network simulation is supported.

I would appreciate your feedback and any improvement suggestions :)

travisperson commented 6 years ago

Originally this work was done against iptb prior to the transition to plugins. After the transition, we wanted to provide a generic way to handle the implementation of iptb dist, which was basically recording timing information around a RunCmd call.

The solution to this was to add a generic way to capture stats around iptb run by recording execution time, and calculating different stats to be reported at the end.

However, I think recording and reporting the elapsed execution time is probably a useful enough thing on its own that we should just add it to everything that goes through the generic reporting. If we expose the elapsed time as output, I think it provides enough information to calculate different statistics outside of iptb itself.
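As a rough sketch of that last point (not part of iptb itself): a caller can already time a whole iptb run invocation externally and build statistics on top of it, although per-node elapsed times reported by iptb would be far more precise.

import subprocess
import time

# Hypothetical external timing of a full `iptb run` invocation.
# This only measures the whole CLI call; per-node timing would have to
# come from iptb itself, as discussed above.
start = time.monotonic()
proc = subprocess.run(["iptb", "run", "--", "ipfs", "id"], capture_output=True)
elapsed = time.monotonic() - start

print("exit code: {}".format(proc.returncode))
print("elapsed:   {:.3f}s".format(elapsed))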

There are two other pieces, though, that I think also need to be touched on:

1) Parsing output
Parsing generic output is not always ideal. We might be able to solve this quite simply by supporting different encodings for the output, at first just text or JSON.

2) Collecting metrics
Currently, using iptb metric is the only way to do this, and for the most basic metrics it works okay, as a user can run the collection before and after. This type of collection only works for accumulated metrics, such as bandwidth, or other metrics which aren't of a realtime nature.

Real-time metrics (CPU, RAM, etc.) are another thing, and I'm open to discussion around these.

To summarize, I think a simple approach to supporting this use case is, at first, to add an elapsed time to every output alongside the exit code, and to add the ability to return output as JSON. Metrics can be collected independently as the user sees fit.
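As a rough sketch only (the field names are illustrative, chosen to line up with the sample script at the end of this thread), JSON-encoded output for a run might look like:

{"results": [
  {"node": 0, "exit": 0, "elapsed": 0.21, "output": "QmXschyVzLm..."},
  {"node": 1, "exit": 0, "elapsed": 0.34, "output": "Qme3h8Wwfp..."}
]}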

davinci26 commented 6 years ago

I think that designing features such as #75, #76, and #77 moves IPTB in a direction that puts a lot of additional weight on it.

General Thoughts

My thoughts on the subject, as a user of the project and as a developer in general, are:

I think there is room for such features because there are a lot of projects (OpenBazaar, etc.) that want to measure the performance of IPFS and add it as a component of their system (#50, #26). I also got involved in the project to measure the performance of IPFS, because it was a crucial component of my system. The question that remains is whether the core development team wants to take on this burden or leave it to users. For me, both options have benefits, and it highly depends on the time the core devs have available. I understand that you may prefer to spend time developing/improving IPFS/libp2p rather than IPTB. It's a decision the core devs should make, since they have a more holistic view of the IPFS milestones. Personally, I trust you to make a good decision.

Output

I agree as far as the elapsed time is concerned; the current implementation of elapsed time is robust. I would prefer having the output as a JSON file or txt file after the individual results as it makes it easier to parse.

Something like this:

iptb run -- ipfs id --format="<id>"
node[0] exit 0

QmXschyVzLmS4JqPN1kuhCXTjau2oQkVuzjvTbQFTGm3w3
node[1] exit 0

Qme3h8WwfpBiPHPfdEs9GuegijVhaBX9xYPXTTDAS6uciR
node[2] exit 0

Time Results: {Specified format}

This will make parsing from other programs easier compared to scraping the individual per-node results.

Metrics

  1. For real-time metrics, it would be more reasonable for them to be produced by the plugin, with IPTB just exposing an interface. As discussed on IRC, maybe IPTB could request/get heartbeats from the plugin that contain real-time metrics and forward them to the user (a rough sketch of what consuming such heartbeats might look like follows this list). If you are interested in a design like this, I can take a more detailed look at what it would involve and post my findings here so we can iterate on the design.

  2. An additional issue with the metrics is that currently you can only collect a single metric rather than several at once (correct me if I am wrong).
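For what it's worth, here is a rough sketch of a consumer for such heartbeats. This is entirely hypothetical: the field names are invented for illustration, and it assumes iptb would forward one JSON object per line from the plugin.

import json
import sys

# Hypothetical heartbeat consumer: each line is assumed to be a JSON object
# such as {"node": 0, "ts": 1530000000.0, "cpu": 12.5, "rss": 104857600}
# forwarded by iptb from the plugin.
for line in sys.stdin:
    try:
        hb = json.loads(line)
    except ValueError:
        continue
    print("node[{}] cpu={:.1f}% rss={}MiB".format(
        hb["node"], hb["cpu"], hb["rss"] // (1024 * 1024)))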

Stats

Providing basic stats based on elapsed time is essentially a free primitive in terms of development and computational cost, from my perspective. The same does not hold for calculating stats on metrics. Additionally, it could be used to automate the benchmarking of plugins instead of everyone writing their own custom benchmarking.

cc @dgrisham

travisperson commented 6 years ago

@davinci26 thanks for writing all of this out! I want to respond to it all, but won't be able to for 12 hours.

I did want to comment quickly though about the output

I would prefer having the output as a JSON file or txt file after the individual results as it makes it easier to parse.

I want to provide an easy way to parse the output, but I don't want to mix that with the human-readable text if we can avoid it. One way to solve this would be to support an output encoding (e.g. iptb --enc json run -- <cmd>), which would encode everything into something that can be parsed easily.

One of the things I did like about the original idea for a "stats" flag was that it provided an easy way to get just the stats out without interfering with the other output of the command.

It actually provided a really interesting way to interact with iptb for stat gathering purposes.

I wrote a small Python script which reads from stdin (could be any file, I guess), parses each line, and calculates some basic stats.

To connect it up to iptb, I made a named pipe. Every iptb command I ran would print the stats out to the named pipe.

On the other end of the pipe was the python script. So for every command I ran through iptb, it would print the stats in another window.

(Example)

$ mkfifo stats
$ iptb run --stats ./stats -- ipfs id

In another window

$ tail -f ./stats | python stats.py

This provides a really easy way to collect some output and run whatever calculations you want over it. I'm just not sure exactly what we want to be in the output, or if this is exactly the way to do it.

One possibility is to have event logging around the Core interface, which would provide a much more detailed look into what is happening everywhere around the plugin. This would be a much more generic implementation and I think would give users almost everything they need, or at least an easy way to extend. Basically: which method on the plugin was invoked, and what it was called with.
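Purely as a hypothetical illustration of that idea (no such log exists in iptb today), the event log could be one JSON object per Core-interface call, e.g.:

{"node": 0, "method": "RunCmd", "args": ["ipfs", "id"], "elapsed": 0.21}
{"node": 1, "method": "Connect", "args": ["node[0]"], "elapsed": 0.03}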

Script

import sys
import statistics
import json

# Read JSON-encoded iptb results from stdin (e.g. via a named pipe) and
# print basic statistics over the per-node elapsed times.
print("MEAN\tSTDEV\tVARIANCE\n")
for line in sys.stdin:
    try:
        jline = json.loads(line.rstrip())
    except ValueError:
        # Skip anything that isn't a JSON line (e.g. human-readable output).
        continue

    nums = [o['elapsed'] for o in jline['results']]
    if len(nums) < 2:
        # statistics.stdev/variance need at least two samples.
        continue

    mean = statistics.mean(nums)
    stdev = statistics.stdev(nums)
    variance = statistics.variance(nums)

    print('{:.2f}\t{:.2f}\t{:.2f}'.format(mean, stdev, variance))

dgrisham commented 6 years ago

Some thoughts (will respond with more as things percolate):