mcollina / autocannon

fast HTTP/1.1 benchmarking tool written in Node.js
MIT License

Inform about the number of observations #143

Open AndreasMadsen opened 6 years ago

AndreasMadsen commented 6 years ago

To compute a confidence interval or determine whether there is a statistically significant difference between two versions, at least three pieces of information are required:

- the mean
- the standard deviation
- the number of observations

The number of observations is not really clear from the output.
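Those three numbers are exactly what a confidence interval is built from. A minimal sketch, assuming the normal approximation (z ≈ 1.96), which is reasonable for roughly 30 or more samples:

```js
// Sketch: 95% confidence interval for mean req/sec from the three values
// above. For small sample counts a t-distribution critical value would be
// more appropriate than the normal z ≈ 1.96.
function confidenceInterval (mean, stddev, n) {
  const halfWidth = 1.96 * stddev / Math.sqrt(n)
  return { lower: mean - halfWidth, upper: mean + halfWidth }
}

// e.g. mean 4500 req/sec, stddev 300, over 30 one-second samples:
console.log(confidenceInterval(4500, 300, 30))
// -> { lower: ≈4392.65, upper: ≈4607.35 }
```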

In a previous conversation @mcollina wrote:

> The approach that autocannon takes is that it samples once per second the number of requests that happened within that second (https://github.com/mcollina/autocannon/blob/master/lib/run.js#L119-L122). So, it's 30 samples.

It is not really clear to me how that data is aggregated, or how it relates to 30 samples.

mcollina commented 6 years ago

Every time a request is successfully completed, autocannon increases the counter variable by 1 in https://github.com/mcollina/autocannon/blob/master/lib/run.js#L210-L218.

Every second, the counter value is sampled and reset to 0 in https://github.com/mcollina/autocannon/blob/master/lib/run.js#L111-L119.

The sampled value goes in an instance of https://www.npmjs.com/package/hdr-histogram-js, which provides all the calculations. This is probably not the best data structure for this, as we do not have a lot of samples to deal with.

The number of samples for the req/sec and throughput is equivalent to the number of seconds the benchmark has run.
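Put together, the loop is roughly the following (a simplified sketch of the pattern described above, not the actual autocannon source; names are illustrative):

```js
const hdr = require('hdr-histogram-js')

// Histogram of "requests completed per second" samples.
const requestsPerSecond = hdr.build()
let counter = 0

// Called for every successfully completed request.
function onResponse () {
  counter++
}

// Once per second: record the current count as one sample, then reset it.
setInterval(() => {
  requestsPerSecond.recordValue(counter)
  counter = 0
}, 1000)

// At the end of the run the histogram provides the aggregates,
// e.g. requestsPerSecond.mean and requestsPerSecond.stdDeviation.
```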

AndreasMadsen commented 6 years ago

> The sampled value goes in an instance of https://www.npmjs.com/package/hdr-histogram-js, which provides all the calculations. This is probably not the best data structure for this, as we do not have a lot of samples to deal with.

If the purpose is constant memory, see Welford's online algorithm.
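For reference, a minimal JavaScript sketch of Welford's algorithm (the textbook version, not the clinic-doctor implementation linked below):

```js
// Welford's online algorithm: mean and variance in constant memory,
// one pass over the data.
class OnlineStats {
  constructor () {
    this.n = 0
    this.mean = 0
    this.m2 = 0 // running sum of squared deviations from the mean
  }

  add (x) {
    this.n++
    const delta = x - this.mean
    this.mean += delta / this.n
    this.m2 += delta * (x - this.mean) // note: uses the updated mean
  }

  get variance () {
    return this.n > 1 ? this.m2 / (this.n - 1) : 0 // sample variance
  }

  get stddev () {
    return Math.sqrt(this.variance)
  }
}
```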

> The number of samples for the req/sec and throughput is equivalent to the number of seconds the benchmark has run.

Hmm, this is a bit odd, especially if nothing happened in that second. I also don't understand how it doesn't contradict your first statement "Every time a request is successfully completed, autocannon increases the counter variable by 1".

Can you show me where the final mean and standard deviation values are calculated? Maybe I can backtrack it from there.

mcollina commented 6 years ago

Is there a module you would recommend to calculate online variance, mean, min and max? It would not fix the sample issue.

Mean and standard deviation are calculated by the hdr histogram based on the recorded values (samples). As I am interested in the mean and stddev of the number of requests that happen in a second, I count the number of requests in every given second and then use that value as my sample. If none happen in a given second, that's a zero to me.

This module is what I use to create the end results: https://github.com/thekemkid/hdr-histogram-percentiles-obj/blob/master/index.js

AndreasMadsen commented 6 years ago

> Is there a module you would recommend to calculate online variance, mean, min and max? It would not fix the sample issue.

I don't think it has been implemented in a module, but you can steal the one from clinic-doctor: https://github.com/clinicjs/node-clinic-doctor/blob/master/analysis/guess-interval.js#L145

I really like that algorithm, as it is quite numerically stable, constant memory, and easy to implement. Theoretically it is not as stable as a two-pass algorithm, but for most purposes it is fine. Theoretically it also uses more flops, but in practice that is easily offset by its constant memory footprint, which lets the state fit into registers.

> Mean and standard deviation are calculated by the hdr histogram based on the recorded values (samples). As I am interested in the mean and stddev of the number of requests that happen in a second, I count the number of requests in every given second and then use that value as my sample. If none happen in a given second, that's a zero to me.

I see. I thought long and hard about it, and ... it is okay. It threw me off a bit because you will artificially decrease the standard deviation when you decrease the sample resolution, which is the opposite of most intuition. However, the standard error will compensate for that in the end, so it is fine.
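To make that concrete, here is a worked example under an iid assumption (which, as noted below, does not strictly hold here). Suppose each 100 ms interval sees a count $X_i$ with variance $\sigma^2$. In requests/second units, a 100 ms sample is $10 X_i$ with variance $100\sigma^2$, while a 1 s sample $Y = \sum_{i=1}^{10} X_i$ has variance only $10\sigma^2$, so coarser sampling does shrink the standard deviation. But over a $T$-second run, the standard error of the mean rate comes out the same either way:

$$\mathrm{SE}_{1\,\mathrm{s}} = \sqrt{\frac{10\sigma^2}{T}}, \qquad \mathrm{SE}_{100\,\mathrm{ms}} = \sqrt{\frac{100\sigma^2}{10\,T}} = \sqrt{\frac{10\sigma^2}{T}}$$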

You are breaking a bunch of independence assumptions because a new request can't be started before another has completed. Supposedly, you could fix that with some fancy Poisson-variant of a Gamma distribution, but I think that ends up creating just as many new assumptions that are likely to break as well.

mcollina commented 6 years ago

I know about the broken assumptions. However, I'm not sure how to better express those. Do you think it would be better to use percentiles instead of stddev? See https://github.com/mcollina/autocannon/issues/138.

AndreasMadsen commented 6 years ago

> I know about the broken assumptions. However, I'm not sure how to better express those.

I wouldn't think too hard about it, as I don't see any good solutions.

> Do you think it would be better to use percentiles instead of stddev? See #138.

A standard deviation doesn't assume anything about the distribution, so it is not invalid in that sense. However, it is a pretty useless summary to present. Not because the data might not be normally distributed, but because on its own it says nothing about the variation/deviation of the mean. At the very least, you also need to know the number of samples it was estimated from.

Think of the standard deviation as an intermediate value that is good to keep around and easy to do math with. It is excellent for that purpose, but it is not a good final value to present. (Actually, the variance is easier to do math with, but they are more or less the same thing.)
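As one illustration of "easy to do math with" (a standard result, not something specific to autocannon): when comparing two independent benchmark runs, variances add while standard deviations do not, so the standard error of the difference between the two means is

$$\mathrm{SE}_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

which again requires knowing the sample counts $n_1$ and $n_2$.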

In conclusion, I would definitely show the empirical 2.5% and 97.5% percentiles instead.
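Reading those two percentiles out of the histogram could look like this (a sketch against the hdr-histogram-js API; the recorded values are made-up per-second request counts):

```js
const hdr = require('hdr-histogram-js')

const histogram = hdr.build()
for (const count of [98, 101, 99, 103, 97, 100]) {
  histogram.recordValue(count) // requests completed in one second
}

// The empirical 2.5% and 97.5% percentiles bound the middle 95% of samples.
console.log({
  'p2.5': histogram.getValueAtPercentile(2.5),
  'p97.5': histogram.getValueAtPercentile(97.5)
})
```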

As for what distribution it actually is, I would need to understand how the latency is estimated to give an educated guess, but most likely it is a gamma distribution that is close to a normal distribution. But really, just plot the histogram. For the advanced user, there is also the Kolmogorov–Smirnov test.

For my NodeConf EU talk, I actually used gamma distributed data, as that is what typically exists in benchmarks. And you know what, everything works out okay :)


In any case, I would really recommend that you include the number of observations the mean and variance are estimated from. While autocannon may run for 30 seconds, it is not obvious it is also 30 observations. You could also have sampled every 100ms, in which case you would have 300 observations.
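For example (a hypothetical summary shape with made-up numbers; the field names are illustrative, not autocannon's actual output format), reporting the sample count lets the reader derive the standard error of the mean:

```js
// Hypothetical summary: mean and stddev alone are not enough, but together
// with the sample count the standard error follows directly.
const requests = { mean: 4500, stddev: 300, samples: 30 }
const stderr = requests.stddev / Math.sqrt(requests.samples)
console.log(stderr) // ≈ 54.8 req/sec
```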

GlenTiki commented 6 years ago

Seems like we should also make the number/frequency of observations configurable, defaulting to 1 per second.

AndreasMadsen commented 6 years ago

> Seems like we should also make the number/frequency of observations configurable, defaulting to 1 per second.

That would be great for statistical significance. Just remember to scale the output unit so it remains n/s.
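A sketch of what that scaling could look like (the sampleIntervalMs option name is hypothetical, not an actual autocannon flag):

```js
const hdr = require('hdr-histogram-js')

const requestsPerSecond = hdr.build()
const sampleIntervalMs = 100 // e.g. 10 observations per second instead of 1
let counter = 0 // incremented on every completed request, as before

setInterval(() => {
  // Scale the raw count so each recorded sample stays in requests/second.
  requestsPerSecond.recordValue(counter * (1000 / sampleIntervalMs))
  counter = 0
}, sampleIntervalMs)
```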

drewva commented 2 years ago

Hey, I was wondering: is this enhancement still desired by the autocannon team?

mcollina commented 2 years ago

Sure, if you would like to work on it!

drewva commented 2 years ago

I made some progress on the enhancement, but I have a few questions about how you all want it to turn out, specifically:

mcollina commented 2 years ago

Could you make suggestions? Pick what makes sense to you; we can check it during review.

drewva commented 2 years ago

My suggestions are:

mcollina commented 2 years ago

Go for it!