google / benchmark

A microbenchmark support library

Store args in JSON output #838

tesch1 opened this issue 5 years ago

tesch1 commented 5 years ago

It would be useful to (optionally?) store the arg(s) in the output file.

Let's say you have a benchmark whose performance differs for different sizes of N, and you want to plot that function's performance across a relevant range of N. Currently, for both JSON and CSV, the N value is embedded in the benchmark name, so in order to use N as an axis you need to parse it out.
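
For concreteness, a minimal sketch of how N typically ends up embedded in the name (the function and sizes here are just placeholders):

    #include <benchmark/benchmark.h>
    #include <vector>

    // Hypothetical example: N is the container size under test.
    static void BM_MyFunc(benchmark::State& state) {
      const int64_t N = state.range(0);
      std::vector<int> data(N);
      for (auto _ : state) {
        benchmark::DoNotOptimize(data);
      }
    }
    // Produces runs named "BM_MyFunc/8", "BM_MyFunc/64", ..., "BM_MyFunc/8192",
    // so N is only available as part of the name string.
    BENCHMARK(BM_MyFunc)->RangeMultiplier(8)->Range(8, 8 << 10);

    BENCHMARK_MAIN();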

For JSON it would be nice to be able to import a dataset and use the Ns without parsing them out of the name field. This could be addressed at the same time as #814 - there could be a flag to switch between the current mode and a new mode of allowing multiple entries with the same "name" but different "args" entries. (I assume the reason "name" and "run_name" are the same is to always have a unique "name"?)

For plotting in, e.g., gnuplot, having one or more columns in the CSV with the args would also be useful, and maybe extra newlines between name blocks.

LebedevRI commented 5 years ago

This could be addressed at the same time as #814 - there could be a flag to switch between the current mode and a new mode of allowing multiple entries with the same "name" but different "args" entries.

It shouldn't be a flag; more of that info should be serialized in JSON indeed.

tesch1 commented 5 years ago

Here's an implementation proposal; the goal is to make it easy to import a JSON file into existing libraries (e.g. pandas) without having to muck around with the field names. Please let me know what you think:

For CSV... I'm less sure of this being comprehensive; basically I want to be able to run gnuplot on a CSV output file without mucking about too much. (It seems CSV is not actually being deprecated?)

The CSV columns should probably be added after the existing ones, so as not to break existing column-index dependencies, even though that's aesthetically 'alternative'.

LebedevRI commented 5 years ago

For CSV...

I'm pretty sure it is being deprecated, so don't worry about the CSV reporter; only JSON should be extended.

tesch1 commented 5 years ago

(P.S. this is also somehow related to #190.)

Matthdonau commented 2 years ago

Hi @LebedevRI, as this issue is pretty old I wanted to ask whether this is still desired. If yes, I would suggest:

    {
      "name": "BM_ComplexityCaptureArgs/capture_test/2/4",
      "family_index": 9,
      "per_family_instance_index": 3,
      "run_name": "BM_ComplexityCaptureArgs/capture_test/2/4",
      "function": [
        {
          "name": "BM_ComplexityCaptureArgs",
          "argument_1": "2",
          "argument_2": "4",
        },
      ],
      "run_type": "iteration",
      "repetitions": 1,
      "repetition_index": 0,
      "threads": 1,
      "iterations": 35897,
      "real_time": 3.9685489296170933e+00,
      "cpu_time": 3.9279048388445341e+00,
      "time_unit": "ns"
    },

What do you think? I wanted to clarify before adapting the many test cases which parse the JSON output.

dmah42 commented 2 years ago

I would have "name" as just "BM_ComplexityCaptureArgs", as the rest is in "run_name". I'm not sure of the point of the function array then; you could just have "arguments": ["2", "4"] instead.

Matthdonau commented 2 years ago

I like your idea (if I understand correctly). So, to be on the same page, is this what you suggest?

      "name": "BM_ComplexityCaptureArgs/capture_test/2/4",
      "family_index": 9,
      "per_family_instance_index": 3,
      "run_name": "BM_ComplexityCaptureArgs/capture_test/2/4",
      "name": "BM_ComplexityCaptureArgs",
      "arguments": ["2", "4"],
      "run_type": "iteration",
      "repetitions": 1,
      "repetition_index": 0,
      "threads": 1,
      "iterations": 38889,
      "real_time": 3.5474814532345795e+00,
      "cpu_time": 3.5742755020699999e+00,
      "time_unit": "ns"

dmah42 commented 2 years ago

you have name twice, but otherwise yes.

Matthdonau commented 2 years ago

Thanks. I will work on this once #1004 is done, if nobody else has done it in the meantime :)

matta commented 1 year ago

There are a few other things that would be good to think about adding to the JSON report schema when implementing this:

  1. Including the benchmark::State::SetLabel string.
  2. Including the benchmark::State::counters.
  3. Including benchmark::CustomContext key/value pairs.
  4. Providing a way for benchmark functions to report arbitrary parameterizations (think "stringified" template args or enum names or even the semantic meaning of their numerical "args").

I think 1-3 are separate issues and not relevant here. We could add those with a schema addition, without touching name.
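
For context, items 1 and 2 refer to existing State APIs; a minimal sketch of how they are set today (the benchmark body and names are placeholders):

    #include <benchmark/benchmark.h>

    static void BM_Process(benchmark::State& state) {
      int64_t items = 0;
      for (auto _ : state) {
        items += state.range(0);
      }
      // Item 1: free-form label attached to this run.
      state.SetLabel("cache_cold");
      // Item 2: user counter; kIsRate reports the value divided by elapsed time.
      state.counters["items"] =
          benchmark::Counter(static_cast<double>(items), benchmark::Counter::kIsRate);
    }
    BENCHMARK(BM_Process)->Arg(1024);

    BENCHMARK_MAIN();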

But 4 is interesting, since the "args" mechanism is not the only way benchmarks are parameterized. They can also be parameterized by template args, or passed arbitrary values at run time by way of RegisterBenchmark or its helper macro wrappers.
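
For example (a sketch; the benchmark bodies and names are placeholders), neither the template argument nor the captured run-time value below reaches the report as structured data, only through the generated name:

    #include <benchmark/benchmark.h>
    #include <string>
    #include <vector>

    // Parameterized by a template argument: the chosen type appears only in the
    // generated benchmark name.
    template <typename Container>
    static void BM_Construct(benchmark::State& state) {
      for (auto _ : state) {
        Container c;
        benchmark::DoNotOptimize(c);
      }
    }
    BENCHMARK_TEMPLATE(BM_Construct, std::vector<int>);

    // Parameterized by an arbitrary run-time value via BENCHMARK_CAPTURE, one of
    // the helper macro wrappers around RegisterBenchmark: the value is reflected
    // only in the chosen name suffix "uniformly_random".
    static void BM_Access(benchmark::State& state, const std::string& strategy) {
      for (auto _ : state) {
        benchmark::DoNotOptimize(strategy);
      }
    }
    BENCHMARK_CAPTURE(BM_Access, uniformly_random, std::string("uniformly_random"));

    BENCHMARK_MAIN();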

A good solution here could also allow a benchmark to provide a semantic name to its numeric "args". E.g. byte size, node count in a tree data structure, etc.

Thinking aloud, a benchmark could declare the semantic meaning of its args (pseudocode):

template <typename MyDatastructure>
void BM_Whatever(benchmark::State& state, MyAccessPattern access_pattern) {
  // state.range(0) is the existing API for reading a numeric arg;
  // SetArgName/SetParameterInt/SetParameterStr are the hypothetical additions.
  const int buffer_size = state.range(0);
  state.SetArgName(0, "buffer_size");
  state.SetParameterInt("buffer_size", buffer_size);
  state.SetParameterStr("datastructure", MyDataStructureName<MyDatastructure>::kName);
  state.SetParameterStr("strategy", MyAccessPatternToString(access_pattern));

  std::vector<char> buffer(buffer_size);
  ...
}

The simple arguments field proposed above is useful, but perhaps there is also room for key/value parameters. For the above example the JSON could look like this:

  "arguments": [1024],
  "parameters": [ { "buffer_size": 1024 }, { "datastructure": "vector" }, { "strategy": "uniformly_random" } ]

For any unnamed arguments, perhaps the library would still include them in parameters:

  "arguments": [1024, 64],
  "parameters": [ { "buffer_size": 1024 }, { "arg1": 64 }, { "datastructure": "vector" }, { "strategy": "uniformly_random" } ]

This way it becomes trivial to stringify the parameterizations from the JSON; i.e., there should be a 1-to-1 relation between unique run_name strings and unique parameters arrays.

Two questions I have:

  1. Is this over engineered? Likely to be little used, so not worth it?
  2. Should the APIs for naming the parameterizations be on benchmark::internal::Benchmark instead? I tend to think yes, since it would allow for an API to automagically come up with a unique benchmark run_name.

LebedevRI commented 1 year ago

The question is, what's the end use case? I think in general it is bad to enshrine in the API something that is going to be rather obscurely used; instead, some generalization should be provided that solves the bigger problem. There's already a global AddCustomContext(). Perhaps having the same thing with different scopes would be sufficient?
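
For reference, a minimal sketch of that existing global mechanism (the benchmark and the "build_type" key are placeholders); as I understand it, the pair is emitted once under the top-level "context" object of the JSON report rather than per run:

    #include <benchmark/benchmark.h>

    static void BM_Noop(benchmark::State& state) {
      for (auto _ : state) {
      }
    }
    BENCHMARK(BM_Noop);

    int main(int argc, char** argv) {
      // Global, run-wide key/value pair; must be added before the run starts.
      benchmark::AddCustomContext("build_type", "release");

      benchmark::Initialize(&argc, argv);
      benchmark::RunSpecifiedBenchmarks();
      benchmark::Shutdown();
      return 0;
    }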

matta commented 1 year ago

I think an AddCustomContext() on the benchmark::internal::Benchmark class is a good idea. I will file a separate issue for that.