Add the initial API. - Githubissues

gabi-250 commented 5 years ago

This is by no means complete, but I thought I'd ask for some comments before I proceed any further.

I added a main.rs to demonstrate a typical use of the k2 library. Note that most of the values passed to the ExperimentBuilder are entirely optional. Using k2 could be as simple as:

    let exp = ExperimentBuilder::new()
        .add_lang_impl(&[&lang_impl::HotSpot, &lang_impl::CPython])
        .add_benchmark(Benchmark::new(...))
        .build();
    match exp.run() {
        Ok(result) => {
            CsvWriter::write(&Path::new("./result.csv"), &result);
            KrunV1Writer::write(&Path::new("./result.krun"), &result);
        }
        Err(error) => {
            // Report/handle the error.
        }
    }

Here is the breakdown of the library:

config.rs: Contains the Config structure. Each Experiment has a Config instance, which indicates how to run the experiment (whether it must run in quick mode, whether to reboot between pexecs, etc). The Config can either be generated from a .toml file (using Config::from_file(...)), or it can be manually populated with values. The parse_args function from main.rs shows one way the Config could be created from command-line arguments, although I think most people would find the former option easier (i.e. using Config::from_file).
error.rs: Contains the errors that might occur when running an Experiment.
experiment.rs: The ExperimentBuilder is used to build and configure the Experiment. The Experiment is actually the experiment runner. This also contains the Benchmark structure, which is supposed to indicate the path of a particular set of benchmarks. I'm not sure which of expb.add_benchmark(["/path/to/benchmark", "arg1", ..., "argn"]) and expb.add_benchmark(Benchmark::new(...)) would be easier to use... Any thoughts on this?
lang_impl.rs: Contains the languages supported by k2, and their implementation of the LangImpl trait (as discussed, each language must implement LangImpl).
limit.rs: A helper enum to indicate stack/heap limits. I'm not sure we actually need this.
writer.rs: contains a *Writer for every output format supported. The user of the library has to use these helpers to generate the result file(s). Alternatively, these could be used internally by k2.

I haven't thought much about the internal format for k2, because I wanted to come up with an API first.

I'm sure I got much of this wrong, so I can't wait to hear your thoughts!

ltratt commented 5 years ago

First of all, I'm really sorry it's taken me so long to even start look at this :(

It's going to take me a while to fully think through all of this because you've got a lot more done already than I expected (which is good)! One idea that's just occurred to me is that we might be able to abstract over things like benchmark names. The problem at the moment is that things like benchmark and runtime names are strings and so that means they're easy to get wrong: bench and Bench are different, for example, even though we found from Krun that people often intermix them. So maybe we can use AsRef in such places so that you can pass in a &str or a Benchmark (or Runtime or whatever), and when we create a Benchmark we could normalise the names (lower case etc.). @vext01 might have some thoughts on this!

ltratt commented 5 years ago

My next overall thought is what we want to do with lang_impls. In Krun we initially made a new one of these for every VM, but I now wonder if we should consider the generic case the "normal" one. At a basic level, a runtime is just a path to a binary (and maybe command-line args) which we call with the path of a benchmark. That would, I think, deal with everything except what Krun called "instrumented" mode, but I might be very wrong?

vext01 commented 5 years ago

Some high-level comments, having only read the description:

What should the user do if they don't want to marshall to a third-party results format? Just discard the result of run()?

The Config can either be generated from a .toml file

I was expecting the Rust program itself to serve as a kind of config file. I could be persuaded otherwise, but using just the Rust program means we have one unified configuration method, and thus less code.

I'm not sure which of expb.add_benchmark(["/path/to/benchmark", "arg1", ..., "argn"]) and expb.add_benchmark(Benchmark::new(...)) would be easier to use

You are going to want to allow things like setting environment variables on a per-benchmark basis, so perhaps the second? Another option is a fluent builder interface. I'm not sure what's best.

limit.rs: A helper enum to indicate stack/heap limits. I'm not sure we actually need this.

My only request is that we have a way to express limits in terms of different (user-specified) units. Only the other day I accidentally gave a benchmark 18TB of heap. Krun makes it easy to make these mistakes by insisting on limits in KB, whereas you might expected the unit to be bytes.

vext01 commented 5 years ago

In Krun we initially made a new one of these for every VM, but I now wonder if we should consider the generic case the "normal" one. At a basic level, a runtime is just a path to a binary (and maybe command-line args) which we call with the path of a benchmark.

Yep, we did actually realise that when making Krun. There's a GenericScriptingVMDef that covers a lot of VM defs which are invoked like ./interpreter script-file arg1 arg2...:

https://github.com/softdevteam/krun/blob/da9e46f1207a4ba99df6ae896f8fc24036b648dc/krun/vm_defs.py#L372

Many VM defs use this via sub-classes, thus re-using the core functionality.

There's a also a NativeCodeVMDef for languages which compile to ELF binaries (thus don't need an interpreter path):

https://github.com/softdevteam/krun/blob/da9e46f1207a4ba99df6ae896f8fc24036b648dc/krun/vm_defs.py#L347

These concepts worked well in Krun, and I suggest we do have something like this in K2. However, I suggest we don't require the user to make an explicit type for each language implementation; they can either instantiate the appropriate "generic" struct, or use a type alias.

On a related note:

.add_lang_impl(&[&lang_impl::HotSpot, &lang_impl::CPython])

Each instance of a language implementation should also have a string name to identify it. We may have several different CPythons in our experiment. I suggest canonicalising the name to lowercase as I can never remember capitalisations.

vext01 commented 5 years ago

.add_lang_impl(&[&lang_impl::HotSpot, &lang_impl::CPython])

Similarly, if we are going to have multiple (e.g.) CPythons, we are going to need the ability to specify the path to the different interpreters. We may also want to use the same CPython many times, just using a different environment, or interpreter argument.

I wonder if we need to separate language implementations from their "invocations" (arguments, environment, limits(?))?

gabi-250 commented 5 years ago

Thanks for the comments!

gabi-250 commented 5 years ago

@ltratt

The problem at the moment is that things like benchmark and runtime names are strings and so that means they're easy to get wrong: bench and Bench are different, for example, even though we found from Krun that people often intermix them. So maybe we can use AsRef in such places so that you can pass in a &str or a Benchmark (or Runtime or whatever), and when we create a Benchmark we could normalise the names (lower case etc.). @vext01 might have some thoughts on this!

If we allow &str and Benchmark to be used interchangably, does that mean all benchmarks must be located in the same, predefined directory (for example, the one specified through the K2_BENCH environment variable)?

gabi-250 commented 5 years ago

@vext01

What should the user do if they don't want to marshall to a third-party results format? Just discard the result of run()?

I would just discard it. Do you think it would make more sense for run itself to generate the results file based on the output format specified to the ExperimentBuilder? For example:

let exp = ExperimentBuilder::new()
        .add_benchmark(Benchmark::new(...))
        .output(&[OutputFormat::Csv, OutputFormat::KrunV1]) // like this
        .build();

The Config can either be generated from a .toml file

I was expecting the Rust program itself to serve as a kind of config file. I could be persuaded otherwise, but using just the Rust program means we have one unified configuration method, and thus less code.

Ok, I suppose that makes sense!

I'm not sure which of expb.add_benchmark(["/path/to/benchmark", "arg1", ..., "argn"]) and expb.add_benchmark(Benchmark::new(...)) would be easier to use You are going to want to allow things like setting environment variables on a per-benchmark basis, so perhaps the second? Another option is a fluent builder interface. I'm not sure what's best.

Unless there are many parameters to set for each benchmark, I don't think a builder interface is necessary.

ltratt commented 5 years ago

If we allow &str and Benchmark to be used interchangably, does that mean all benchmarks must be located in the same, predefined directory (for example, the one specified through the K2_BENCH environment variable)?

Good question. Maybe my suggestion wasn't a good one. That makes me wonder what the right way to handle identifiers is. For example k2 might want to allow people to dynamically specify a subset of benchmarks to run, so there's some notion of identifiers. Should we tie that to the file system? I'm not sure. Could you have two benchmarks called b.py in two different directories? Yes. Should we allow that? Dunno...

vext01 commented 5 years ago

@gabi-250 said:

Do you think it would make more sense for run itself to generate the results file based on the output format specified to the ExperimentBuilder.

I'm not sure what you mean. I'm OK with "just doing nothing" if you want only our internal format.

But we should make sure we can take existing results in internal format and convert them to other formats.

If we allow &str and Benchmark to be used interchangeably Passing a string to configure a benchmark is not going to be the most flexible. We might want to configure things like environment on a per-benchmark basis. Krun didn't support this, but it was never an issue (yet).

@ltratt said:

For example k2 might want to allow people to dynamically specify a subset of benchmarks to run, so there's some notion of identifiers. Should we tie that to the file system? I'm not sure. Could you have two benchmarks called b.py in two different directories? Yes. Should we allow that? Dunno...

Since we are considering the Rust program as a kind of config file in it's own right, running a subset doesn't need to be an explicit feature. It can be dealt with at the Rust language level. Perhaps we pass an iterable of benchmark instances into the k2 library, and the user can use functional combinators like std::iter::filter to exclude the benchmarks they don't want? Or something similar?

Should we tie that to the file system?

FWIW, Krun did expect benchmarks in a specific sub-folder. It never really caused any problems since that's typically what you want. Then again, it'd be more flexible to allow benchmarks to be in any path. I'm indifferent.

(But I do want the ability to have multi-file benchmarks in their own sub-folders please)

Could you have two benchmarks called b.py in two different directories? Yes. Should we allow that? Dunno...

I think we should allow that.

ltratt commented 5 years ago

(But I do want the ability to have multi-file benchmarks in their own sub-folders please)

What's a multi-file benchmark? Can't we always assume that every benchmark has a main file of sorts?

vext01 commented 5 years ago

It's a benchmark spread over many files.

This is common in java land, where they tend to put one class per file.

We had a few of these in the warmup experiment.

In such scenarios you don't want one large directory full of mixed benchmark files as it makes reviewing difficult and makes clashing more likely.

On Sat, 20 Jul 2019, 14:26 Laurence Tratt, notifications@github.com wrote:

(But I do want the ability to have multi-file benchmarks in their own sub-folders please)

What's a multi-file benchmark? Can't we always assume that every benchmark has a main file of sorts?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/gabi-250/k2/pull/1?email_source=notifications&email_token=AAETWG4ZNU5QMKPSQHYSRRTQAMHBNA5CNFSM4IDRN4WKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2NOJCQ#issuecomment-513467530, or mute the thread https://github.com/notifications/unsubscribe-auth/AAETWG4GUSVKCVCCEAOGPVLQAMHBNANCNFSM4IDRN4WA .

ltratt commented 5 years ago

Why can't such benchmarks just be put in their own directory?

vext01 commented 5 years ago

That's exactly what I'm proposing ;)

On Sat, 20 Jul 2019, 15:32 Laurence Tratt, notifications@github.com wrote:

Why can't such benchmarks just be put in their own directory?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/gabi-250/k2/pull/1?email_source=notifications&email_token=AAETWGZ7B33ZRKUX5MJIUO3QAMOWLA5CNFSM4IDRN4WKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2NPPTQ#issuecomment-513472462, or mute the thread https://github.com/notifications/unsubscribe-auth/AAETWG2DRUKKASFYWNF4CG3QAMOWLANCNFSM4IDRN4WA .

ltratt commented 5 years ago

OK, so can we simplify things to the following: every benchmark has an identifier, and that identifier is the filename of the "main" (or "entry" or whatever we want to call it) file for that benchmark. That avoids duplication; means we can use normal globbing to match benchmarks when we want to; and allows users to structure their directories however they want.

vext01 commented 5 years ago

OK, so can we simplify things to the following: every benchmark has an identifier, and that identifier is the filename of the "main" (or "entry" or whatever we want to call it)

I think that works (as long as we say which language implementations each entry point is for, somehow), although passing around the whole path might be a bit clumsy? Example IDs of this form for the warmup experiment would be:

benchmarks/binarytrees/java/KrunEntry
benchmarks/spectralnorm/lua/bench.lua

Would it be better to pass a short string as the ID, then the path to the entry point and the language implementation to run on as a separate piece of data? So we'd use the following IDs instead:

binarytrees
spectralnorm

Perhaps it'd look like this:

let my_hotspot = lang_impl::HotSpot::new(...); // where ... is flags and environment.
let My_graal = lang_impl::Graal::new(...);
Benchmark::new("binarytrees", [my_hotspot, my_graal], "benchmarks/binarytrees/java/KrunEntry");

In general, everything needs a string ID, so it can be addressed in the results.

I'm wondering if we should be using fluent interfaces wherever possible, as they allow us to have implicit sensible defaults. For example above when defining instances of languages, flags and environment can be omitted for defaults.

I also question if add_lang_impl() (in Gabi's original design) is necessary:

    let exp = ExperimentBuilder::new()
        .add_lang_impl(&[&lang_impl::HotSpot, &lang_impl::CPython])
        ...

The language implementations (or rather instances of them) can be enumerated from the list of benchmarks. In other words, we will need to tell K2 which language implementations to run each benchmark on anyway, so there's no need to explicitly list them.

[In Krun we had the notion of a "benchmark variant", which we initially included for language composition purposes. In hindsight, I don't think it's necessary, as variants can simply be benchmarks in their own right.]

ltratt commented 5 years ago

passing around the whole path might be a bit clumsy?

It's up to the user if they pass an absolute or relative pathname, so I don't think this is too awful.

You raise an interesting question which is whether we should view a benchmark B as having multiple language implementations; or whether we should view each as separate. In other words, if I have B.py and B.rs, should we explicitly consider that the benchmark B in Python and Ruby, or two unconnected benchmarks? I think the latter is simpler to work with, and that we only encounter issues when we come to comparing benchmarks (at which point we can probably find some nice way of allowing people to link different-language benchmarks together in the stats tool).

vext01 commented 5 years ago

It's up to the user if they pass an absolute or relative pathname, so I don't think this is too awful.

I gotta say, compared to Krun, I find even passing the relative path clunky. I much prefer a short string identifier.

In other words, if I have B.py and B.rs, should we explicitly consider that the benchmark B in Python and Ruby, or two unconnected benchmarks?

FWIW, the way krun does it, it assumes there is a version of each benchmark for every language in the experiment (unless there is a "skip" entry). I think we can do better than this, however, I don't think we should consider every implementation a separate benchmark, as it will make the results less structured.

If we have 2 benchmarks, b1 and b2, and two language impls, l1 and l2, and we consider everything a separate benchmark, the structuring of the results would be something like:

results/
- b1-l1/
- b1-l2/
- b2-l1/
- b2-l2/

In other words it's a flat structure. To k2 there are 4 separate unrelated benchmarks. We are going to want to do things like iterate over benchmarks regardless of which language, i.e. over [b1, b2]. Under the flat model, you'd have to play string manipulation games to find the benchmark grouping.

I'd prefer the have structure like this:

results/
- b1/
- l1/
- l2/
- b2/
- l1/
- l2/

Here the grouping of the benchmarks is clear.

Thoughts?

ltratt commented 5 years ago

I find even passing the relative path clunky.

Note that I'm implicitly assuming that, after defining a benchmark, people use globbing to reference it. I should have been clear about that.

I don't think the way we structure a results directory (if we even have a results directory) should influence things. Personally, I think people will sometimes want to order results by implementation and sometimes by benchmark. In the warmup paper we do both (the "main table" is ordered by benchmark; things like DaCapo are ordered by implementation). That sort-of suggests that a flat layout might be best, and we then come up with some cunning way to group benchmarks as desired when analysing stats.

vext01 commented 5 years ago

Note that I'm implicitly assuming that, after defining a benchmark, people use globbing to reference it. I should have been clear about that.

I find that somewhat odd, and I think that will make it easy to accidentally match something they don't mean to. Why not keep it simple?

I think people will sometimes want to order results by ...

The structure isn't really about ordering, it's about semantic grouping.

In general (and drawing from my experience with Krun) it's better to make things very simple and explicit.

ltratt commented 5 years ago

In general (and drawing from my experience with Krun) it's better to make things very simple and explicit.

I'm fine with the "explicit" reasoning, but predefined grouping definitely isn't simple. As I said, even in the warmup paper, we group things differently at different points...

vext01 commented 5 years ago

I'm fine with the "explicit" reasoning, but predefined grouping definitely isn't simple. As I said, even in the warmup paper, we group things differently at different points...

Maybe my example wasn't the best. An arbitrary grouping/ordering (however the user wants) can more easily be constructed by retaining the conceptual link between a benchmark and its different implementations.

All I'm saying is: it'd be better if we didn't consider different implementations of the same benchmark as different benchmarks.

Hopefully we can agree the merit of this at least?

vext01 commented 5 years ago

[You could image having different library-level iterators that iterate over things in different groupings (by benchmark/by language), but the link between a benchmark and its implementations needs to be there]

ltratt commented 5 years ago

Edd and I have discussed this on MM. We haven't fully come to a conclusion, but I think we've got a lot further in our thinking. I'm going to present my POV, so please don't assume that this necessarily represents his thoughts!

Fundamentally, I think a "benchmark" should be a reference to a single file, which (via its path) also provides a handy human-readable ID: conceptually all benchmarks are stored together (so we don't store them using any particular structural concept). For safety across multiple machines we should probably consider the ID a (machine, path) tuple. When represented as text we can use ssh format i.e. machine:path for an ID.

We then get into the trickier question of what extra information one records about a benchmark. My proposal is that we probably need to allow user-extendable tags. For example, if I want to group benchmarks in multiple languages together, I might want a tag abstract_benchmark_name and the two (separate) benchmarks B.py and B.rs will then both have the same abstract_benchmark_name=B. Some people might also care when a benchmark was run, or what memory it used etc. etc. etc.

How should we represent tags? The easiest way is a HashMap, which allows people to extend them at will, though it does bypass Rust's type system. Perhaps that's the best solution, I'm not sure. An alternative might be a Benchmark trait which has get_tag_value(key: &str) -> &str and matches_tag(key: &str, val: &str) -> bool where the latter can choose to do globbing (or whatever) in how it choose to match a tag. The disadvantage of this approach is that there can no longer be a single Benchmark struct, and users would have to paramaterise functions by their particular implementation of the Benchmark trait.

Whichever way we go, at some point this should allow us to have other clients of k2 which can do things like say "process all benchmarks XYZ and output stats for them" e.g.:

$ warmup_stats2 --output-table table.pdf --by-id */B.*

or:

$ warmup_stats2 --output-table table.pdf --by-tag abstract-benchmark-name B

or even:

$ warmup_stats2 --output-table table.pdf --by-tag date 2019-07-20-

which (assuming an appropriate match_tag) would mean "output stats for all benchmarks run in the last two days".

Comments welcome!

vext01 commented 5 years ago

Scattered thoughts:

For safety across multiple machines we should probably consider the ID a (machine, path) tuple.

I'm a little confused by this bit. So are we expecting actual implementations of the benchmarks to vary across machines? I'd expect the benchmarks to be the same across all machines.

I agree we will need (as Krun did) a way of combining multiple sets of results and then addressing pexecs from different machines, but I don't see why that has anything to do with identifying a benchmark.

I'm perhaps warming to Laurie's path globbing and tags idea, but I'm still not totally sold yet.

I'd like to see the Rust code for an experimental setup using this scheme before I can really decide if it's going to be too clumsy or not. Perhaps we can port one of the Krun example setups and use it as a case study?

E.g.: https://github.com/softdevteam/krun/blob/master/examples/example.krun

--by-tag --by-id

Can you specify them both? If so, is that an AND or an OR? Would a user conceivably ever need both AND and OR?

ltratt commented 5 years ago

For clarity: globbing is part of my proposal but it's not all of it. With the trait I suggested earlier, you can use globbing if an implementation of the trait supports it: but, equally, if you know the concrete Rust struct S implementing the trait, you can just access attributes normally. In other words, I'm sort-of proposing a superset of behaviour, not the subset that I think you're assuming?

I'm a little confused by this bit. So are we expecting actual implementations of the benchmarks to vary across machines? I'd expect the benchmarks to be the same across all machines.

Maybe you're right that -- the machine information could just be a tag. That gives us more flexibility in the longer term.

The only problem is that there's no guarantee that paths will be the same on different machines. Maybe this is one worry too many?

Can you specify them both?

That's a design decision. At first I would probably disallow specifying both, for simplicities sake, and only add it later if there's a good use case.

gabi-250 commented 5 years ago

(Sorry for the late reply!)

@ltratt Having read the whole conversation, I'm still not sure I totally understand how the tagging/globbing is going to work.

Note that I'm implicitly assuming that, after defining a benchmark, people use globbing to reference it. I should have been clear about that.

So, if, for example, you want to skip running benchmark b (uniquely identified by the path of its entry point), would you use globbing like this?

let my_hotspot = lang_impl::HotSpot::new(...); // where ... is flags and environment.
let My_graal = lang_impl::Graal::new(...);
// The unique identifier of `b` is "./benchmarks/binarytrees/java/KrunEntry"
let b = Benchmark::new("binarytrees", [my_hotspot, my_graal], "./benchmarks/binarytrees/java/KrunEntry");

// Add b to the experiment, then skip b
expb.add_benchmark(b).add_skip("*binarytrees*java");

Or would it be better to define a tag for b, and always skip benchmarks based on the values of their tags?

ltratt commented 5 years ago

It's confusing because my/our ideas evolved over the course of the conversation :)

Maybe it's better to first talk about what we want to achieve, rather than mechanisms for achieving that. Here's my basic thoughts:

1) We need to have a clear notion of what "a benchmark is" and a mechanism for uniquely referring to it. 2) Users of k2 will want to record arbitrary data about benchmarks (i.e. they'll want to record things we won't have thought of). 3) Once data has been recorded, people will want to analyse it, and they'll often want to do that at the CLI. We need to give them a way to access parts of the data easily.

Paths to the "main" function of each language's benchmark are my proposed solution for 1), tags are my proposed solution for 2) and globbing my proposed solution for 2). There may be better ways of doing each, though!

That means that I think we need to tweak the last line of your code a bit because I don't think add_skip quite makes sense. Basically if the k2 user doesn't want to run a benchmark (or doesn't want to use one of the runtimes), I think it's beholden on them not to add it to the system. But perhaps that's a bit naive of me.

One thing Edd said earlier is starting to haunt me: I think we are heading in the direction of making a sort of ad-hoc database...

gabi-250 commented 5 years ago

Thanks, it's a lot clearer now!

I've just pushed my second attempt at this, and main.rs looks something like this:

fn main() {
    let python_bin = find_executable("python");
    let pypy_bin = find_executable("pypy");
    let luajit_bin = find_executable("luajit");
    let expb = setup();
    let cpython = GenericScriptingVm::new(Lang::Python,
                                          Path::new(&python_bin));
    let pypy = GenericScriptingVm::new(Lang::Python,
                                       Path::new(&pypy_bin));
    let luajit = GenericScriptingVm::new(Lang::Lua,
                                       Path::new(&luajit_bin));
    let python_bench = SimpleBenchmark::new(
        Path::new("./benchmarks/binarytrees/binarytrees.py"))
            .tag("abstract_benchmark_name", "binarytrees")
            .add_lang_impl(&cpython)
            .add_lang_impl(&pypy);
    let lua_bench = SimpleBenchmark::new(
        Path::new("./benchmarks/binarytrees/binarytrees.lua"))
            .tag("abstract_benchmark_name", "binarytrees")
            .add_lang_impl(&luajit);
    let exp = expb
        .add_benchmark(&python_bench)
        .add_benchmark(&lua_bench)
        .build();
    // `run` outputs the result in the k2 internal format.
    let _ = exp.run().expect("Failed to run the experiment");
}

fn setup<'a>() -> ExperimentBuilder<'a> {
    let expb = parse_args(ExperimentBuilder::new());
    // These could've been command-line arguments too.
    expb.pexecs(2)
        .stack_lim(Limit::KiB(8.192))
        .heap_lim(Limit::GiB(2.097152))
}

fn parse_args(expb: ExperimentBuilder) -> ExperimentBuilder {
    // Parse the args and create a `Config`.
    let matches = App::new("k2")
        .arg(Arg::with_name("quick")
                .short("q")
                .long("quick")
                .help("Run the benchmarks straight away. For development only."))
        ...
    expb.quick(matches.is_present("quick"))
        .dry_run(matches.is_present("dry-run"))
        .reboot(matches.is_present("reboot"))
}

This is supposed to be the k2 equivalent of example.krun. Hopefully this is somewhat closer to what you were expecting.

I implemented @ltratt's idea to use tags to record arbitrary data about benchmarks. A benchmark is defined as follows:

pub struct Benchmark<'a> {
    /// The unique identifier of the benchmark.
    id: PathBuf,
    /// Additional information.
    tags: HashMap<String, String>,
    /// The language implementations to run on.
    lang_impls: Vec<&'a dyn LangImpl>
}

I feel like 3) needs to be implemented in warmup_stats2 rather than k2, but I could be wrong. I suppose k2 only needs to dump the id and the tags along with the results.

A 'language implementation' is just an instantiation of GenericScriptingVm (or GenericNativeCode):

pub struct GenericScriptingVm {
    /// The language implemented by this VM.
    lang: Lang,
    /// The path of the interpreter.
    interp_path: PathBuf,
    /// The environment to use when running the VM.
    env: HashMap<String, String>
}

Both GenericScriptingVm, and GenericNativeCode implement the LangImpl trait:

pub trait LangImpl {
    fn results_key(&self) -> &str;
    fn target_lang(&self) -> Lang;
    fn invoke(&self, args: Vec<String>);
}

Does this seem reasonable?

ltratt commented 5 years ago

I think this is starting to look good! A few initial comments:

I'm not sure the lang item in GenericScriptingVM is necessary: this can just be a tag, if people care about it? Put another way, I think the idea of the "generic" part of the name is that it doesn't have any language specific support.

I feel like 3) needs to be implemented in warmup_stats2 rather than k2, but I could be wrong.

That's my thinking too.

Can we use AsRef so that people can mostly pass in &str instead of Path? If so, I think that might simplify the code quite a bit.

gabi-250 commented 5 years ago

Thanks for the comments!

I'm not sure the lang item in GenericScriptingVM is necessary: this can just be a tag, if people care about it? Put another way, I think the idea of the "generic" part of the name is that it doesn't have any language specific support.

You're right, I'll remove it.

Can we use AsRef so that people can mostly pass in &str instead of Path? If so, I think that might simplify the code quite a bit.

I changed all the new functions to receive &str instead of &Path arguments, because I think AsRef is unnecessary here. Let me know if you think otherwise.

vext01 commented 5 years ago

Hi Gabi,

We've just had another chat about the way in which we represent benchmarks and their data and we think we can generalise even further, but we'd like to hear your opinions too.

Under the current model, a benchmark is a path and a set of tags. What we asked ourselves is "why is the path special" (and not a tag, like all other data?). So instead of:

SimpleBenchmark::new(
        Path::new("./benchmarks/binarytrees/binarytrees.py"))
            ...

We can have something like:

SimpleBenchmark::new()
            .tag("filename", "./benchmarks/binarytrees/binarytrees.py")
            ...

In other words, why not "everything is a tag"? Under this model, the user then has ultimate flexibility over what the "primary key" is.

We could even go one step further and get rid of even language implementations in favour of tags. Ultimately a language implementation boils down to:

An ordered collection of command line arguments.
An environment.

We think those could be two more (by convention) tags like any other. And going one step more, even the results could be tags, which would allow us to query the actual timings under the same framework as above.

[Of course, we will probably have some tags which are "by convention" (like the filename)]

What do you think?

[I notice that we don't yet have a way to communicate how many pexecs and IPIs (in-process iterations)]

[Next we are going to think about storage. The only thing we've really thought about deeply with regards to that is that we want the data to conceptually be a list (or set) of a set of tags]

gabi-250 commented 5 years ago

Hi,

Under the current model, a benchmark is a path and a set of tags. What we asked ourselves is "why is the path special" (and not a tag, like all other data?). So instead of:
SimpleBenchmark::new(
        Path::new("./benchmarks/binarytrees/binarytrees.py"))
            ...
We can have something like:
SimpleBenchmark::new()
            .tag("filename", "./benchmarks/binarytrees/binarytrees.py")
            ...
In other words, why not "everything is a tag"? Under this model, the user then has ultimate flexibility over what the "primary key" is.

This sounds reasonable enough. I guess the path might as well be a tag.

We could even go one step further and get rid of even language implementations in favour of tags. Ultimately a language implementation boils down to:
* An ordered collection of command line arguments.

* An environment.
We think those could be two more (by convention) tags like any other.

What would this type of tag look like?

We somehow need to map the environment, and the command-line arguments to the language implementation they were intended for. Should we (for example) reserve tags of the form lang_impl:<name>:args and lang_impl:<name>:env (and perhaps lang_impl:<name>:bin) for this purpose (where name is the identifier of the language implementation)?

I imagine this would look something like this:

 SimpleBenchmark::new()
     .tag("lang_impl:pypy:env", "PYPYLOG=jit-log-opt:log")
     .tag("lang_impl:pypy:bin", "/usr/bin/pypy")
     .tag("lang_impl:pypy:args", "arg1 arg2")
      ...

Unless I've misunderstood what you are suggesting, I'm not sure this is any better than the original approach. Having to parse the key of a tag to extract information out of it feels a bit awkward. Or do you think that if we keep the structure of the keys simple, parsing them won't ever be a problem? I don't feel too strongly about this, but I think having to remember all the keys reserved by convention is somewhat harder than learning how to use a more complex API. What do you think?

And going one step more, even the results could be tags, which would allow us to query the actual timings under the same framework as above.

[Of course, we will probably have some tags which are "by convention" (like the filename)]

What do you think?

I think this could work. Again, I am not entirely sure what a 'timings' tag would look like. Would it contain all the timings for a particular benchmark? Or would we have a separate tag for each measurement?

[I notice that we don't yet have a way to communicate how many pexecs and IPIs (in-process iterations)]

I think you're right. I originally thought the number of pexecs is the same for each benchmark in the experiment, but I guess there is no reason for that constraint:

fn setup<'a>() -> ExperimentBuilder<'a> {
    let expb = parse_args(ExperimentBuilder::new());
    ...
    expb.pexecs(2)
    ...
}

Would it make sense for pexecs and IPIs to be tags instead?

ltratt commented 5 years ago

We could even go one step further and get rid of even language implementations in favour of tags.

I didn't notice this bit. I think Gabi's right that there's something importantly different with these. Tags are about recording what was done. Language implementations (and other things about setting up a benchmark) are about defining what should be done. In other words, I think we should more-or-less keep the API for setting the benchmarks up as-is: we can later record the settings that were defined using tags.

vext01 commented 5 years ago

We somehow need to map the environment, and the command-line arguments to the language implementation they were intended for. Should we (for example) reserve tags of the form lang_impl::args and lang_impl::env>...

I think there's been some crossover of benchmark setup and results storage in my thinking. In hindsight those things probably need to be separate.

At this point I think we should try something and be willing to iterate. I don't think we will get this right first time.

gabi-250 commented 5 years ago

@vext01

At this point I think we should try something and be willing to iterate. I don't think we will get this right first time.

I agree. How should we proceed from here? Do you want me to sketch out an implementation?

vext01 commented 5 years ago

Do you want me to sketch out an implementation?

I guess we begin by making a best attempt to integrate the discussion points above. Anything we are not sure of, we can do something simple and come back to it later if it isn't right. That will resolve this PR.

Then (as long as @ltratt is happy), I'd suggest the next PR starts fleshing out the implementation some more.

ltratt commented 5 years ago

Works for me.

My only suggestion is that, when this PR is done, we should probably move this repository to the softdevteam account (if Gabi is happy with doing so).

gabi-250 commented 5 years ago

Works for me.

My only suggestion is that, when this PR is done, we should probably move this repository to the softdevteam account (if Gabi is happy with doing so).

Of course!

vext01 commented 5 years ago

That'd be good! Then we get notifications.

gabi-250 commented 5 years ago

Ok, this should now be ready for review. I'm sure the API is going to change once we start working on the implementation, but I wanted to begin by merging the skeleton of the project.

A few things worth noting:

src/main.rs shows how to set up a simple (but not necessarily minimal) experiment.
Tags are currently stored in a TagStore in Benchmark. This will likely change when I start working on the implementation, as I'm not sure this is the right way to store them. Benchmarks are created using Benchmark::new(...) where ... is the path of the benchmark. The path is still going to be stored as a tag (identified using the key path), but since it's not an optional tag, you have to set it when creating the benchmark. Tags are set using the tag function of Benchmark. Since we want the results to be tags as well, after running a benchmark, the TagStore should contain the measurements recorded.
Configuration is not done using tags. The Config struct contains the configuration variables available to the user. This struct is created through the ExperimentBuilder. For example, to set the number of pexecs and the number of ipis, you would run:

    expb.pexecs(2)
        .in_proc_iters(40)

[k2 users might also want to exclude (for example) wallclock times from the results: expb.wallclock(false).]

I'm not sure if this is a limitation, but pexec/ipi counts, and stack/heap limits are part of the configuration of an experiment, which means these values are the same for each benchmark in the experiment (you can't, for example, set a different heap limit for each benchmark).

vext01 commented 5 years ago

you can't, for example, set a different heap limit for each benchmark

We should probably allow that. People have done benchmarks where they vary things like that just to see if it has an effect.

vext01 commented 5 years ago

All in all, this looks like a really solid start! Just a few comments.

vext01 commented 5 years ago

Ah! Please add a license to every file!

Assuming you are OK with the Apache license, here's one you can use as a basis: https://github.com/softdevteam/yk/blob/0ff2fb8799120c9b77660c5b6de922aa206c0f43/tiri/src/lib.rs#L1

In addition to listing our team, I think you should also list yourself personally in the license, as you are effectively contributing on you own spare time. Better check with @ltratt though.

ltratt commented 5 years ago

I'd prefer the triple licence from https://github.com/softdevteam/yksom/blob/master/src/main.rs (rather than the dual licence in yk). It doesn't worry me too much who's listed -- indeed, King's doesn't even own the copyright over Gabi's code, so she can, if she so wishes, copyright things herself.

gabi-250 commented 5 years ago

@ltratt @vext01 I don't feel strongly about the copyright to be perfectly honest: King's will eventually own the copyright for k2 anyway! I'll list myself in the license for now.

Let me know if you'd like me to change anything about the license.

gabi-250 commented 5 years ago

@vext01

We should probably allow that. People have done benchmarks where they vary things like that just to see if it has an effect.

I changed things a bit and now you can set a different stack/heap size limit for each benchmark. Also, Benchmark::new now requires users to specify the language implementation being benchmarked (in addition to the path of the actual benchmark). I think I've made the API slightly more difficult to use, but I don't really see a way around it.

Also, I think it feels a bit awkward that stack/heap limits are fields of Benchmark, as opposed to tags. I could implement Display and FromStr for Limit to allow users to pass Limit variants as tag values, but I'm not convinced it's be a good idea.

ltratt commented 5 years ago

I think "Copyright Gabi created by soft-dev" makes less sense than just "Copyright Gabi" myself :)

gabi-250 commented 5 years ago

I don't want to take credit for everything: it was your idea after all! How about I list both myself and soft-dev like in packed-vec?

ltratt commented 5 years ago

Works for me!

softdevteam / k2

Add the initial API. #1