rosetta-rs / argparse-rosetta-rs

Comparing argparse APIs
MIT License

Switch to `cargo bench` #35

Closed LoganDark closed 2 years ago

LoganDark commented 2 years ago

Python is a barrier to running and working on these tests, especially since the scripts look fairly complicated. cargo bench offers nice results like this out of the box:

[image: example cargo bench output]

I would recommend moving there and ditching the hacky Python scripts.
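For illustration, here is a minimal sketch of what a cargo bench target could look like, assuming Criterion as the harness (the argument list and the closure body are placeholders, not code from this repo):

```rust
// benches/parse.rs -- hypothetical example, not this repo's actual benchmark.
// Cargo.toml would need:
//   [dev-dependencies]
//   criterion = "0.5"
//   [[bench]]
//   name = "parse"
//   harness = false
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn bench_parse(c: &mut Criterion) {
    // Placeholder arguments; a real bench would exercise each parser crate.
    let args = ["app", "--verbose", "--output", "file.txt", "input.txt"];
    c.bench_function("parse", |b| {
        b.iter(|| {
            // Stand-in for calling the parser under test.
            black_box(args.iter().filter(|a| a.starts_with("--")).count())
        })
    });
}

criterion_group!(benches, bench_parse);
criterion_main!(benches);
```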

(P.S. this is because I want to add getargs 0.5.0 to the tests after it releases, but I won't do that until I can actually run/test the benchmark, which I can't/won't do if it requires Python as a shell-script replacement.)

epage commented 2 years ago

cargo bench is not sufficient for our needs. Besides parse time, we also care about binary size, build times, and a little behavior validation.

We could switch the runtime benchmarks to cargo bench; testing end-to-end performance isn't that big of a deal.

But overall, unless we have sufficiently justified additional benchmark use cases, I'm thinking changing this isn't worth it.

LoganDark commented 2 years ago

Is there any way for you to support (even partially) using cargo bench for users who may not want to, or be able to, use Python?

epage commented 2 years ago

We could implement a custom test runner to do all of the behavior of the python script but that is a lot of work for little return.
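For what it's worth, with `harness = false` a bench target is just a plain binary, so a custom runner could be as simple as a hand-rolled timing loop that writes machine-readable output. A rough sketch (hypothetical file name and output format, and it reimplements none of the existing scripts' logic):

```rust
// benches/runner.rs -- hypothetical sketch, not the repo's actual harness.
// Cargo.toml would declare it with: [[bench]] name = "runner", harness = false
use std::time::Instant;

fn main() {
    let samples: u32 = 10_000;
    let args = ["app", "--verbose", "--output", "file.txt", "input.txt"];

    let start = Instant::now();
    for _ in 0..samples {
        // Stand-in for invoking the parser under test.
        std::hint::black_box(args.iter().filter(|a| a.starts_with("--")).count());
    }
    let elapsed = start.elapsed();

    // Emit one machine-readable line that a separate summarizing step
    // could collect into the runs folder (the format is invented here).
    println!("parse,{}", elapsed.as_nanos() / u128::from(samples));
}
```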

LoganDark commented 2 years ago

Would you accept a PR? (If so, could you explain what your requirements are from the test runner?)

epage commented 2 years ago

If the PR is a custom test runner to re-implement all of the current behavior and continue to provide output fit for the README, I would be open to seeing how well it works out. Otherwise, not really, as it'd run into the problems I had mentioned earlier.

You mentioned all of this is for adding getargs to the benchmark. That is also a whole separate conversation to have considering our acceptance criteria:

> To be included, the crate needs to meet one of the following criteria:
>
> - 10k+ recent downloads
> - Unique API design

getargs doesn't have the recent download count, and it looks to be in the same API design space as lexopt and clap_lex.

In looking at the getargs API to see if it offers something unique, I'd like to update my earlier statement:

> We could switch the runtime benchmarks to cargo bench; testing end-to-end performance isn't that big of a deal

An important aspect of end-to-end benchmarks is that they reflect how most end-users will use the API, and care would be needed to make cargo bench benchmarks reflect common real-world use cases.

getargs looks to be optimized for in-memory, &str, operations rather than for interacting with std::env::args_os. In naive benches, this means that getargs will not do any allocations. Most argument parsers assume they'll be taking ownership of the result from std::env::args_os and will need to allocate when that isn't the case, like in naive benches.
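To make the allocation difference concrete, here is a rough std-only illustration (not tied to any of the crates being compared): the real-world path has to materialize owned Strings before parsing even starts, while a naive bench can hand the parser &str slices that already live in memory:

```rust
fn main() {
    // Real-world path: std::env::args() yields owned Strings, one heap
    // allocation per argument, before any actual parsing happens.
    let owned: Vec<String> = std::env::args().collect();
    let real_flags = owned.iter().filter(|a| a.starts_with("--")).count();

    // Naive-bench path: the arguments are already in memory as &str slices,
    // so a parser built around borrowed input has nothing to allocate.
    let borrowed: &[&str] = &["--verbose", "--output", "file.txt"];
    let bench_flags = borrowed.iter().filter(|a| a.starts_with("--")).count();

    println!("{real_flags} real flags, {bench_flags} bench flags");
}
```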

Results from benchmarks like those included in the getargs repo would need sufficient caveats about what they are actually measuring, as most people would assume getargs will be significantly faster for their use cases when in reality the difference will be more even.

LoganDark commented 2 years ago

> If the PR is a custom test runner to re-implement all of the current behavior and continue to provide output fit for the README, I would be open to seeing how well it works out. Otherwise, not really, as it'd run into the problems I had mentioned earlier.

Yes, the PR would be to completely phase out Python and replace it with a custom test runner that performs all of the behavior of the current scripts. As far as I can tell, the current setup consists of:

  1. A script to perform the benchmarks and write data to a file in the runs folder
  2. A script to summarize the data from a particular run in a person-readable format

A custom benchmark runner can perform the first task, but the second task would probably be better suited to something like rust-script, so you could run (and tweak) the program at any time without having to perform another benchmark. This would remove the dependency on an external scripting language.
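As a sketch of what that second piece could look like with rust-script (the run-file format below is invented for illustration; the real one would be whatever the runner writes):

```rust
#!/usr/bin/env rust-script
// Hypothetical summarizer: reads `name,value_ns` lines from a run file and
// prints a small human-readable table. Std-only, so no embedded manifest.
use std::{env, fs};

fn main() {
    let path = env::args().nth(1).expect("usage: summarize.rs <run-file>");
    let data = fs::read_to_string(&path).expect("failed to read run file");

    println!("{:<20} {:>14}", "benchmark", "parse (ns)");
    for line in data.lines().filter(|l| !l.trim().is_empty()) {
        let (name, value) = line
            .split_once(',')
            .expect("expected `name,value` lines");
        println!("{:<20} {:>14}", name, value);
    }
}
```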

I don't fully understand how the test results are built, though.

(As for the getargs conversation, I'd like to avoid derailing this issue too much, but I do understand your concerns. This is a wishlist item; I do not expect to add getargs as soon as it releases.)