scikit-hep / pyhf

pure-Python HistFactory implementation with tensors and autodiff
https://pyhf.readthedocs.io/
Apache License 2.0

research: time metrics with honeycomb #1115

Open kratsg opened 4 years ago

kratsg commented 4 years ago

Description

See the Python SDK: https://github.com/honeycombio/libhoney-py

Workflow I had in mind

In general, we won't merge PRs unless we can fix the slow stuff.

@ismith:

caveat: Honeycomb is ideally meant for a lookback window of no more than 2 weeks; you can set a query to look back up to two months. In events you can specify fields: you get duration_ms for ~free, and you might also get the function name for free, but you'll want to add, in config, maybe the branch name and PR [id]. Then you can do a query that creates a graph of master vs. non-master and define thresholds for yay/nay.

We do not have an automated GitHub check, so no automated enforcement, but we do offer Slack/email/PagerDuty and webhooks if you want something custom. If you blog this when you're done we'll give you stickers and maybe a t-shirt. In a Coveralls world this might be configurable as "PR is red, may not merge", same as if you failed CI; we don't offer that out of the box and I don't know that you want that, but setting it up to comment on the PR is not hard to build with a webhook.
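For reference, emitting one such timing event with the libhoney-py SDK would look roughly like the sketch below. The dataset name, the environment variables, and the branch/PR field names are illustrative assumptions rather than anything settled here; only duration_ms mirrors the field mentioned above.

```python
# Sketch: time one pyhf hypothesis test and send the measurement to Honeycomb.
# HONEYCOMB_WRITEKEY / GITHUB_REF_NAME / PR_NUMBER and the "pyhf-benchmarks"
# dataset name are hypothetical choices for illustration only.
import os
import time

import libhoney
import pyhf

libhoney.init(
    writekey=os.environ["HONEYCOMB_WRITEKEY"],
    dataset="pyhf-benchmarks",
)

model = pyhf.simplemodels.uncorrelated_background(
    signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
)
data = [51, 48] + model.config.auxdata

event = libhoney.new_event()
event.add_field("function", "pyhf.infer.hypotest")
event.add_field("branch", os.environ.get("GITHUB_REF_NAME", "unknown"))
event.add_field("pr_number", os.environ.get("PR_NUMBER", ""))

start = time.perf_counter()
pyhf.infer.hypotest(1.0, data, model)
event.add_field("duration_ms", 1000 * (time.perf_counter() - start))

event.send()
libhoney.close()  # flush any pending events before exiting
```

A Honeycomb query grouping on branch and plotting duration_ms would then give the master vs. non-master comparison described above.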

matthewfeickert commented 4 years ago

Probably also worth looking at airspeed velocity as this seems to be basically exactly what I had in mind.

matthewfeickert commented 3 years ago

This might be worth looking into if we can get an external grant to pay for us to run a small Digital Ocean or AWS instance to host this. Seems pretty valuable.

matthewfeickert commented 2 years ago

Probably also worth looking at airspeed velocity as this seems to be basically exactly what I had in mind.

NumPy and SciPy use asv for benchmarks, so it might be worth looking at how they do it.

An interesting feature is that asv can run the benchmarks on old commits, so you can build up the performance history automatically.

I think(?) this might be possible to do with just a repo over in the pyhf org that runs things on a cron job.
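For reference, a minimal sketch of what a benchmarks/benchmarks.py discovered by asv could look like for pyhf; the toy model and the choice of hypotest as the timed call are illustrative assumptions, not an agreed benchmark suite.

```python
# Sketch of an asv benchmark module: asv collects any module-level function
# whose name starts with time_ and reports its runtime per commit.
import pyhf


def time_simplemodel_build():
    # Times construction of a small two-bin model.
    pyhf.simplemodels.uncorrelated_background(
        signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
    )


def time_hypotest():
    # Times a CLs hypothesis test; note this includes model construction,
    # which a class-based benchmark with a setup() hook would factor out.
    model = pyhf.simplemodels.uncorrelated_background(
        signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
    )
    data = [51, 48] + model.config.auxdata
    pyhf.infer.hypotest(1.0, data, model)
```

With an asv.conf.json pointing at the repository, running asv over a range of commits is what would back-fill the performance history mentioned above, and a cron-scheduled job could keep it current.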

matthewfeickert commented 2 years ago

cf. also Is GitHub Actions suitable for running benchmarks?, where the answer is: yes.

matthewfeickert commented 2 years ago

And https://github.com/pydata/xarray/pull/5796 provides basically a template for how to do all of this!

matthewfeickert commented 2 years ago

In https://github.com/glotzerlab/signac/pull/776 @bdice mentions

We deleted the CI script for benchmarks from signac 2.0 anyway, because it's not reliable and we want to use asv instead.

@bdice I would love to talk to you about asv sometime as we've been wanting to set that up for pyhf for a while but haven't yet. If you have insights on how to get going with it I'd be quite keen to learn.

bdice commented 2 years ago

You can see signac's benchmarks defined here: https://github.com/glotzerlab/signac/blob/master/benchmarks/benchmarks.py

And the asv config: https://github.com/glotzerlab/signac/blob/master/asv.conf.json

And here's a quick reference I wrote on how to use asv: https://docs.signac.io/projects/core/en/latest/support.html#benchmarking

I have mixed feelings about it. It can be difficult to make asv do what I want sometimes, and the project's development has been rather slow. Sometimes I wish for features that don't exist (like being able to have greater control over test setup/teardown to ensure that caches are cleared between runs without having to regenerate input data -- something like pytest fixtures would be helpful). I've run into a handful of situations while running asv that felt like bugs but were difficult to trace down. I don't know of better alternatives to asv unless you have the time and energy to roll your own Python scripts, which is what signac had done for a long time. Eventually the maintenance of those DIY scripts and their limitations were annoying enough that outsourcing to asv felt like a good decision.

edit: I read some of the thread above. I have had really mediocre experiences with running benchmarks as a part of CI or on shared servers. Dedicated local hardware is the only way I've ever gotten metrics that I really trust, especially for a project like signac that is heavy on I/O. The results from Quansight on GitHub Actions were extremely helpful for calibrating my own experience of annoyance with CI benchmarks in the past. I don't think the metrics they see for false positives and highly noisy data are good enough for what the signac project has needed in the past -- local benchmarks are much less variable in my experience.
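For reference, the per-benchmark hooks asv does offer look roughly like the sketch below (the class name, the benchmarked call, and using the pyhf backend as the parameter are illustrative assumptions, not taken from signac or pyhf). The repeated timed calls inside one sample still share whatever state setup created, which is the cache-control limitation described above.

```python
# Sketch of a class-based asv benchmark with setup/teardown and parameters.
import pyhf


class TimeHypotestBackends:
    # asv runs the benchmark once per listed parameter value and passes the
    # value to setup, teardown, and the time_ method.
    params = ["numpy"]  # other pyhf backends could be listed here
    param_names = ["backend"]

    def setup(self, backend):
        pyhf.set_backend(backend)
        self.model = pyhf.simplemodels.uncorrelated_background(
            signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
        )
        self.data = [51, 48] + self.model.config.auxdata

    def teardown(self, backend):
        pyhf.set_backend("numpy")  # restore the default backend

    def time_hypotest(self, backend):
        pyhf.infer.hypotest(1.0, self.data, self.model)
```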

astrojuanlu commented 2 years ago

Hi folks, @matthewfeickert asked me to leave my 2 cents here a few days ago. Basically 2 things:

Dedicated local hardware is the only way I've ever gotten metrics that I really trust, especially for a project like signac that is heavy on I/O.

This is 100% correct. Here are the benchmarks we ran a few years ago in poliastro: the noisy lines are from my own laptop (supposedly idle otherwise); the almost-straight line is from a cheap dedicated server we rented on https://www.kimsufi.com/. Slower, but infinitely more useful.

[plot: poliastro benchmark timings, noisy laptop runs vs. flat dedicated-server runs]

I have mixed feelings about it. It can be difficult to make asv do what I want sometimes, and the project's development has been rather slow.

Recently they got a grant https://pandas.pydata.org/community/blog/asv-pandas-grant.html and managed to revamp the CI and make a release. The project has not seen more commits since then, so I agree it's not very active, but I'm not aware of any alternatives. The closest one would be https://github.com/ionelmc/pytest-benchmark/, but it's equally inactive.
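For reference, the pytest-benchmark route mentioned above hangs off a benchmark fixture inside an ordinary test; a minimal sketch follows (the toy model is an illustrative assumption).

```python
# Sketch of a pytest-benchmark test: the benchmark fixture calls the wrapped
# function repeatedly and records timing statistics alongside the test result.
import pyhf


def test_hypotest_benchmark(benchmark):
    model = pyhf.simplemodels.uncorrelated_background(
        signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
    )
    data = [51, 48] + model.config.auxdata
    benchmark(pyhf.infer.hypotest, 1.0, data, model)
```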

matthewfeickert commented 2 years ago

Following up on @astrojuanlu's excellent points, I was talking with @gordonwatts at the 2022 IRIS-HEP Institute Retreat about this and he mentioned that he might have some dedicated AWS machines that we could potentially use (or at least trial a demo on). Gordon, could you elaborate on this? My memory from last week isn't as clear as it was the next day.

gordonwatts commented 2 years ago

We have an account connected with IRIS-HEP for benchmarking (@masonproffitt and I were going to use it for some benchmarking for our ADL Benchmark paper work, but that didn't happen). It is still active; only Mason and I have access. But you get a dedicated machine of a specific size (at least, that is what the web interface says), so if one can build a script that does the complete install and then runs the tests, this could be a cheap-ish way to run these benchmarks.