**hammer** opened this issue 1 year ago
I like thinking about safely shipping ML-related code, so I dig this issue and am open to more ideas.
Here's how I've previously approached testing when shipping ML models to production. I think the LLM world changes things a bit, but maybe we can borrow some ideas. I've also never worked on desktop applications.
In the past, I have thought about this as two questions: (1) is the code that calls the model/API correct, and (2) is the model itself making sensible predictions?
For the former, I think traditional mocking should be sufficient. We know the API's input parameters and expected response structure, so we can mock and test against that. This is what we currently do with calls to Grobid and OpenAI. Of course, the downside of mocking is that there will be unforeseen changes to both of those variables -- PDFs we haven't tested against resulting in API responses we hadn't expected. When that happens, I've always viewed the resulting bug as another test case to add (and maybe an indication that I should have coded more defensively).
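To make the mocking idea concrete, here's a minimal sketch with `unittest.mock`. The wrapper function and names are hypothetical, not Ref Studio's actual code; the point is just that we assert against the response *structure* we expect from the API:

```python
from unittest.mock import MagicMock

# Hypothetical wrapper around an OpenAI-style chat API; names are
# illustrative, not Ref Studio's actual code.
def extract_title(pdf_text, client):
    resp = client.chat_complete(prompt=f"Extract the title:\n{pdf_text}")
    return resp["choices"][0]["message"]["content"]

def test_extract_title_parses_expected_response_shape():
    client = MagicMock()
    # The mock mirrors the response structure we expect from the real API.
    client.chat_complete.return_value = {
        "choices": [{"message": {"content": "A Study of Testing"}}]
    }
    assert extract_title("...pdf text...", client) == "A Study of Testing"
    client.chat_complete.assert_called_once()
```

When the real API surprises us with a shape this test didn't anticipate, that surprise becomes a new mocked response to test against.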
For the latter, I've thought about this in multiple parts. I'll imagine we're building a grocery delivery company to illustrate.
If we have a model that predicts how long it will take to shop for some groceries, then calling the model with all features = 0 should result in a prediction of 0. Or setting the feature `num_items_to_shop_for = 10_000` should result in a prediction > 60 minutes. Testing against "known knowns".
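Those invariants translate directly into unit tests. A sketch, using a stand-in linear model purely for illustration:

```python
# "Known knowns" tests against a hypothetical shopping-time model.
def predict_shopping_minutes(features):
    # Stand-in model: linear in item count (illustrative only).
    return 0.5 * features.get("num_items_to_shop_for", 0)

def test_all_zero_features_predict_zero():
    assert predict_shopping_minutes({"num_items_to_shop_for": 0}) == 0

def test_huge_order_takes_over_an_hour():
    assert predict_shopping_minutes({"num_items_to_shop_for": 10_000}) > 60
```

The real model would be loaded instead of the stand-in, but the assertions stay the same: they encode domain knowledge the model must never violate, regardless of retraining.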
Imagine we shipped a bad model that occasionally raises exceptions. Rather than erroring out, we'd fall back to some heuristic or guardrail as the prediction, since running in a slightly degraded state is probably better for the user than not running at all. Protecting against "known unknowns".
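The guardrail pattern is a small wrapper. Again a sketch with hypothetical names; the heuristic here is an arbitrary per-item estimate:

```python
def heuristic_minutes(features):
    # Simple fallback: a base cost plus a flat per-item estimate.
    return 5 + 0.4 * features.get("num_items_to_shop_for", 0)

def safe_predict(model, features):
    try:
        return model(features)
    except Exception:
        # Degrade gracefully instead of erroring out; a real system
        # would also emit an alert here (see monitoring below).
        return heuristic_minutes(features)

def test_broken_model_falls_back_to_heuristic():
    def broken_model(features):
        raise RuntimeError("bad model")
    assert safe_predict(broken_model, {"num_items_to_shop_for": 10}) == 9.0
```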
If we're hitting the guardrail or falling back to the heuristic model, an alert would get thrown in Slack and the on-call MLEng would get paged. We'd also log predictions to the db and have dashboards that allowed us to monitor online model performance, so that we knew if there was some model drift or degradation (possibly due to retraining or shipping a new model). Capturing and quantifying "unknown unknowns."
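One cheap way to turn fallback usage into an alertable signal is to track the fallback rate over a sliding window. This is a hypothetical sketch; in production the "alert" branch would post to Slack or page the on-call rather than return a boolean:

```python
from collections import deque

class FallbackRateMonitor:
    """Tracks the fraction of recent predictions served by the guardrail."""

    def __init__(self, window=1000, alert_threshold=0.05):
        self.window = deque(maxlen=window)
        self.alert_threshold = alert_threshold

    def record(self, used_fallback: bool) -> bool:
        self.window.append(used_fallback)
        rate = sum(self.window) / len(self.window)
        # Returning True signals "alert"; a real system would page
        # the on-call MLEng / post to Slack here instead.
        return rate > self.alert_threshold
```

Logging each prediction alongside its source ("model" vs. "fallback") to the db is what feeds the dashboards for spotting drift.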
Any new model went through a split test so that impact to related metrics could be measured (e.g., "how does a new version of our shopping time model impact our actual % of late deliveries?"). Additionally, having the `model_version` always set through a feature flag allowed us to quickly toggle to a "safe" model without requiring a code change.
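The flag-driven version selection can be sketched as below. `get_flag` is a stand-in for whatever flag service is in use (LaunchDarkly, a config file, etc.), and the models are dummies:

```python
# Model registry keyed by version; the lambdas are placeholder models.
MODELS = {
    "v1_safe": lambda features: 30.0,
    "v2_candidate": lambda features: 0.5 * features["num_items_to_shop_for"],
}

def get_flag(name, default):
    # Stand-in for a real feature-flag service lookup.
    return default

def predict(features):
    version = get_flag("shopping_time_model_version", default="v1_safe")
    return MODELS[version](features)
```

Because rollback is just flipping the flag back to `v1_safe`, no deploy is needed when a new model misbehaves.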
The refstudio world of desktop apps and LLMs is different, but I think there are common themes.
For the second question, we wrote a blog post a few years back that may be of interest: http://www.hammerlab.org/2015/09/30/testing-oml/
We're using OpenAI, for example, and it's annoying and potentially expensive to call out to their API in CI/CD.
I've seen various approaches to mocking OpenAI, such as LangChain's FakeLLM and https://github.com/ClerkieAI/bettertest.
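The core idea behind those tools can be hand-rolled in a few lines: a fake client that returns canned responses in order and never touches the network. A sketch in the spirit of FakeLLM (the `complete` interface and `summarize` helper are hypothetical):

```python
class FakeLLM:
    """Minimal stand-in for an LLM client: returns canned responses
    in order, makes no network calls, costs nothing in CI."""

    def __init__(self, responses):
        self._responses = iter(responses)

    def complete(self, prompt: str) -> str:
        return next(self._responses)

# Hypothetical application code that takes the LLM as a dependency,
# so tests can inject the fake.
def summarize(text, llm):
    return llm.complete(f"Summarize: {text}")

def test_summarize_uses_llm_output():
    llm = FakeLLM(["a short summary"])
    assert summarize("long paper text", llm) == "a short summary"
```

Keeping the LLM client as an injected dependency (rather than a module-level global) is what makes this swap trivial.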
I have generally worked on internal tools and packaged software, so I haven't had to ship much production code that depends on external APIs. I'm open to suggestions on how to best handle this situation for Ref Studio!