falquaddoomi opened this issue 1 year ago
This is an important point that needs to be addressed (maybe in the next iteration?). In the current code, I tried to do something like this by putting all unit tests that hit the API into test_model_revision.py, so that could be a quick-and-dirty solution to this issue.
Nonetheless, I don't think unit tests are the best way to test prompts. One possibility is to try something like OpenAI Evals, as briefly described here.
Just wanted to upvote this idea; I think it'd be useful to distinguish the tests that hit OpenAI. Maybe pytest's custom markers could be used here within the existing test suite structure? This could also be extended later to additional markers beyond the scope of this issue, configured with sensible defaults, etc.
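For illustration, here's a minimal sketch of how a custom marker could look; the marker name (`cost`) and the test name are placeholders, not the project's actual API:

```python
# conftest.py (hypothetical sketch)
import pytest


def pytest_configure(config):
    # Register the marker so `pytest --strict-markers` doesn't reject it.
    config.addinivalue_line(
        "markers", "cost: marks tests that call a paid external API (e.g., OpenAI)"
    )


# test_model_revision.py (hypothetical test)
@pytest.mark.cost
def test_prompt_revision_against_openai():
    """Placeholder for a test that actually hits the OpenAI API."""
    ...
```

With that in place, `pytest -m "not cost"` would run only the free tests, and `pytest -m cost` only the paid ones.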
I'm inclined to completely remove from the unit tests any prompt testing that hits an external, paid API, and keep only tests that can be run locally for free. The unit tests that currently do that should probably move to a dedicated prompt-evaluation framework such as OpenAI Evals. We could potentially keep some small integration tests that hit a local model, using pytest functionality to run them only when the host has a GPU.
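A rough sketch of that GPU gating, assuming the local model runs through PyTorch (the test name is a placeholder):

```python
import pytest

# Detect a CUDA-capable GPU; treat a missing torch install as "no GPU".
try:
    import torch
    HAS_GPU = torch.cuda.is_available()
except ImportError:
    HAS_GPU = False

requires_gpu = pytest.mark.skipif(not HAS_GPU, reason="requires a CUDA-capable GPU")


@requires_gpu
def test_revision_with_local_model():
    """Placeholder for an integration test against a locally hosted model."""
    ...
```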
Agreed about moving the tests that run at cost to another framework. In the meantime, just to show how @d33bs's suggestion about using markers would work, I went ahead and implemented them in PR #33 and marked all the tests in test_model_revision.py as "cost". We can use that implementation as a starting point for, say, your GPU testing idea, @miltondp. We can probably infer whether the user has a GPU and use that as the default value, but it's still IMHO nice for users to be able to conditionally enable/disable the feature.
I'd noticed that running the full test suite uses the OpenAI API, which consumes a modest amount of credits. It also takes a little while for the requests to go through, since it's hitting an external service.
This is just a proposal, so feel free to decline and close this if you disagree. Anyway, what if we were to partition the tests into, say, full integration tests that could potentially cost money, and non-integration tests that just test our code? If a non-integration test needs an API response, we could mock it or otherwise return a locally-cached response. I imagine we could iterate more quickly and also integrate it safely into a CI/CD system if we could run just the subset of tests that test our code.
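To make the mocking idea concrete, here's a rough, self-contained sketch; `call_openai` and `revise_paragraph` are stand-ins for whichever code path currently hits the API, not the project's real functions:

```python
from unittest import mock


def call_openai(prompt: str) -> str:
    """Stand-in for the wrapper that actually calls the OpenAI API."""
    raise RuntimeError("should never be reached in a non-integration test")


def revise_paragraph(paragraph: str) -> str:
    """Stand-in for the project code under test."""
    return call_openai(f"Revise this paragraph: {paragraph}").strip()


def test_revise_paragraph_without_network():
    # Replace the API boundary with a canned (or locally cached) response,
    # so the test runs offline and costs nothing.
    canned = " A clearer revised paragraph. "
    with mock.patch(f"{__name__}.call_openai", return_value=canned) as fake_call:
        result = revise_paragraph("An unclear original paragraph.")

    fake_call.assert_called_once()
    assert result == "A clearer revised paragraph."
```

The same shape would work with responses cached to disk instead of inlined in the test.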