manubot / manubot-ai-editor


Explore the use of static LLM model implementations for deterministic testing and potentially lowered costs #27

Open · d33bs opened this issue 1 year ago

d33bs commented 1 year ago

This issue was inspired by #25 while considering testing limitations. It proposes using static models such as Llama (and many others) as part of testing procedures, both to make test results more deterministic through explicit model versioning (acknowledging that OpenAI models receive continuous updates) and to potentially reduce the cost of OpenAI API access. Benefits might include stronger testing assurance, insight into how new or different models perform with manubot-ai-editor, and protection against OpenAI's continuous updates (where results may differ enough that test outcomes become confusing). Implementing these static models might involve tooling such as privateGPT or similar projects.
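For illustration, here is a minimal sketch of what a pinned-model test could look like, assuming llama-cpp-python as the local backend; the model file name, seed, and prompt are hypothetical placeholders, not part of this proposal:

```python
# Hypothetical sketch: a deterministic unit test against a pinned local model.
# Assumes llama-cpp-python is installed and the model file exists on disk.
from llama_cpp import Llama

MODEL_PATH = "models/llama-2-7b-chat.Q4_K_M.gguf"  # explicit, versioned artifact
PROMPT = [{"role": "user", "content": "Revise this text: We did experiments on cells."}]

def run_revision() -> str:
    # A fixed seed plus temperature 0 aims for reproducible output
    # from the same model file.
    llm = Llama(model_path=MODEL_PATH, seed=0, verbose=False)
    result = llm.create_chat_completion(messages=PROMPT, temperature=0.0)
    return result["choices"][0]["message"]["content"]

def test_revision_is_deterministic():
    # Two independent runs over the same pinned model should agree.
    assert run_revision() == run_revision()
```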

miltondp commented 1 year ago

@d33bs, this is a really great idea, thank you for opening this issue! I already requested access and I'm downloading the models to give it a quick try :)

~~The model files might be too big to download within GitHub Actions (that's the advantage of the API). But this is definitely something we'd like to do in the future if time allows (since it was not originally included in the Sloan grant).~~

miltondp commented 1 year ago

Sorry for the confusion. I realize you were referring to the testing part, and I think it's a great idea to use explicit snapshots when referencing models. Nonetheless, as I mentioned here, I think testing prompts should use more advanced tools such as OpenAI Evals instead of the simple unit testing we have now.

miltondp commented 1 year ago

Ok, I couldn't resist the temptation of trying Llama 2 :) I downloaded the 7B-chat model (I think I would need more than one GPU for the larger models), and it produced some very interesting revisions, for instance, of this abstract:

[screenshot: Llama 2 (7B-chat) revision of an example abstract]

Llama 2's API for chat completion is very similar to OpenAI's, so it shouldn't be hard to implement. And as you suggest, this could greatly improve prompt testing, for instance, by providing a pilot model to use before an expensive one.
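To illustrate the similarity, here is a rough sketch (not project code) using the 2023-era OpenAI Python SDK alongside llama-cpp-python, whose chat API deliberately mirrors OpenAI's; the model path and prompt are placeholders:

```python
# Both backends accept the same messages list and return the same
# choices/message/content structure, so swapping them is mostly configuration.
messages = [{"role": "user", "content": "Revise this abstract: ..."}]

# Remote, paid: OpenAI chat completion (pre-1.0 SDK style).
import openai
remote = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
print(remote["choices"][0]["message"]["content"])

# Local, free: Llama 2 via llama-cpp-python (hypothetical model path).
from llama_cpp import Llama
llm = Llama(model_path="models/llama-2-7b-chat.gguf")
local = llm.create_chat_completion(messages=messages)
print(local["choices"][0]["message"]["content"])
```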

miltondp commented 1 year ago

Another option I just realized is to use a very low temperature, even zero.
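For example (a minimal sketch with the 2023-era OpenAI SDK; model and prompt are placeholders). Note that even at temperature 0, OpenAI does not guarantee bit-identical outputs across calls, so this reduces rather than eliminates nondeterminism:

```python
import openai

# temperature=0 makes decoding (near-)greedy: the most likely token is chosen
# at each step, so repeated calls are far more likely to agree.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Revise: We did experiments on cells."}],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])
```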

d33bs commented 1 year ago

Cheers, thanks for these thoughts and the exploration, @miltondp! The temperature parameter seems like a great way to ensure greater determinism. It's exciting that the Llama 2 API is similar to OpenAI's; I wonder how much of this could be abstracted for the use cases here (maybe there's already a unified way to interact with these models?).
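As a sketch of what such an abstraction could look like (purely hypothetical; the function and parameter names below are invented for illustration, and libraries like LangChain already offer unified chat-model interfaces):

```python
# Hypothetical dispatch layer: one function, two interchangeable backends.
def chat(messages, backend="openai", model="gpt-3.5-turbo",
         model_path=None, temperature=0.0):
    if backend == "openai":
        import openai
        resp = openai.ChatCompletion.create(
            model=model, messages=messages, temperature=temperature
        )
    else:
        from llama_cpp import Llama
        resp = Llama(model_path=model_path).create_chat_completion(
            messages=messages, temperature=temperature
        )
    # Both responses share the same shape, so extraction is backend-agnostic.
    return resp["choices"][0]["message"]["content"]
```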

Generally, I'm concerned that some OpenAI tests may shift in their output as continuous updates are applied, with no way to statically pin a version. I'm unsure how this works on their end, but some of their endpoints feel akin to a Docker image's :latest tag (where it's often a best practice to pin a version for software deployments). While the latest version is often what we want for experimentation, cutting-edge changes can sometimes lead to unexpected results. For example, changes to the latest models could produce different output even at temperature 0, meaning a test might pass one moment and fail the next.

Here I also wondered about additional advantages of open-source models (mostly thinking about Llama here, but there are others):

miltondp commented 1 year ago

I agree that referencing model versions would be ideal. I understand that OpenAI offers this as "snapshots" of models, at least for some of them, such as gpt-3.5-turbo. They also set a deprecation date after new models are released, but I don't think you are forced to move to the next model version when that happens.
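A quick sketch of the difference (gpt-3.5-turbo-0613 was a real dated snapshot at the time; the prompt is a placeholder):

```python
import openai

messages = [{"role": "user", "content": "Revise: We did experiments on cells."}]

# Floating alias: behaves like Docker's :latest and may change underneath you.
floating = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)

# Dated snapshot: pinned behavior until the snapshot is deprecated.
pinned = openai.ChatCompletion.create(model="gpt-3.5-turbo-0613", messages=messages)
```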

I completely agree with all of your points. I think I will make Llama 2 support a high priority for the next iterations because it can reduce the cost of accessing the OpenAI API, at least for testing prompts preliminarily, as well as reduce development time, as you said. And I'd love to see the tool used by colleagues in Argentina, for instance, where the costs of proprietary models are sometimes prohibitive.

Great comments and ideas. Thank you again, @d33bs!

castedo commented 1 week ago

#51 might be useful here