Happy to work on this within the next couple of weeks.
One interesting challenge is that Gemini models are multimodal - they accept images as well as text as input.
Yes, I have the start of image support in https://github.com/tmc/langchaingo/pull/361 that I'd like to extend to cover ollama's multi-modal support (in addition to Gemini).
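For reference, a minimal sketch of what a multimodal call looks like when going straight through the Google AI Go SDK (github.com/google/generative-ai-go/genai), independent of langchaingo; the model name, image file, and environment variable are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/google/generative-ai-go/genai"
	"google.golang.org/api/option"
)

func main() {
	ctx := context.Background()
	// API-key auth via the Google AI SDK; Vertex AI uses GCP credentials instead.
	client, err := genai.NewClient(ctx, option.WithAPIKey(os.Getenv("GOOGLE_API_KEY")))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	img, err := os.ReadFile("cat.png") // placeholder image file
	if err != nil {
		log.Fatal(err)
	}

	// A single request can mix text and image parts.
	model := client.GenerativeModel("gemini-pro-vision")
	resp, err := model.GenerateContent(ctx,
		genai.Text("Describe this image:"),
		genai.ImageData("png", img),
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Candidates[0].Content.Parts[0])
}
```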
A key issue is that the llms.LLM interface has only []string for its prompts, and a lot of code depends on it. We need to either break it and rewrite existing code, or add another interface.
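For illustration only, here is one possible shape for a richer prompt type; the names below (ContentPart, Message, Model) are hypothetical and not the design the project settled on, they just show the direction:

```go
package llms

import "context"

// ContentPart is one piece of a prompt: text, an image, etc.
type ContentPart interface {
	isPart()
}

// TextPart carries plain text.
type TextPart struct{ Text string }

// ImagePart carries raw image bytes plus their MIME type.
type ImagePart struct {
	MIMEType string
	Data     []byte
}

func (TextPart) isPart()  {}
func (ImagePart) isPart() {}

// Message is a single prompt message made of one or more parts.
type Message struct {
	Role  string
	Parts []ContentPart
}

// Model is a multimodal-capable alternative to the text-only, []string-based interface.
type Model interface {
	GenerateContent(ctx context.Context, msgs []Message) (string, error)
}
```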
The chat interfaces supply a slightly richer surface area that we can extend, but yes, I agree we need a more flexible interface here. As we're pre-1.0, I'm open to considering a breaking change; it's pretty clear that text-only prompting isn't sufficient.
I still want to spend some time studying the chat interfaces; they could be ripe for a significant refactoring. Chats shouldn't have different interfaces from LLMs - they should just build on top: a chat should simply wrap an LLM with some history/context. I haven't found time for this yet, though - hopefully soon.
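A rough sketch of that idea, with hypothetical LLM/Chat types - a chat is just a model plus accumulated history, not a separate kind of interface:

```go
package chat

import (
	"context"
	"strings"
)

// LLM is a minimal single-turn text interface (hypothetical, for illustration).
type LLM interface {
	Call(ctx context.Context, prompt string) (string, error)
}

// Chat wraps an LLM with conversation history.
type Chat struct {
	llm     LLM
	history []string
}

func (c *Chat) Send(ctx context.Context, userMsg string) (string, error) {
	c.history = append(c.history, "user: "+userMsg)
	// Feed the accumulated history back as the prompt for the next turn.
	reply, err := c.llm.Call(ctx, strings.Join(c.history, "\n"))
	if err != nil {
		return "", err
	}
	c.history = append(c.history, "assistant: "+reply)
	return reply, nil
}
```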
Certainly open to collaborating on arriving at a good design!
Happy to help here! I am a Google Developer Expert (GDE) in ML.
Just a note that #465 was opened to specifically discuss the new interfaces needed to support multi-modal input. We'll design something that works for both OpenAI and Gemini and hopefully other models as well.
Once #465 is done, adding a Google Gemini backend should be straightforward. I think the following plan makes sense:

1. Land the new multi-modal interface from #465.
2. Add a new Gemini provider that supports both the Google AI SDK and the Vertex AI SDK.
3. Don't build this on the existing vertexai backend, since it's tuned for the old text-only PaLM model.

@eliben I would suggest using the Vertex AI SDK, since the Google AI SDK is not available in all countries yet.
@xavidop step (2) mentions that both can be supported via this interface. The user should be able to choose which to use when a client is created.
Would you like to help with this once the initial new interface from (1) is in place?
The hard part about this is that the two interfaces are quite different. You have to run them in parallel. I hacked something together with this approach and it looks like this: https://github.com/mrothroc/langchaingo/blob/3ecb9c417aa8777496fc378be35fa2271ea3f68b/llms/vertexai/internal/common/vertex_client.go#L27
(The code is a little disorganized since it's just a hack, but you can see that, in general, you need both clients. The unit tests work, though, so you can step through it to see it working.)
If anyone can suggest a better way to go about this, I'm open to it. I solved this problem in another private project by just using straight REST calls instead of the library, but I think that will lead to issues down the road.
> The hard part about this is that the two interfaces are quite different. You have to run them in parallel. I hacked something together with this approach and it looks like this: https://github.com/mrothroc/langchaingo/blob/3ecb9c417aa8777496fc378be35fa2271ea3f68b/llms/vertexai/internal/common/vertex_client.go#L27
The legacy client for PaLM isn't needed anymore, since there's a new SDK for Vertex to interface with the Gemini models: https://pkg.go.dev/cloud.google.com/go/vertexai and it has a compatible interface to the Google generative AI SDK. But yes, having pointers to two potential clients and initializing just one of them based on passed options/parameters sounds like a reasonable approach overall.
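A rough sketch of that approach (not langchaingo's actual code; the package, option, and field names are illustrative): the wrapper holds one pointer per SDK and the constructor initializes exactly one of them based on the options it's given.

```go
package googleclient

import (
	"context"
	"errors"

	vertexgenai "cloud.google.com/go/vertexai/genai"
	googlegenai "github.com/google/generative-ai-go/genai"
	"google.golang.org/api/option"
)

// Options selects which backend to talk to; exactly one auth style should be set.
type Options struct {
	APIKey    string // Google AI SDK (API-key auth)
	ProjectID string // Vertex AI SDK (GCP auth)
	Location  string // e.g. "us-central1"; only used with Vertex AI
}

// Client holds a pointer per SDK; only one is ever initialized.
type Client struct {
	googleAI *googlegenai.Client
	vertex   *vertexgenai.Client
}

// New picks the backend from the supplied options.
func New(ctx context.Context, opts Options) (*Client, error) {
	switch {
	case opts.APIKey != "":
		c, err := googlegenai.NewClient(ctx, option.WithAPIKey(opts.APIKey))
		if err != nil {
			return nil, err
		}
		return &Client{googleAI: c}, nil
	case opts.ProjectID != "":
		c, err := vertexgenai.NewClient(ctx, opts.ProjectID, opts.Location)
		if err != nil {
			return nil, err
		}
		return &Client{vertex: c}, nil
	default:
		return nil, errors.New("either APIKey or ProjectID must be set")
	}
}
```

Since both SDKs expose a very similar generation API, the methods on Client can dispatch to whichever pointer is non-nil.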
> The legacy client for PaLM isn't needed anymore, since there's a new SDK for Vertex to interface with the Gemini models: https://pkg.go.dev/cloud.google.com/go/vertexai and it has a compatible interface to the Google generative AI SDK.
I tried just using that to call text-bison but it doesn't appear to be supported via this client. So, if the expectation is that text-bison will still work, you have to run the two in parallel.
Also, it looks like the new API isn't complete. For example, unless I'm missing something, it doesn't appear to do embeddings.
> I tried just using that to call text-bison but it doesn't appear to be supported via this client. So, if the expectation is that text-bison will still work, you have to run the two in parallel.
text-bison is the old PaLM model, and there's no real reason to invoke it now that Gemini is out. gemini-pro should be good now.
> Also, it looks like the new API isn't complete. For example, unless I'm missing something, it doesn't appear to do embeddings.
Indeed, this discrepancy is unfortunate and hopefully temporary. In the meantime, it's OK to use a pointer to the PaLM client in the client type to answer embedding queries.
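A sketch of how that could look; the palmEmbedder interface and the field names are hypothetical, the point is only that embedding calls are routed to the legacy client behind the same client type:

```go
package googleai

import (
	"context"

	"github.com/google/generative-ai-go/genai"
)

// palmEmbedder stands in for the legacy PaLM embeddings client (hypothetical interface).
type palmEmbedder interface {
	CreateEmbedding(ctx context.Context, texts []string) ([][]float32, error)
}

// Client routes generation to Gemini and embeddings to the legacy PaLM client.
type Client struct {
	gemini *genai.Client // generation goes through Gemini
	palm   palmEmbedder  // embeddings fall back to the legacy PaLM client
}

// CreateEmbedding answers embedding queries via the PaLM client; from the
// caller's point of view it's just another method on the same client.
func (c *Client) CreateEmbedding(ctx context.Context, texts []string) ([][]float32, error) {
	return c.palm.CreateEmbedding(ctx, texts)
}
```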
> text-bison is the old PaLM model, and there's no real reason to invoke it now that Gemini is out. gemini-pro should be good now.
Switching models seems like a pretty big change. We tried just sending some of our production prompts used with text-bison to gemini-pro and the results were different. Also, text-bison is now officially supported, while gemini-pro is still in preview.
This is just my use case. It would definitely not work in our company. Maybe everyone else can live with that kind of change?
I guess this library is young enough that breaking changes are OK? If so, I'd be inclined to change the factory function that gets the LLM.
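A sketch of that factory-function idea with a hypothetical WithModel option (names are illustrative, not langchaingo's actual API); keeping the old default means nobody is silently switched to a different model:

```go
package vertexai

// options holds construction-time settings; only the model name is shown here.
type options struct {
	model string
}

// Option configures the LLM returned by New.
type Option func(*options)

// WithModel selects the backing model, e.g. "text-bison" (legacy PaLM) or "gemini-pro".
func WithModel(name string) Option {
	return func(o *options) { o.model = name }
}

// LLM is a stand-in for the provider's client type.
type LLM struct {
	model string
}

// New is the factory function; it keeps the old default so existing callers
// aren't silently moved to a different model.
func New(opts ...Option) (*LLM, error) {
	o := options{model: "text-bison"}
	for _, opt := range opts {
		opt(&o)
	}
	return &LLM{model: o.model}, nil
}
```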
Status update here:

- We now have an implementation of the new Model interface for GoogleAI (thanks @mrothroc!!).
- The next step should be adding a parallel implementation in the same provider using the https://pkg.go.dev/cloud.google.com/go/vertexai/genai SDK - it provides largely the same functionality (there are some small differences), but uses GCP authentication instead of API keys. Since this SDK doesn't support embeddings yet, it will use the legacy PaLM client for embeddings, but in a manner that should be transparent to users.
- With Gemini Pro now available, we should have an integration + an example.
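A rough sketch of what such an example could look like for the Vertex AI variant, using the cloud.google.com/go/vertexai/genai SDK directly with GCP authentication; the project ID, location, and prompt are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/vertexai/genai"
)

func main() {
	ctx := context.Background()
	// Uses GCP Application Default Credentials; project and location are placeholders.
	client, err := genai.NewClient(ctx, "my-gcp-project", "us-central1")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	model := client.GenerativeModel("gemini-pro")
	resp, err := model.GenerateContent(ctx, genai.Text("Write a haiku about Go."))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Candidates[0].Content.Parts[0])
}
```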