langchain-ai / langchain-weaviate


Weaviate should allow the user the flexibility to specify which vectorizer module they want to use #95

Open pashva opened 4 months ago

pashva commented 4 months ago

I was using the langchain-weaviate module as my library for managing my Weaviate storage. The main problem was that I wanted to use Weaviate's local text2vec-transformers module, but in langchain there was no way to pass this option to ensure that particular documents are embedded with particular vectorizers.

Weaviate allows users to specify a vectorizer as a key-value pair when creating a class, so that each class can use local vectorization, or whichever vectorizer they choose.

Currently this is not implemented in langchain: when using the from_documents or from_texts function calls, only a default schema with a single data property gets created.
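For illustration, here is a minimal sketch of the Weaviate-side configuration being referred to, using the v4 Python client; the collection name "Documents" and the text property are placeholders, not taken from this issue.

import weaviate
from weaviate.classes.config import Configure, DataType, Property

client = weaviate.connect_to_local()

# Placeholder collection that delegates vectorization to the local
# text2vec-transformers module instead of a client-side embedding model.
client.collections.create(
    name="Documents",
    vectorizer_config=Configure.Vectorizer.text2vec_transformers(),
    properties=[Property(name="text", data_type=DataType.TEXT)],
)

client.close()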

Solution: Allow an optional user-defined vectorizer field

I have implemented this; should I create a PR? It was originally opened as https://github.com/langchain-ai/langchain/pull/16795, which was closed with a request to check this repository out.

hsm207 commented 4 months ago

hey @pashva,

Thanks again for your interest in contributing! I would like to learn more about your use case, but first, to answer your question about whether we should port over your PR:

Looks like we already have an example of what happens when we want to allow customisation of the default schema that langchain creates, as discussed in #94. I think a solution where users create their desired schema themselves, and then tell langchain the schema name, is much cleaner than extending the init method with more params.
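For context, a minimal sketch of that pattern with the current WeaviateVectorStore constructor, assuming the collection was already created up front with the desired vectorizer; the collection name, text key, and embedding model are placeholders, and langchain still expects a client-side embedding for queries, which is the gap this issue is about.

import weaviate
from langchain_openai import OpenAIEmbeddings
from langchain_weaviate.vectorstores import WeaviateVectorStore

client = weaviate.connect_to_local()

# Point langchain at a pre-created collection by name instead of letting it
# build the default schema.
vectorstore = WeaviateVectorStore(
    client=client,
    index_name="Documents",  # placeholder: name of the pre-created collection
    text_key="text",         # placeholder: property holding the document text
    embedding=OpenAIEmbeddings(),
)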

What do you think?

As for your use case, I understand that you want a local embeddings model, so Weaviate's text2vec-transformers module is a great choice. However, since you're using langchain, why not use their HuggingFaceEmbeddings class with sentence_transformers?
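For example, a minimal sketch of that suggestion; the model name and example text are placeholders, and HuggingFaceEmbeddings here is the langchain_community wrapper around sentence_transformers.

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_weaviate.vectorstores import WeaviateVectorStore

# Placeholder local model; any sentence-transformers model works.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

db = WeaviateVectorStore.from_texts(
    ["hello world"],       # example documents
    embedding=embeddings,  # vectors are computed locally, client-side
    client=client,         # an existing weaviate client connection
)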

StreetLamb commented 2 months ago

Hi @hsm207, I agree that defining the schema with the Weaviate client and integrating it with Langchain is a better approach. For my use case, I plan to use Weaviate as a retrieval tool for my agents, which is why I prefer langchain-weaviate over using Weaviate stand-alone. Additionally, I want to offer my users the flexibility to choose their vectoriser, such as Langchain's OpenAIEmbeddings() or a local embedding model. Currently, this level of customisation is not supported in langchain-weaviate.

hsm207 commented 2 months ago

@StreetLamb thanks for clarifying your use case.

I plan to use Weaviate as a retrieval tool for my agents

I'm not very familiar with other parts of langchain. Do you mean you're going to create a custom tool so that a langchain agent can use weaviate to do retrieval?

StreetLamb commented 2 months ago

@hsm207 Yes, but it should already be possible to create a Weaviate retriever and use the LangChain API to create the retriever tool instead of creating a custom tool. Just to share: I tried langchain-chroma in the past, and the ability to customise the collection using their client and then have langchain-chroma reference the collection by name was helpful.
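For reference, a minimal sketch of the langchain-chroma pattern described here; the collection name and embedding model are placeholders.

import chromadb
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# The collection is created and customised directly with the Chroma client...
persistent_client = chromadb.PersistentClient()
persistent_client.get_or_create_collection("my_docs")

# ...and langchain-chroma then references it purely by name.
vectorstore = Chroma(
    client=persistent_client,
    collection_name="my_docs",
    embedding_function=OpenAIEmbeddings(),
)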

hsm207 commented 1 month ago

@StreetLamb I'm looping in @efriis for input on what changes are needed in langchain-weaviate in order to have the langchain agent + chroma feature you described.

efriis commented 1 month ago

Howdy! You should be able to define an embedding model (which I think is what you're calling a vectoriser), and make a weaviate retriever tool with

weaviate_vectorstore = WeaviateVectorStore(embedding=OpenAIEmbeddings())
create_retriever_tool(weaviate_vectorstore.as_retriever(), ...)

If that's not the case, feel free to reopen, as that would probably be a bug.
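For completeness, a fuller sketch of that suggestion with the imports and required arguments filled in; the collection name, text key, tool name, and description are placeholders, not part of the original comment.

import weaviate
from langchain.tools.retriever import create_retriever_tool
from langchain_openai import OpenAIEmbeddings
from langchain_weaviate.vectorstores import WeaviateVectorStore

client = weaviate.connect_to_local()

weaviate_vectorstore = WeaviateVectorStore(
    client=client,
    index_name="Documents",        # placeholder collection name
    text_key="text",               # placeholder text property
    embedding=OpenAIEmbeddings(),  # the embedding model ("vectoriser")
)

retriever_tool = create_retriever_tool(
    weaviate_vectorstore.as_retriever(),
    "search_documents",                       # placeholder tool name
    "Searches the Weaviate document index.",  # placeholder description
)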

StreetLamb commented 1 month ago

Hi @efriis, sorry there might have been some confusion. The challenge I am facing is that I cannot specify the use of Weaviate modules to do the vectorisation:

weaviate_vectorstore = WeaviateVectorStore(embedding=OpenAIEmbeddings())

Using OpenAI's embedding model when DEFAULT_VECTORIZER_MODULE: 'multi2vec-clip' is set in my docker-compose.yml will cause a conflict, since the default schema created by langchain assumes no Weaviate module is being used for vectorisation. See #177.

efriis commented 1 month ago

Got it. @hsm207 I tend to agree that supporting that setting in langchain is relevant in order to make it usable with other components (e.g. as a retrieval tool for an agent), and I'll defer to you to determine what's best for the weaviate integration package!

pashva commented 1 month ago

I have an implementation ready that I use myself; if needed, I can contribute it to this repository @hsm207

hsm207 commented 1 month ago

@pashva sure, that contribution would be great.

pashva commented 1 month ago

@hsm207 I have created a PR for this, hopefully it serves our purpose @StreetLamb

PR: https://github.com/langchain-ai/langchain-weaviate/pull/179