langchain-ai / langchain-google

VertexAISearchRetriever is not capable of returning more than one relevant document for unstructured data stores. #230

Closed: akos-sch closed this issue 2 weeks ago

akos-sch commented 1 month ago

The current implementation only supports retrieving the single most relevant document from unstructured data stores.

This limitation is also hidden: with unstructured data stores, the returned chunk type is always extractive segments, as this excerpt shows:

def _get_relevant_documents(
    self, query: str, *, run_manager: CallbackManagerForRetrieverRun
) -> List[Document]:
    """Get documents relevant for a query."""

    search_request = self._create_search_request(query)

    try:
        response = self._client.search(search_request)
    except InvalidArgument as exc:
        raise type(exc)(
            exc.message
            + " This might be due to engine_data_type not set correctly."
        )

    if self.engine_data_type == 0:
        chunk_type = (
            "extractive_answers"
            if self.get_extractive_answers
            else "extractive_segments"
...

With engine_data_type set to 0 (unstructured data stores) and get_extractive_answers left at its default of False, chunk_type can only be "extractive_segments".
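The branch above can be isolated as a small standalone function to make the possible outcomes explicit. This is a minimal sketch, not the library's code: `select_chunk_type` and `UNSTRUCTURED` are hypothetical names, and because the excerpt is cut off before the other data types are handled, the sketch returns `None` for them rather than guessing.

```python
from typing import Optional

# engine_data_type value for unstructured data stores, as stated in the issue.
UNSTRUCTURED = 0


def select_chunk_type(
    engine_data_type: int, get_extractive_answers: bool = False
) -> Optional[str]:
    """Mirror the chunk-type branch from the excerpt (hypothetical helper)."""
    if engine_data_type == UNSTRUCTURED:
        return (
            "extractive_answers"
            if get_extractive_answers
            else "extractive_segments"
        )
    # Other data types are elided in the excerpt, so nothing is assumed here.
    return None


for flag in (False, True):
    print(flag, select_chunk_type(UNSTRUCTURED, flag))
```

With the defaults, only the "extractive_segments" branch is reachable, which is the behavior described above.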

Field:

"""
The maximum number of extractive answers returned in each search result.
At most 5 answers will be returned for each SearchResult.
"""
max_extractive_segment_count: int = Field(default=1, ge=1, le=1)

Because the bounds are ge=1 and le=1, this field cannot be set to anything other than 1.
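The effect of identical lower and upper bounds can be illustrated without the library. The sketch below is a plain-dataclass stand-in, not the retriever class and not pydantic itself: `SegmentSettings` and `is_accepted` are hypothetical names, and the manual range check only imitates what `ge=1, le=1` expresses.

```python
from dataclasses import dataclass


@dataclass
class SegmentSettings:
    """Hypothetical stand-in for the retriever field; not the library's class."""

    max_extractive_segment_count: int = 1

    def __post_init__(self) -> None:
        # Same bounds as Field(ge=1, le=1): the only accepted value is 1.
        if not 1 <= self.max_extractive_segment_count <= 1:
            raise ValueError(
                "max_extractive_segment_count must be >= 1 and <= 1"
            )


def is_accepted(value: int) -> bool:
    """Return True if the stand-in accepts the given segment count."""
    try:
        SegmentSettings(max_extractive_segment_count=value)
        return True
    except ValueError:
        return False


print({value: is_accepted(value) for value in (0, 1, 2, 5)})
```

Any attempt to request more than one segment per search result is rejected at construction time, before a search request is ever built.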

Content spec setting: engine_data_type is 0 for the data store and, as noted above, get_extractive_answers defaults to False.

def _get_content_spec_kwargs(self) -> Optional[Dict[str, Any]]:
    """Prepares a ContentSpec object."""

    from google.cloud.discoveryengine_v1beta import SearchRequest

    if self.engine_data_type == 0:
        if self.get_extractive_answers:
            extractive_content_spec = (
                SearchRequest.ContentSearchSpec.ExtractiveContentSpec(
                    max_extractive_answer_count=self.max_extractive_answer_count,
                )
            )
        else:
            extractive_content_spec = (
                SearchRequest.ContentSearchSpec.ExtractiveContentSpec(
                    max_extractive_segment_count=self.max_extractive_segment_count,
                )
            )
...

This means the content spec always requests exactly one extractive segment per search result.

There is an option to set max_documents. However, when I tested different values, I always got one document as a response.

Code snippet used:

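from langchain.tools.retriever import create_retriever_tool
from langchain_google_community import VertexAISearchRetriever

# PROJECT_ID, DS_LOCATION_ID, DATA_STORE_ID, DATA_STORE_TYPE, RETRIEVER_NAME,
# RETRIEVER_DESCR and question are defined elsewhere in my setup.
retriever = VertexAISearchRetriever(
    project_id=PROJECT_ID,
    location_id=DS_LOCATION_ID,
    data_store_id=DATA_STORE_ID,
    max_documents=10,
    engine_data_type=DATA_STORE_TYPE,  # set to 0 for unstructured data store
)

retriever_tool = create_retriever_tool(
    retriever=retriever,
    name=RETRIEVER_NAME,
    description=RETRIEVER_DESCR,
)

docs = retriever_tool.invoke({"query": question})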

I use version "1.0.3" of langchain-google-community.

Could you fix this?

akos-sch commented 2 weeks ago

I received a tip from a Google contact about a possible data store/ranking issue specific to the data I had used to set up the data store. I created a different data store with other data and managed to retrieve multiple documents. It seems there weren't many relevant documents for the queries I tried. Therefore, I consider this issue closed.
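The resolution is consistent with a model in which the one-segment limit applies per search result rather than per query, so the document count follows the number of results the search returns. The sketch below illustrates that model with plain dictionaries. The result shape and the `flatten_segments` helper are assumptions made for illustration; they are not the library's response format or conversion code.

```python
from typing import Dict, List


def flatten_segments(
    results: List[Dict[str, List[str]]],
    chunk_type: str = "extractive_segments",
) -> List[str]:
    """Collect one entry per chunk across all search results (assumed shape)."""
    documents: List[str] = []
    for result in results:
        # Each result contributes its own chunks; with a segment count of 1,
        # that is at most one entry per result.
        documents.extend(result.get(chunk_type, []))
    return documents


# Ten matching results, one segment each: ten documents come back.
many_hits = [{"extractive_segments": [f"segment from doc {i}"]} for i in range(10)]
# A single matching result: one document, regardless of any requested maximum.
one_hit = [{"extractive_segments": ["segment from the only matching doc"]}]

print(len(flatten_segments(many_hits)))
print(len(flatten_segments(one_hit)))
```

Under this model, a query that matches only one document yields a single result whatever max_documents is set to, which matches what was observed with the original data store.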