langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Incorrect passing of scores for sorting in CrossEncoderReranker #22556

Open NikitaKlichko opened 3 months ago

NikitaKlichko commented 3 months ago

Example Code

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Assumption: the original snippet does not define `device`; use "cuda" for a GPU.
device = "cpu"

re_rank_model_name = "amberoad/bert-multilingual-passage-reranking-msmarco"
model_kwargs = {
    "device": device,
    "trust_remote_code": True,
}
re_rank_model = HuggingFaceCrossEncoder(
    model_name=re_rank_model_name,
    model_kwargs=model_kwargs,
)

# `retriever` is an existing base retriever; it is not defined in this snippet.
compressor = CrossEncoderReranker(model=re_rank_model, top_n=3)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)

Error Message and Stack Trace (if applicable)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File */lib/python3.10/site-packages/langchain_core/retrievers.py:194, in BaseRetriever.invoke(self, input, config, **kwargs)
    175 """Invoke the retriever to get relevant documents.
    176 
    177 Main entry point for synchronous retriever invocations.
   (...)
    191     retriever.invoke("query")
    192 """
    193 config = ensure_config(config)
--> 194 return self.get_relevant_documents(
    195     input,
    196     callbacks=config.get("callbacks"),
    197     tags=config.get("tags"),
    198     metadata=config.get("metadata"),
    199     run_name=config.get("run_name"),
    200     **kwargs,
    201 )

File *lib/python3.10/site-packages/langchain_core/_api/deprecation.py:148, in deprecated.<locals>.deprecate.<locals>.warning_emitting_wrapper(*args, **kwargs)
    146     warned = True
    147     emit_warning()
...
     47 docs_with_scores = list(zip(documents, scores))
---> 48 result = sorted(docs_with_scores, key=operator.itemgetter(1), reverse=True)
     49 return [doc for doc, _ in result[: self.top_n]]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Description

The scores are passed to the sort incorrectly. The classifier returns two logits per query-document pair, one for dissimilarity and one for similarity, so each score is an array rather than a single number and the comparison in sorted() fails. A special case is needed: if the model produces two scores, take the middle value; otherwise leave the score as is. Is this a bug?

System Info

System Information

OS: Linux
OS Version: #172-Ubuntu SMP Fri Jul 7 16:10:02 UTC 2023
Python Version: 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0]

Package Information

langchain_core: 0.2.3
langchain: 0.2.1
langchain_community: 0.2.1
langsmith: 0.1.69
langchain_chroma: 0.1.1
langchain_openai: 0.1.8
langchain_text_splitters: 0.2.0
langchainhub: 0.1.17

keenborder786 commented 3 months ago

Some models, such as amberoad/bert-multilingual-passage-reranking-msmarco, provide scores in pairs: <not-relevant-score, relevant-score>. However, the current HuggingFaceCrossEncoder does not account for this pair of scores. To address this, I have created a pull request that modifies the encoder to consider only the relevant score.

Please refer to this comment for more details.
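
For illustration, the idea of keeping only the relevant score could be sketched as below. This is a minimal sketch and not the code from the pull request: the helper names `to_relevance_scores` and `rerank` are hypothetical, and it assumes that for two-score models the relevant score is the last element of each pair, as in the <not-relevant-score, relevant-score> layout described above.

```python
from typing import Any, List, Sequence


def to_relevance_scores(raw_scores: Any) -> List[float]:
    """Collapse each model output to a single float (hypothetical helper)."""
    # NumPy arrays expose .tolist(); plain lists pass through unchanged.
    if hasattr(raw_scores, "tolist"):
        raw_scores = raw_scores.tolist()
    flat: List[float] = []
    for score in raw_scores:
        if isinstance(score, (list, tuple)):
            # Assumption: pair layout is [not_relevant, relevant]; keep the last value.
            flat.append(float(score[-1]))
        else:
            # Single-score models: leave the value as is.
            flat.append(float(score))
    return flat


def rerank(documents: Sequence[Any], raw_scores: Any, top_n: int = 3) -> List[Any]:
    """Sort documents by their scalar relevance score, highest first."""
    scores = to_relevance_scores(raw_scores)
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```

With scalar scores the sort key is always a plain float, so sorted() never has to compare arrays, which is the comparison that raises the ValueError in the traceback above.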

NikitaKlichko commented 3 months ago

Thanks for the fix.

keenborder786 commented 3 months ago

@NikitaKlichko please keep the issue open until the PR is merged.