langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

There seems to be a bug with OpenAIEmbeddings. #27448

Open fenggaobj opened 3 days ago

fenggaobj commented 3 days ago

Example Code

from langchain_openai import OpenAIEmbeddings

# Point the OpenAI client at a local Ollama server's OpenAI-compatible API
embeddings = OpenAIEmbeddings(
    openai_api_base='http://localhost:11434/v1',
    model="nomic-embed-text",
)

vector = embeddings.embed_query("hello")
print(vector[:3])

Error Message and Stack Trace (if applicable)

Traceback (most recent call last):
  File "/workspace/mywork/langchainproject/test6.py", line 13, in <module>
    vector = embeddings.embed_query("hello")
  File "/workspace/download/miniforge/envs/langchain/lib/python3.9/site-packages/langchain_openai/embeddings/base.py", line 629, in embed_query
    return self.embed_documents([text])[0]
  File "/workspace/download/miniforge/envs/langchain/lib/python3.9/site-packages/langchain_openai/embeddings/base.py", line 588, in embed_documents
    return self._get_len_safe_embeddings(texts, engine=engine)
  File "/workspace/download/miniforge/envs/langchain/lib/python3.9/site-packages/langchain_openai/embeddings/base.py", line 483, in _get_len_safe_embeddings
    response = self.client.create(
  File "/workspace/download/miniforge/envs/langchain/lib/python3.9/site-packages/openai/resources/embeddings.py", line 124, in create
    return self._post(
  File "/workspace/download/miniforge/envs/langchain/lib/python3.9/site-packages/openai/_base_client.py", line 1277, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
  File "/workspace/download/miniforge/envs/langchain/lib/python3.9/site-packages/openai/_base_client.py", line 954, in request
    return self._request(
  File "/workspace/download/miniforge/envs/langchain/lib/python3.9/site-packages/openai/_base_client.py", line 1058, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'error': {'message': 'invalid input type', 'type': 'api_error', 'param': None, 'code': None}}

Description

The Embeddings.create method in the openai package declares its input parameter as Union[str, List[str], Iterable[int], Iterable[Iterable[int]]]. However, in the langchain OpenAIEmbeddings class, the _get_len_safe_embeddings method passes it the tokens returned by _tokenize, which are annotated as List[Union[List[int], str]]. That annotation is not one of the types Embeddings.create declares.
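To make the difference between the input shapes concrete, here is a minimal sketch of two request bodies in the style of an OpenAI-compatible /v1/embeddings endpoint: one carrying plain text and one carrying pre-tokenized input. The helper name build_embeddings_payload and the token IDs are hypothetical placeholders, not values taken from langchain or tiktoken, and the sketch only serializes JSON without contacting any server.

```python
import json
from typing import List, Union

# The four shapes named in the description, narrowed to lists for JSON.
EmbeddingInput = Union[str, List[str], List[int], List[List[int]]]


def build_embeddings_payload(model: str, value: EmbeddingInput) -> str:
    """Serialize a request body shaped like an OpenAI-style embeddings call."""
    return json.dumps({"model": model, "input": value})


# Plain-text input: a list of strings.
text_body = build_embeddings_payload("nomic-embed-text", ["hello"])

# Pre-tokenized input: a list of token-ID lists (IDs are placeholders).
token_body = build_embeddings_payload("nomic-embed-text", [[101, 102, 103]])

print(text_body)
print(token_body)
```

A server that only implements the string forms would be able to handle the first body but could reject the second, which would be consistent with the 400 "invalid input type" response in the traceback.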

I believe this to be a bug. Could you please advise on how to handle this issue?

System Info

from langchain_core import sys_info
sys_info.print_sys_info()

System Information

OS: Linux
OS Version: #1 SMP Thu Apr 7 21:37:58 CST 2022
Python Version: 3.9.20 | packaged by conda-forge | (main, Sep 30 2024, 17:49:10) [GCC 13.3.0]

Package Information

langchain_core: 0.3.10
langchain: 0.3.3
langchain_community: 0.3.2
langsmith: 0.1.135
langchain_experimental: 0.3.2
langchain_openai: 0.2.2
langchain_text_splitters: 0.3.0

Optional packages not installed

langgraph
langserve

Other Dependencies

aiohttp: 3.10.10
async-timeout: 4.0.3
dataclasses-json: 0.6.7
httpx: 0.27.2
jsonpatch: 1.33
numpy: 1.26.4
openai: 1.51.2
orjson: 3.10.7
packaging: 24.1
pydantic: 2.9.2
pydantic-settings: 2.5.2
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
SQLAlchemy: 2.0.36
tenacity: 8.5.0
tiktoken: 0.8.0
typing-extensions: 4.12.2

ethanglide commented 2 days ago

I'm not sure if I am misunderstanding what is going on here, but this works just fine:

from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(
    base_url='http://localhost:11434', # optional
    model='nomic-embed-text'
)

vector = embeddings.embed_query("hello")
print(vector[:3])

This works as long as your base_url is correct and Ollama has the nomic-embed-text model pulled.

wuyue92tree commented 1 day ago

It's because Ollama does not support that data structure yet.

https://github.com/ollama/ollama/blob/main/docs/openai.md#v1embeddings
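If the unsupported structure is the token-ID array that the default OpenAIEmbeddings path sends, one possible workaround is to have the client send raw strings instead. The sketch below assumes the installed langchain_openai exposes a check_embedding_ctx_length flag on OpenAIEmbeddings and that setting it to False bypasses the tokenizing path; please verify both against your version. The function name make_ollama_compatible_embeddings and the "ollama" API key are hypothetical placeholders.

```python
def make_ollama_compatible_embeddings(
    model: str = "nomic-embed-text",
    base_url: str = "http://localhost:11434/v1",
):
    """Build an OpenAIEmbeddings client intended to send strings, not token IDs."""
    # Imported lazily so this module can be loaded without langchain_openai.
    from langchain_openai import OpenAIEmbeddings

    return OpenAIEmbeddings(
        openai_api_base=base_url,
        model=model,
        # Placeholder key; assumes the local server does not validate it.
        openai_api_key="ollama",
        # Assumption: skips the tiktoken-based length check so that the
        # input text is passed through as plain strings.
        check_embedding_ctx_length=False,
    )


# Usage (requires a running Ollama server with the model pulled):
# embeddings = make_ollama_compatible_embeddings()
# print(embeddings.embed_query("hello")[:3])
```

The trade-off, under the same assumption, is that long inputs are no longer split to fit the model's context length, so oversized texts would need to be chunked before embedding.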

fenggaobj commented 5 hours ago

@ethanglide @wuyue92tree Thank you very much for your assistance. I am pleased to inform you that OllamaEmbeddings is functioning properly. However, I have encountered some issues with OpenAIEmbeddings.

The problem lies in the _get_len_safe_embeddings method within the langchain_openai/embeddings/base.py file. When this method calls the create method in openai/resources/embeddings.py, it passes an argument of type List[Union[List[int], str]]. Unfortunately, the create method does not declare support for this type. Its supported types are Union[str, List[str], Iterable[int], Iterable[Iterable[int]]], which does not include List[Union[List[int], str]].

Here is the implementation code for the create method in openai/resources/embeddings.py:

class Embeddings(SyncAPIResource):  
    def create(  
        self,  
        *,  
        input: Union[str, List[str], Iterable[int], Iterable[Iterable[int]]],  # does not include List[Union[List[int], str]], which _tokenize returns in the langchain code
        model: Union[str, EmbeddingModel],  
        #.....................................  
    ) -> CreateEmbeddingResponse:

And here is the implementation code for the _get_len_safe_embeddings method in langchain_openai/embeddings/base.py:

    def _get_len_safe_embeddings(
        self, texts: List[str], *, engine: str, chunk_size: Optional[int] = None
    ):
        _chunk_size = chunk_size or self.chunk_size
        _iter, tokens, indices = self._tokenize(texts, _chunk_size)
        batched_embeddings: List[List[float]] = []
        for i in _iter:
            response = self.client.create(
                input=tokens[i : i + _chunk_size], **self._invocation_params
            )
            #....................................

    def _tokenize(
        self, texts: List[str], chunk_size: int
    ) -> Tuple[Iterable[int], List[Union[List[int], str]], List[int]]:
        #....................................

In this context, the tokens returned by self._tokenize have a type of List[Union[List[int], str]].
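The typing argument can be illustrated with a small runtime check. This is only a sketch: classify_embedding_input is a hypothetical helper, not part of openai or langchain, and it approximates the Iterable annotations with lists. It shows that a list mixing token-ID lists and strings, which the List[Union[List[int], str]] annotation permits, matches none of the four declared shapes. Whether _tokenize ever produces such a mixed batch at runtime is not established here.

```python
from typing import Any


def classify_embedding_input(value: Any) -> str:
    """Name the declared `input` shape a value matches, or 'unsupported'."""
    if isinstance(value, str):
        return "str"
    if not isinstance(value, list) or not value:
        return "unsupported"
    if all(isinstance(item, str) for item in value):
        return "List[str]"
    if all(isinstance(item, int) for item in value):
        return "Iterable[int]"
    if all(
        isinstance(item, list) and all(isinstance(tok, int) for tok in item)
        for item in value
    ):
        return "Iterable[Iterable[int]]"
    return "unsupported"


print(classify_embedding_input(["hello"]))           # List[str]
print(classify_embedding_input([[1, 2], [3]]))       # Iterable[Iterable[int]]
print(classify_embedding_input([[1, 2], "hello"]))   # unsupported
```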

I hope this detailed explanation helps in addressing the issue. Thank you once again for your help.