langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Azure rejects tokens sent by OpenAIEmbeddings, expects strings #6793

Closed. billsanto closed this issue 1 year ago

billsanto commented 1 year ago

System Info

LangChain 0.0.216, macOS 11.6, Python 3.11.

Who can help?

No response

Reproduction

  1. Set up OpenAIEmbeddings with Azure arguments
  2. Split text with a splitter such as RecursiveCharacterTextSplitter
  3. Pass the texts and the embedding function to Chroma.from_texts

import openai
import os
from dotenv import load_dotenv, find_dotenv
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

_ = load_dotenv(find_dotenv())

API_KEY = os.environ.get('STAGE_API_KEY')
API_VERSION = os.environ.get('API_VERSION')
RESOURCE_ENDPOINT = os.environ.get('RESOURCE_ENDPOINT')

openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
openai.api_version = API_VERSION
openai.log = "debug"

sample_text = 'This metabolite causes atherosclerosis in the liver[55]. Strengths and limitations This is the first thorough bibliometric analysis of nutrition and gut microbiota research conducted on a global level.'

embed_deployment_id = 'text-embedding-ada-002'
embed_model = 'text-embedding-ada-002'

persist_directory = "./storage_openai_chunks"  # will be created if not existing

embeddings = OpenAIEmbeddings(
    deployment=embed_deployment_id,
    model=embed_model,
    openai_api_key=API_KEY,
    openai_api_base=RESOURCE_ENDPOINT,
    openai_api_type="azure",
    openai_api_version=API_VERSION,
)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=40, chunk_overlap=10)
texts = text_splitter.split_text(sample_text)

vectordb = Chroma.from_texts(collection_name='test40',
                             texts=texts,
                             embedding=embeddings,
                             persist_directory=persist_directory)

vectordb.persist()
print(vectordb.get())

message='Request to OpenAI API' method=post path=https://***/openai/deployments/text-embedding-ada-002/embeddings?api-version=2023-05-15
api_version=2023-05-15 data='{"input": [[2028, 28168, 635, 11384, 264, 91882, 91711], [258, 279, 26587, 58, 2131, 948, 32937, 82, 323], [438, 9669, 1115, 374, 279, 1176], [1820, 1176, 17879, 44615, 24264], [35584, 315, 26677, 323, 18340], [438, 18340, 53499, 6217, 3495, 13375], [444, 55015, 389, 264, 3728, 2237, 13]], "encoding_format": "base64"}' message='Post details'
message='OpenAI API response' path=https://***/openai/deployments/text-embedding-ada-002/embeddings?api-version=2023-05-15 processing_ms=None request_id=None response_code=400
body='{\n  "error": "/input/6 expected type: String, found: JSONArray\\n/input/5 expected type: String, found: JSONArray\\n/input/4 expected type: String, found: JSONArray\\n/input/3 expected type: String, found: JSONArray\\n/input/2 expected type: String, found: JSONArray\\n/input/1 expected type: String, found: JSONArray\\n/input/0 expected type: String, found: JSONArray\\n/input expected: null, found: JSONArray\\n/input expected type: String, found: JSONArray"\n}' headers="{'Date': 'Tue, 27 Jun 2023 00:08:56 GMT', 'Content-Type': 'application/json; charset=UTF-8', 'Content-Length': '454', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=16070400; includeSubDomains', 'Set-Cookie': 'TS01bd4155=0179bf738063e38fbf3fffb70b7f9705fd626c2df1126f29599084aa69d137b77c61d6377a118a5ebe5a1f1f9f314c22a777a0e861; Path=/; Domain=.***', 'Vary': 'Accept-Encoding'}" message='API response body'
Traceback (most recent call last):
  File "/Users/A/dev/python/openai/langchain_embed_issue.py", line 39, in <module>
    vectordb = Chroma.from_texts(collection_name='test40',
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 403, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 148, in add_texts
    embeddings = self._embedding_function.embed_documents(list(texts))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 465, in embed_documents
    return self._get_len_safe_embeddings(texts, engine=self.deployment)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 302, in _get_len_safe_embeddings
    response = embed_with_retry(
               ^^^^^^^^^^^^^^^^^
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 97, in embed_with_retry
    return _embed_with_retry(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/site-packages/tenacity/__init__.py", line 289, in wrapped_f
    return self(f, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
    do = self.iter(retry_state=retry_state)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
    return fut.result()
           ^^^^^^^^^^^^
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 95, in _embed_with_retry
    return embeddings.client.create(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/site-packages/openai/api_resources/embedding.py", line 33, in create
    response = super().create(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
    response, _, api_key = requestor.request(
                           ^^^^^^^^^^^^^^^^^^
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/site-packages/openai/api_requestor.py", line 298, in request
    resp, got_stream = self._interpret_response(result, stream)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/site-packages/openai/api_requestor.py", line 700, in _interpret_response
    self._interpret_response_line(
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/site-packages/openai/api_requestor.py", line 763, in _interpret_response_line
    raise self.handle_error_response(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/A/anaconda3/envs/openai1/lib/python3.11/site-packages/openai/api_requestor.py", line 418, in handle_error_response
    error_code=error_data.get("code"),
               ^^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'get'

Process finished with exit code 1

Expected behavior

OpenAIEmbeddings should return embeddings instead of an error.

Azure currently accepts only str input, in contrast to OpenAI, which accepts either tokens or strings; the request is rejected because OpenAIEmbeddings sends tokens only. The Azure embedding API docs confirm this: the request body's input parameter is of type string: https://learn.microsoft.com/en-us/azure/cognitive-services/openai/reference#embeddings
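For illustration, the two request-body shapes can be sketched as plain dictionaries. This is only a sketch of the payloads already visible in the debug logs above; the token ids are copied from the first logged request.

```python
# Sketch of the two "input" shapes for the embeddings endpoint, based on
# the debug logs in this report. OpenAI accepts either form; per the Azure
# docs linked above, Azure accepts only a plain string.
token_batch_payload = {
    # what OpenAIEmbeddings currently sends: a batch of token-id arrays
    "input": [[2028, 28168, 635, 11384, 264, 91882, 91711]],
    "encoding_format": "base64",
}
string_payload = {
    # what Azure expects: one plain string per request
    "input": "This metabolite causes atherosclerosis",
    "encoding_format": "base64",
}

# Azure's validator rejects the first shape with
# "/input expected type: String, found: JSONArray", as logged above.
assert isinstance(string_payload["input"], str)
assert all(isinstance(item, list) for item in token_batch_payload["input"])
```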

Second, after modifying openai.py to send strings, Azure complains that it accepts only one input per request; in other words, it does not accept batches of strings (nor batches of tokens, even if it accepted tokens at all). The for-loop step was therefore changed to send one decoded batch of tokens (that is, the original str chunk) at a time.

Modifying embeddings/openai.py with:

        # batched_embeddings = []
        # _chunk_size = chunk_size or self.chunk_size
        # for i in range(0, len(tokens), _chunk_size):
        #     response = embed_with_retry(
        #         self,
        #         input=tokens[i : i + _chunk_size],
        #         **self._invocation_params,
        #     )
        #     batched_embeddings += [r["embedding"] for r in response["data"]]

        batched_embeddings = []
        _chunk_size = (chunk_size or self.chunk_size) if 'azure' not in self.openai_api_type else 1
        # azure only accepts str input, currently one list element at a time
        for i in range(0, len(tokens), _chunk_size):
            embed_input = encoding.decode(tokens[i]) if 'azure' in self.openai_api_type else tokens[i : i + _chunk_size]
            response = embed_with_retry(
                self,
                input=embed_input,
                **self._invocation_params,
            )
            batched_embeddings += [r["embedding"] for r in response["data"]]

and re-running the code:

# same code
...
message='Request to OpenAI API' method=post path=https://***/openai/deployments/text-embedding-ada-002/embeddings?api-version=2023-05-15
api_version=2023-05-15 data='{"input": "This metabolite causes atherosclerosis", "encoding_format": "base64"}' message='Post details'
message='OpenAI API response' path=https://***/openai/deployments/text-embedding-ada-002/embeddings?api-version=2023-05-15 processing_ms=27.0109 request_id=47bee143-cb00-4782-8560-f267ee839af4 response_code=200
body='{\n  "object": "list",\n  "data": [\n    {\n      "object": "embedding",\n      "index": 0,\n      "embedding": "5zPWu+V2e7w75Ia7HeCavKhhE71NQhA865WYvE+Y9DuB8ce8Xak7uhgQgble4z48H8L4uyePnzu2XVq8ucg+u7ZdWj28ofq7Jzd6PMFMkbvQiIq8nbuwPFJMLTxGe5i83c2lPIXQsjzPToc8taB/vZlZ7ryVjwM8jsiLPIvLfrywnBG9RjLEO2XkuTpOMz

... (removed for brevity)

/gP7uzTTC8RZf5PMOULTv2D4C7caQfvR60EbyqjZ48yqxUuzHeLzhSFJW8qDu5uwcj7zyeDnO8UMKvPNLEezxNixm6X7U3vBeDqzumrI08jzQqPDZObLzZS2c843itO9a+y7w+mJG8gChjPAIHqLqEeLg6ysUTvfqaizzT2yo77Di/u3A3azyziva8ct9VvI80Kry1n5U7ipJvvHy2FjuAQSK9"\n    }\n  ],\n  "model": "ada",\n  "usage": {\n    "prompt_tokens": 7,\n    "total_tokens": 7\n  }\n}\n' headers="{'Date': 'Tue, 27 Jun 2023 00:20:13 GMT', 'Content-Type': 'application/json', 'Content-Length': '8395', 'Connection': 'keep-alive', 'x-ms-region': 'East US', 'apim-request-id': 'b932333d-1eb9-415a-a84b-da1c5f95433b', 'x-content-type-options': 'nosniff, nosniff', 'openai-processing-ms': '26.8461', 'access-control-allow-origin': '*', 'x-request-id': '0677d084-2449-486c-9bff-b6ef07df004f', 'x-ms-client-request-id': 'b932333d-1eb9-415a-a84b-da1c5f95433b', 'strict-transport-security': 'max-age=31536000; includeSubDomains; preload, max-age=16070400; includeSubDomains', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block'}" message='API response body'
{'ids': ['60336172-1480-11ee-b223-acde48001122', '6033621c-1480-11ee-b223-acde48001122', '60336280-1480-11ee-b223-acde48001122', '603362b2-1480-11ee-b223-acde48001122', '603362da-1480-11ee-b223-acde48001122', '603362f8-1480-11ee-b223-acde48001122', '60336370-1480-11ee-b223-acde48001122'], 'embeddings': None, 'documents': ['This metabolite causes atherosclerosis', 'in the liver[55]. Strengths and', 'and limitations This is the first', 'the first thorough bibliometric', 'analysis of nutrition and gut', 'and gut microbiota research conducted', 'conducted on a global level.'], 'metadatas': [None, None, None, None, None, None, None]}
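As a stopgap that avoids patching site-packages, the same one-plain-string-per-request behaviour can be wrapped outside the library. The sketch below uses a stub in place of the real Azure call, so the helper and stub names (embed_texts_one_by_one, fake_azure_embed) are illustrative only:

```python
from typing import Callable, List

def embed_texts_one_by_one(
    texts: List[str],
    embed_one: Callable[[str], List[float]],
) -> List[List[float]]:
    """Embed each chunk as a single plain string, one request at a time,
    which is what the Azure endpoint currently requires."""
    return [embed_one(text) for text in texts]

# Stub standing in for a real Azure embeddings call, illustration only.
def fake_azure_embed(text: str) -> List[float]:
    return [float(len(text)), 0.0]

chunks = ["This metabolite", "causes atherosclerosis"]
vectors = embed_texts_one_by_one(chunks, fake_azure_embed)  # one vector per chunk
```

In real use, embed_one would issue a single-string POST to the deployment's embeddings endpoint, matching the successful request shape in the log above.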

Also made the following change to the async path in openai.py a few lines later, although this is untested:

        batched_embeddings = []
        _chunk_size = (chunk_size or self.chunk_size) if 'azure' not in self.openai_api_type else 1
        # azure only accepts str input, currently one list element at a time
        for i in range(0, len(tokens), _chunk_size):
            embed_input = encoding.decode(tokens[i]) if 'azure' in self.openai_api_type else tokens[i : i + _chunk_size]
            response = await async_embed_with_retry(
                self,
                input=embed_input,
                **self._invocation_params,
            )
            batched_embeddings += [r["embedding"] for r in response["data"]]

dosubot[bot] commented 1 year ago

Hi, @billsanto! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you reported an issue where Azure rejects tokens sent by OpenAIEmbeddings because it expects strings. You tried modifying the code to send strings instead of tokens, but Azure still complains because it only accepts one input at a time.

Since there hasn't been any activity or comments on this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!

skoulik commented 6 months ago

This is still a real issue. Will it be fixed?