openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

ValueError: Unknown encoding text-embedding-ada-002 #155

Closed · heavenkiller2018 closed this issue 1 year ago

heavenkiller2018 commented 1 year ago

When running the following code:

from langchain.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings()
embeddings = embedding_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)

the following error occurred:

ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 embeddings = embedding_model.embed_documents(
      2     [
      3         "Hi there!",
      4         "Oh, hello!",
      5         "What's your name?",
      6         "My friends call me World",
      7         "Hello World!"
      8     ]
      9 )
     10 len(embeddings), len(embeddings[0])

File ~/micromamba/envs/openai/lib/python3.11/site-packages/langchain/embeddings/openai.py:305, in OpenAIEmbeddings.embed_documents(self, texts, chunk_size)
    293 """Call out to OpenAI's embedding endpoint for embedding search docs.
    294 
    295 Args:
   (...)
    301     List of embeddings, one for each text.
    302 """
    303 # NOTE: to keep things simple, we assume the list may contain texts longer
    304 #       than the maximum context and use length-safe embedding function.
--> 305 return self._get_len_safe_embeddings(texts, engine=self.deployment)

File ~/micromamba/envs/openai/lib/python3.11/site-packages/langchain/embeddings/openai.py:225, in OpenAIEmbeddings._get_len_safe_embeddings(self, texts, engine, chunk_size)
    223 tokens = []
    224 indices = []
--> 225 encoding = tiktoken.get_encoding(self.model)
    226 for i, text in enumerate(texts):
    227     if self.model.endswith("001"):
    228         # See: https://github.com/openai/openai-python/issues/418#issuecomment-1525939500
    229         # replace newlines, which can negatively affect performance.

File ~/micromamba/envs/openai/lib/python3.11/site-packages/tiktoken/registry.py:60, in get_encoding(encoding_name)
     57     assert ENCODING_CONSTRUCTORS is not None
     59 if encoding_name not in ENCODING_CONSTRUCTORS:
---> 60     raise ValueError(f"Unknown encoding {encoding_name}")
     62 constructor = ENCODING_CONSTRUCTORS[encoding_name]
     63 enc = Encoding(**constructor())

ValueError: Unknown encoding text-embedding-ada-002

How can I fix this?

hauntsaninja commented 1 year ago

Looks like a bug in langchain; please report it there.
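
For context, the traceback shows langchain passing the model name ("text-embedding-ada-002") to tiktoken.get_encoding(), which only accepts encoding names such as "cl100k_base"; mapping a model name to its encoding is done by tiktoken.encoding_for_model(). A minimal sketch of the distinction (the tiktoken calls below are the actual API; whether a newer langchain release already fixes this is an assumption, so check the langchain issue tracker):

    import tiktoken

    # get_encoding() expects an encoding name, not a model name.
    enc = tiktoken.get_encoding("cl100k_base")

    # encoding_for_model() looks up the encoding used by a given model;
    # for "text-embedding-ada-002" this resolves to cl100k_base.
    enc = tiktoken.encoding_for_model("text-embedding-ada-002")

    print(enc.encode("Hello World!"))

Passing a model name to get_encoding() raises exactly the ValueError shown above, which is why the fix belongs in langchain's call site rather than in tiktoken.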