run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: ValidationError: 1 validation error for EmbeddingEndEvent embeddings -> 0 value is not a valid list (type=type_error.list) #12913

Closed — NarasimmanSaravana1994 closed this issue 3 months ago

NarasimmanSaravana1994 commented 6 months ago

Bug Description

I created a custom embedding model using InstructorEmbedding (reference: https://docs.llamaindex.ai/en/stable/examples/embeddings/custom_embeddings/).

After the embedding model was created, validating it with some sample text raised the exception below.

Code:

    import os
    from typing import Any, List

    import openai
    from InstructorEmbedding import INSTRUCTOR

    from llama_index.core.bridge.pydantic import PrivateAttr
    from llama_index.core.embeddings import BaseEmbedding


    class InstructorEmbeddings(BaseEmbedding):
        _model: INSTRUCTOR = PrivateAttr()
        _instruction: str = PrivateAttr()

        def __init__(
            self,
            instructor_model_name: str = "hkunlp/instructor-large",
            instruction: str = "Represent a document for semantic search:",
            **kwargs: Any,
        ) -> None:
            self._model = INSTRUCTOR(instructor_model_name)
            self._instruction = instruction
            super().__init__(**kwargs)

        @classmethod
        def class_name(cls) -> str:
            return "instructor"

        async def _aget_query_embedding(self, query: str) -> List[float]:
            return self._get_query_embedding(query)

        async def _aget_text_embedding(self, text: str) -> List[float]:
            return self._get_text_embedding(text)

        def _get_query_embedding(self, query: str) -> List[float]:
            embeddings = self._model.encode([[self._instruction, query]])
            return embeddings[0]

        def _get_text_embedding(self, text: str) -> List[float]:
            embeddings = self._model.encode([[self._instruction, text]])
            return embeddings[0]

        def _get_text_embeddings(self, texts: List[str]) -> List[List[float]]:
            embeddings = self._model.encode(
                [[self._instruction, text] for text in texts]
            )
            return embeddings


    embed_model = InstructorEmbeddings(embed_batch_size=2)

    embeddings = embed_model.get_text_embedding('my name is narasimman')

    embeddings

Version

llama-index 0.10.9

Steps to Reproduce

The code to reproduce is shared in the bug description above.

Relevant Logs/Tracebacks

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Cell In[29], line 1
----> 1 embeddings = embed_model.get_text_embedding('my name is narasimman')

File D:\POC\LLM\venv\Lib\site-packages\llama_index\core\instrumentation\dispatcher.py:102, in Dispatcher.span.<locals>.wrapper(*args, **kwargs)
    100     result = func(*args, **kwargs)
    101 except Exception as e:
--> 102     self.span_drop(id=id, err=e)
    103 else:
    104     self.span_exit(id=id, result=result)

File D:\POC\LLM\venv\Lib\site-packages\llama_index\core\instrumentation\dispatcher.py:77, in Dispatcher.span_drop(self, id, err, **kwargs)
     75 while c:
     76     for h in c.span_handlers:
---> 77         h.span_drop(id, err, **kwargs)
     78     if not c.propagate:
     79         c = None

File D:\POC\LLM\venv\Lib\site-packages\llama_index\core\instrumentation\span_handlers\base.py:45, in BaseSpanHandler.span_drop(self, id, err, **kwargs)
     43 def span_drop(self, id: str, err: Optional[Exception], **kwargs) -> None:
     44     """Logic for dropping a span i.e. early exit."""
---> 45     self.prepare_to_drop_span(id, err, **kwargs)
     46     if self.current_span_id == id:
     47         self.current_span_id = self.open_spans[id].parent_id

File D:\POC\LLM\venv\Lib\site-packages\llama_index\core\instrumentation\span_handlers\null.py:33, in NullSpanHandler.prepare_to_drop_span(self, id, err, **kwargs)
     31 """Logic for droppping a span."""
     32 if err:
---> 33     raise err

File D:\POC\LLM\venv\Lib\site-packages\llama_index\core\instrumentation\dispatcher.py:100, in Dispatcher.span.<locals>.wrapper(*args, **kwargs)
     98 self.span_enter(id=id, **kwargs)
     99 try:
--> 100     result = func(*args, **kwargs)
    101 except Exception as e:
    102     self.span_drop(id=id, err=e)

File D:\POC\LLM\venv\Lib\site-packages\llama_index\core\base\embeddings\base.py:235, in BaseEmbedding.get_text_embedding(self, text)
    227     text_embedding = self._get_text_embedding(text)
    229     event.on_end(
    230         payload={
    231             EventPayload.CHUNKS: [text],
    232             EventPayload.EMBEDDINGS: [text_embedding],
    233         }
    234     )
--> 235 dispatcher.event(EmbeddingEndEvent(chunks=[text], embeddings=[text_embedding]))
    236 return text_embedding

File D:\POC\LLM\venv\Lib\site-packages\pydantic\v1\main.py:341, in BaseModel.__init__(__pydantic_self__, **data)
    339 values, fields_set, validation_error = validate_model(__pydantic_self__.__class__, data)
    340 if validation_error:
--> 341     raise validation_error
    342 try:
    343     object_setattr(__pydantic_self__, '__dict__', values)

ValidationError: 1 validation error for EmbeddingEndEvent
embeddings -> 0
  value is not a valid list (type=type_error.list)
dosubot[bot] commented 6 months ago

The issue you're facing is due to the EmbeddingEndEvent expecting a list of lists of floats (List[List[float]]) for the embeddings, but it's receiving a single list of floats instead. To fix this, you need to adjust the _get_text_embedding method in your InstructorEmbeddings class to ensure it returns a list of lists. Here's the corrected method:

def _get_text_embedding(self, text: str) -> List[List[float]]:
    embeddings = self._model.encode([[self._instruction, text]])
    return [embeddings[0]]  # Wrap the embedding in a list

This change wraps the embedding in an additional list, matching the expected format. If you're using a similar pattern for query embeddings, apply the same fix to those methods as well. This should resolve the validation error you're encountering.

NarasimmanSaravana1994 commented 6 months ago

    class InstructorEmbeddings(BaseEmbedding):
        _model: INSTRUCTOR = PrivateAttr()
        _instruction: str = PrivateAttr()

        def __init__(
            self,
            instructor_model_name: str = "hkunlp/instructor-large",
            instruction: str = "Represent a document for semantic search:",
            **kwargs: Any,
        ) -> None:
            self._model = INSTRUCTOR(instructor_model_name)
            self._instruction = instruction
            super().__init__(**kwargs)

        @classmethod
        def class_name(cls) -> str:
            return "instructor"

        async def _aget_query_embedding(self, query: str) -> List[float]:
            return self._get_query_embedding(query)

        async def _aget_text_embedding(self, text: str) -> List[float]:
            return self._get_text_embedding(text)

        def _get_query_embedding(self, query: str) -> List[float]:
            embeddings = self._model.encode([[self._instruction, query]])
            return embeddings[0]

        def _get_text_embedding(self, text: str) -> List[List[float]]:
            embeddings = self._model.encode([[self._instruction, text]])
            return [embeddings[0]]  # Wrap the embedding in a list

Still, I face the same issue even after changing the code as suggested in the comment above.
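
Editor's note: a likely reason the wrap-in-a-list change does not help is that `INSTRUCTOR.encode` returns a numpy ndarray, so `embeddings[0]` is itself an ndarray, and pydantic v1 (which validates `EmbeddingEndEvent`) rejects an ndarray as "not a valid list" regardless of nesting. The sketch below illustrates the mismatch and the `.tolist()` conversion that would keep `_get_text_embedding` returning a plain `List[float]`. This is an assumption, not a confirmed fix; `fake_encode` is a hypothetical stand-in for `INSTRUCTOR.encode`:

```python
import numpy as np

# Hypothetical stand-in for INSTRUCTOR.encode, which (by assumption here)
# returns a numpy ndarray of shape (batch_size, embedding_dim).
def fake_encode(batch):
    return np.ones((len(batch), 4), dtype=np.float32)

embeddings = fake_encode([["Represent a document:", "my name is narasimman"]])

# Indexing an ndarray yields another ndarray, not a Python list --
# this is what pydantic v1's list validation rejects.
row = embeddings[0]
assert not isinstance(row, list)

# Converting with .tolist() yields a plain list of Python floats,
# which satisfies a List[float] field.
fixed = embeddings[0].tolist()
assert isinstance(fixed, list)
assert all(isinstance(x, float) for x in fixed)
```

Under this assumption, the reporter's original method signature can stay `-> List[float]`; only the return statement changes to `return embeddings[0].tolist()` (and likewise `embeddings.tolist()` in `_get_text_embeddings`).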