run-llama / llama_docs_bot

Bottoms Up Development with LlamaIndex - Building a Documentation Chatbot
MIT License

4_embeddings: ValueError: "InstructorEmbeddings" object has no field "_model" #6

Open jonmach opened 9 months ago

jonmach commented 9 months ago

I'm working through the llama_docs_bot files and have hit an issue with the InstructorEmbeddings class, which subclasses BaseEmbedding:

Running the following:

# set the batch size to 1 to avoid memory issues
# if you have a large GPU, you can increase this
instructor_embeddings = InstructorEmbeddings(embed_batch_size=1)

I get the following error:


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/jon/dev/LLM/LLaMaIndex/llama_docs_bot/4_embeddings/4_embeddings.ipynb Cell 10 line 3
      1 # set the batch size to 1 to avoid memory issues
      2 # if you have a large GPU, you can increase this
----> 3 instructor_embeddings = InstructorEmbeddings(embed_batch_size=1)

/Users/jon/dev/LLM/LLaMaIndex/llama_docs_bot/4_embeddings/4_embeddings.ipynb Cell 10 line 1
      6 def __init__(
      7     self,
      8     instructor_model_name: str = "hkunlp/instructor-large",
      9     instruction: str = "Represent the Computer Science text for retrieval:",
     10     **kwargs: Any,
     11 ) -> None:
---> 12     self._model = INSTRUCTOR(instructor_model_name)
     13     self._instruction = instruction
     14     super().__init__(**kwargs)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pydantic/main.py:357, in pydantic.main.BaseModel.__setattr__()

ValueError: "InstructorEmbeddings" object has no field "_model"

This is a list of my installed modules with versions etc.


Package                     Version
--------------------------- ------------
aiohttp                     3.9.1
aiosignal                   1.3.1
aiostream                   0.5.2
alembic                     1.13.0
altair                      5.2.0
annotated-types             0.6.0
anyio                       3.7.1
appdirs                     1.4.4
appnope                     0.1.3
argon2-cffi                 23.1.0
argon2-cffi-bindings        21.2.0
arrow                       1.3.0
asttokens                   2.4.1
async-lru                   2.0.4
attrs                       23.1.0
Babel                       2.13.1
backoff                     2.2.1
beautifulsoup4              4.12.2
bleach                      6.1.0
blinker                     1.7.0
cachetools                  5.3.2
certifi                     2023.11.17
cffi                        1.16.0
charset-normalizer          3.3.2
click                       8.1.7
cohere                      4.37
comm                        0.2.0
contourpy                   1.2.0
cycler                      0.12.1
dataclasses-json            0.6.3
datasets                    2.15.0
debugpy                     1.8.0
decorator                   5.1.1
defusedxml                  0.7.1
Deprecated                  1.2.14
dill                        0.3.7
distro                      1.8.0
dnspython                   2.4.2
entrypoints                 0.4
executing                   2.0.1
Faker                       20.1.0
fastavro                    1.9.0
fastjsonschema              2.19.0
favicon                     0.7.0
filelock                    3.13.1
fonttools                   4.46.0
fqdn                        1.5.1
frozendict                  2.3.10
frozenlist                  1.4.0
fsspec                      2023.10.0
gitdb                       4.0.11
GitPython                   3.1.40
greenlet                    3.0.1
h11                         0.14.0
htbuilder                   0.6.2
html2text                   2020.1.16
httpcore                    1.0.2
httpx                       0.25.2
huggingface-hub             0.19.4
humanize                    4.9.0
idna                        3.6
importlib-metadata          6.11.0
InstructorEmbedding         1.0.0
ipykernel                   6.27.1
ipython                     8.18.1
ipywidgets                  8.1.1
isoduration                 20.11.0
jedi                        0.19.1
Jinja2                      3.1.2
joblib                      1.3.2
json5                       0.9.14
jsonpatch                   1.33
jsonpointer                 2.4
jsonschema                  4.20.0
jsonschema-specifications   2023.11.2
jupyter                     1.0.0
jupyter_client              8.6.0
jupyter-console             6.6.3
jupyter_core                5.5.0
jupyter-events              0.9.0
jupyter-lsp                 2.2.1
jupyter_server              2.11.2
jupyter_server_terminals    0.4.4
jupyterlab                  4.0.9
jupyterlab_pygments         0.3.0
jupyterlab_server           2.25.2
jupyterlab-widgets          3.0.9
kaggle                      1.5.16
kiwisolver                  1.4.5
langchain                   0.0.348
langchain-core              0.0.12
langsmith                   0.0.69
litellm                     1.11.1
llama-index                 0.9.13
loguru                      0.7.2
lxml                        4.9.3
Mako                        1.3.0
Markdown                    3.5.1
markdown-it-py              3.0.0
markdownlit                 0.0.7
MarkupSafe                  2.1.3
marshmallow                 3.20.1
matplotlib                  3.8.2
matplotlib-inline           0.1.6
mdurl                       0.1.2
merkle-json                 1.0.0
millify                     0.1.1
mistune                     3.0.2
more-itertools              10.1.0
mpmath                      1.3.0
multidict                   6.0.4
multiprocess                0.70.15
munch                       4.0.0
mypy-extensions             1.0.0
nbclient                    0.9.0
nbconvert                   7.12.0
nbformat                    5.9.2
nest-asyncio                1.5.8
networkx                    3.2.1
nltk                        3.8.1
notebook                    7.0.6
notebook_shim               0.2.3
numpy                       1.26.2
openai                      1.3.7
overrides                   7.4.0
packaging                   23.2
pandas                      2.1.3
pandocfilters               1.5.0
parso                       0.8.3
pexpect                     4.9.0
Pillow                      10.1.0
pinecone-client             2.2.4
pip                         23.3.1
platformdirs                4.1.0
prometheus-client           0.19.0
prompt-toolkit              3.0.41
protobuf                    4.25.1
psutil                      5.9.6
ptyprocess                  0.7.0
pure-eval                   0.2.2
pyarrow                     14.0.1
pyarrow-hotfix              0.6
pycparser                   2.21
pydantic                    1.10.13
pydantic_core               2.14.5
pydeck                      0.8.1b0
Pygments                    2.17.2
pymdown-extensions          10.5
pyparsing                   3.1.1
pypdf                       3.17.1
python-dateutil             2.8.2
python-decouple             3.8
python-dotenv               1.0.0
python-json-logger          2.0.7
python-slugify              8.0.1
pytz                        2023.3.post1
PyYAML                      6.0.1
pyzmq                       25.1.1
qtconsole                   5.5.1
QtPy                        2.4.1
referencing                 0.31.1
regex                       2023.10.3
requests                    2.31.0
rfc3339-validator           0.1.4
rfc3986-validator           0.1.1
rich                        13.7.0
rpds-py                     0.13.2
safetensors                 0.4.1
scikit-learn                1.3.2
scipy                       1.11.4
Send2Trash                  1.8.2
sentence-transformers       2.2.2
sentencepiece               0.1.99
setuptools                  65.5.0
six                         1.16.0
slack-bolt                  1.18.1
slack-sdk                   3.26.1
smmap                       5.0.1
sniffio                     1.3.0
soupsieve                   2.5
SQLAlchemy                  2.0.23
st-annotated-text           4.0.1
stack-data                  0.6.3
streamlit                   1.29.0
streamlit-aggrid            0.3.4.post3
streamlit-camera-input-live 0.2.0
streamlit-card              0.0.61
streamlit-embedcode         0.1.2
streamlit-extras            0.3.5
streamlit-faker             0.0.3
streamlit-image-coordinates 0.1.6
streamlit-javascript        0.1.5
streamlit-keyup             0.2.0
streamlit-toggle-switch     1.0.2
streamlit-vertical-slider   1.0.2
sympy                       1.12
tenacity                    8.2.3
terminado                   0.18.0
text-unidecode              1.3
threadpoolctl               3.2.0
tiktoken                    0.5.2
tinycss2                    1.2.1
tokenizers                  0.15.0
toml                        0.10.2
toolz                       0.12.0
torch                       2.1.1
torchvision                 0.16.1
tornado                     6.4
tqdm                        4.66.1
traitlets                   5.14.0
transformers                4.35.2
trulens-eval                0.18.2
types-python-dateutil       2.8.19.14
typing_extensions           4.5.0
typing-inspect              0.8.0
tzdata                      2023.3
tzlocal                     5.2
uri-template                1.3.0
urllib3                     1.26.18
validators                  0.22.0
wcwidth                     0.2.12
webcolors                   1.13
webencodings                0.5.1
websocket-client            1.7.0
widgetsnbextension          4.0.9
wrapt                       1.16.0
xxhash                      3.4.1
yarl                        1.9.3
you-get                     0.4.1650
zipp                        3.17.0
jonmach commented 9 months ago

Problem resolved by adding:

    _model: INSTRUCTOR = PrivateAttr()
    _instruction: str = PrivateAttr()

to the InstructorEmbeddings class
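
For readers who want to see where those two declarations sit, here is a minimal sketch of the pattern. It is an illustration under assumptions, not the notebook's code: it subclasses a plain pydantic `BaseModel` instead of `BaseEmbedding`, and `FakeEncoder`, `SketchEmbeddings` and the default instruction string are hypothetical stand-ins so the example runs without downloading a model.

```python
from typing import Any, List

from pydantic import BaseModel, PrivateAttr


class FakeEncoder:
    """Hypothetical stand-in for INSTRUCTOR: one number per [instruction, text] pair."""

    def encode(self, pairs: List[List[str]]) -> List[List[float]]:
        return [[float(len(text))] for _instruction, text in pairs]


class SketchEmbeddings(BaseModel):
    # A regular, validated pydantic field.
    embed_batch_size: int = 10

    # Private attributes: declared with PrivateAttr so pydantic does not
    # treat them as fields and accepts assignments to them.
    _model: Any = PrivateAttr()
    _instruction: str = PrivateAttr()

    def __init__(self, instruction: str = "Represent the text:", **kwargs: Any) -> None:
        # Assumption: calling super().__init__ first, which is intended to
        # keep the sketch working on both pydantic 1.x and 2.x.
        super().__init__(**kwargs)
        self._model = FakeEncoder()
        self._instruction = instruction

    def get_text_embedding(self, text: str) -> List[float]:
        return self._model.encode([[self._instruction, text]])[0]


emb = SketchEmbeddings(embed_batch_size=1)
print(emb.get_text_embedding("hello"))  # [5.0] with the fake encoder
```

In the real class, `FakeEncoder()` would be replaced by `INSTRUCTOR(instructor_model_name)` and the base class by `BaseEmbedding`.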

Omegapy commented 8 months ago

I also got the error:

ValueError: "InstructorEmbeddings" object has no field "_model"

My solution

This is my fix:

class InstructorEmbeddings(BaseEmbedding):

    # Declare the model as a private attribute (PrivateAttr, as in the fix
    # above) so that pydantic accepts the assignment in __init__.
    _model: INSTRUCTOR = PrivateAttr()
    _instruction: str = "Represent the Computer Science text for retrieval:"

    def __init__(
        self,
        instructor_model_name: str = "hkunlp/instructor-large",
        **kwargs: Any,
    ) -> None:
        super().__init__(**kwargs)
        # Store the model on the instance; a bare local `_model` would be
        # discarded and `model` would be undefined in the methods below.
        self._model = INSTRUCTOR(instructor_model_name)

    def _get_query_embedding(self, query: str) -> List[float]:
        embeddings = self._model.encode([[self._instruction, query]])
        return embeddings[0].tolist()

    async def _aget_query_embedding(self, query: str) -> List[float]:
        return self._get_query_embedding(query)

    def _get_text_embedding(self, text: str) -> List[float]:
        embeddings = self._model.encode([[self._instruction, text]])
        return embeddings[0].tolist()

    def _get_text_embeddings(self, texts: List[str]) -> List[List[float]]:
        embeddings = self._model.encode([[self._instruction, text] for text in texts])
        return embeddings.tolist()

My results are the same as the video:

embed = instructor_embeddings.get_text_embedding("How do I create a vector index?")
print(len(embed))
print(embed[:10])
768
[0.003987060859799385, 0.012122981250286102, 0.002690523862838745, 0.01581709273159504, -0.005555964540690184, 0.03673827275633812, 0.010727009736001492, 0.00666137645021081, -0.0392913892865181, 0.013146855868399143]
mvitas commented 1 month ago

Current working solution as of Jul 23rd 2024.

from typing import Any, List
from InstructorEmbedding import INSTRUCTOR
from llama_index.core.embeddings import BaseEmbedding
from pydantic import Extra

class InstructorEmbeddings(BaseEmbedding):

    class Config:
        extra = Extra.allow

    _instruction: str = "Represent the Computer Science text for retrieval:"

    def __init__(
        self, 
        instructor_model_name: str = "hkunlp/instructor-large",
        **kwargs: Any
    ) -> None:
        super().__init__(**kwargs)
        self._model: INSTRUCTOR = INSTRUCTOR(instructor_model_name)

    def _get_query_embedding(self, query: str) -> List[float]:
        embeddings = self._model.encode([[self._instruction, query]])
        return embeddings[0].tolist()

    async def _aget_query_embedding(self, query: str) -> List[float]:
        return self._get_query_embedding(query)

    def _get_text_embedding(self, text: str) -> List[float]:
        embeddings = self._model.encode([[self._instruction, text]])
        return embeddings[0].tolist() 

    def _get_text_embeddings(self, texts: List[str]) -> List[List[float]]:
        embeddings = self._model.encode([[self._instruction, text] for text in texts])
        return embeddings.tolist()

Adding the model as an instance variable with pydantic validation in place

The BaseEmbedding class uses pydantic validation, so out of the box the InstructorEmbeddings subclass cannot be given any attribute that is not a declared field.

Add the following to allow extra fields to be defined:

class Config:
    extra = Extra.allow
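
To make the mechanism concrete, here is a toy sketch of the kind of attribute check that produces this error. `StrictBase` and its `fields`, `private_attributes` and `allow_extra` attributes are hypothetical names that only imitate the behaviour described in this thread. This is not pydantic's actual implementation.

```python
class StrictBase:
    """Toy imitation of a validating base class (not pydantic's real code)."""

    fields = {"embed_batch_size"}   # declared fields
    private_attributes = set()      # names declared as private attributes
    allow_extra = False             # imitates `extra = Extra.allow` when True

    def __setattr__(self, name, value):
        cls = type(self)
        if name in cls.private_attributes or cls.allow_extra or name in cls.fields:
            object.__setattr__(self, name, value)
            return
        raise ValueError(f'"{cls.__name__}" object has no field "{name}"')


class Undeclared(StrictBase):
    pass


class WithPrivateAttr(StrictBase):
    private_attributes = {"_model"}


class WithExtraAllowed(StrictBase):
    allow_extra = True


def try_assign(cls):
    """Try to set `_model` on a fresh instance and report what happened."""
    obj = cls()
    try:
        obj._model = "encoder"
        return "ok"
    except ValueError as exc:
        return str(exc)


for cls in (Undeclared, WithPrivateAttr, WithExtraAllowed):
    print(cls.__name__, "->", try_assign(cls))
```

Only `Undeclared` is rejected. Declaring the name as private or allowing extra attributes both let the assignment through, which matches the two fixes posted in this issue.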

Initialize the BaseEmbedding superclass before initializing the model

def __init__(
    self,
    instructor_model_name: str = "hkunlp/instructor-large",
    **kwargs: Any
) -> None:
    super().__init__(**kwargs)  # placement of this line is important
    self._model: INSTRUCTOR = INSTRUCTOR(instructor_model_name)
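
The comment does not say why the placement matters. One plausible reason, stated here as an assumption, is that the base class only creates its internal bookkeeping inside `__init__`, so an assignment made earlier has nothing to record itself in. The toy sketch below shows that general pattern; `BookkeepingBase`, `fields_set`, `AssignFirst` and `SuperFirst` are hypothetical names and do not come from pydantic or LlamaIndex.

```python
class BookkeepingBase:
    """Toy base class whose __setattr__ relies on state created in __init__."""

    def __init__(self, **data):
        # Bypass our own __setattr__ to create the bookkeeping set.
        object.__setattr__(self, "fields_set", set())
        for name, value in data.items():
            setattr(self, name, value)

    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        # Raises AttributeError if __init__ has not run yet.
        self.fields_set.add(name)


class AssignFirst(BookkeepingBase):
    def __init__(self, **kwargs):
        self._model = "encoder"  # too early: fields_set does not exist yet
        super().__init__(**kwargs)


class SuperFirst(BookkeepingBase):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._model = "encoder"  # safe: the base class is initialised


def outcome(cls):
    """Instantiate cls and report whether construction succeeded."""
    try:
        cls(embed_batch_size=1)
        return "ok"
    except AttributeError as exc:
        return f"AttributeError: {exc}"


print("AssignFirst:", outcome(AssignFirst))
print("SuperFirst:", outcome(SuperFirst))
```

In this sketch `AssignFirst` fails and `SuperFirst` succeeds, which mirrors the ordering used in the working solution above.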