microsoft / autogen

A programming framework for agentic AI. Discord: https://aka.ms/autogen-dc. Roadmap: https://aka.ms/autogen-roadmap
https://microsoft.github.io/autogen/
Creative Commons Attribution 4.0 International

[Issue]: pgvector query returning bytes instead of string #2667

Open capella-ben opened 1 month ago

capella-ben commented 1 month ago

Describe the issue

When running the pgvector example I get the following error:

m:\OneDrive\Documents\dev\DISCO\pgVector\.venv\lib\site-packages\huggingface_hub\file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
m:\OneDrive\Documents\dev\DISCO\pgVector\.venv\lib\site-packages\huggingface_hub\file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Trying to create collection.
2024-05-12 19:51:30,510 - autogen.agentchat.contrib.retrieve_user_proxy_agent - INFO - Use the existing collection `flaml_collection`.
File M:\OneDrive\Documents\dev\DISCO\pgVector\..\website\docs does not exist. Skipping.
2024-05-12 19:51:30,974 - autogen.agentchat.contrib.retrieve_user_proxy_agent - INFO - Found 2 chunks.
2024-05-12 19:51:30,975 - autogen.agentchat.contrib.vectordb.pgvectordb - INFO - Error executing select on non-existent table: flaml_collection. Creating it instead. Error: relation "flaml_collection" does not exist 
LINE 1: SELECT id, metadatas, documents, embedding FROM flaml_collec...
                                                        ^
2024-05-12 19:51:31,007 - autogen.agentchat.contrib.vectordb.pgvectordb - INFO - Created table flaml_collection
VectorDB returns doc_ids:  [[b'bdfbc921', b'7968cf3c']]
Traceback (most recent call last):
  File "m:\OneDrive\Documents\dev\DISCO\pgVector\autogen_pgvector_1.py", line 84, in <module>
    chat_result = ragproxyagent.initiate_chat(
  File "m:\OneDrive\Documents\dev\DISCO\pgVector\.venv\lib\site-packages\autogen\agentchat\conversable_agent.py", line 1004, in initiate_chat
    msg2send = message(_chat_info["sender"], _chat_info["recipient"], kwargs)
  File "m:\OneDrive\Documents\dev\DISCO\pgVector\.venv\lib\site-packages\autogen\agentchat\contrib\retrieve_user_proxy_agent.py", line 631, in message_generator
    doc_contents = sender._get_context(sender._results)
  File "m:\OneDrive\Documents\dev\DISCO\pgVector\.venv\lib\site-packages\autogen\agentchat\contrib\retrieve_user_proxy_agent.py", line 426, in _get_context
    _doc_tokens = self.custom_token_count_function(doc["content"], self._model)
  File "m:\OneDrive\Documents\dev\DISCO\pgVector\.venv\lib\site-packages\autogen\token_count_utils.py", line 69, in count_token
    raise ValueError(f"input must be str, list or dict, but we got {type(input)}")
ValueError: input must be str, list or dict, but we got <class 'bytes'>    

After some investigation, I found that psycopg (3.1.19) always returns the id and descriptions fields as bytes, while pgvectordb.py expects strings. It is unclear why these two fields are always returned as bytes; other fields in other tables on my Postgres server are returned as strings, as expected.

The only workaround I can find is to decode in pgvectordb.py:

Update pgvectordb py
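
A minimal sketch of the kind of decode step described above. This is not the actual change to pgvectordb.py: the helper name `to_str`, the UTF-8 encoding, and the example row are all assumptions made for illustration. The row only mirrors the column order of the SELECT shown in the log (id, metadatas, documents, embedding).

```python
from typing import Any


def to_str(value: Any, encoding: str = "utf-8") -> Any:
    """Decode bytes-like values to str; return anything else unchanged.

    Hypothetical helper, not part of autogen or psycopg.
    """
    if isinstance(value, (bytes, bytearray, memoryview)):
        return bytes(value).decode(encoding)
    return value


# Hypothetical row in the column order of the SELECT from the log:
# (id, metadatas, documents, embedding). The values are made up,
# except the id, which is copied from the "VectorDB returns doc_ids" line.
row = (b"bdfbc921", {"source": "example"}, b"some document text", [0.1, 0.2])

# Decode only the two fields that come back as bytes.
doc_id, metadata, document = to_str(row[0]), row[1], to_str(row[2])
print(type(doc_id).__name__, type(document).__name__)
```

Values that are already strings pass through unchanged, so the same helper would still be safe if the driver later returns str for these columns.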

Is this an issue others are facing?

Steps to reproduce

Screenshots and logs


Additional Information

pyautogen-0.2.27

ErikQQY commented 1 month ago

Same issue here, stack trace:

    chat_result = ragproxyagent.initiate_chat(
  File "/root/anaconda3/envs/qqy/lib/python3.10/site-packages/autogen/agentchat/conversable_agent.py", line 988, in initiate_chat
    msg2send = message(_chat_info["sender"], _chat_info["recipient"], kwargs)
  File "/root/anaconda3/envs/qqy/lib/python3.10/site-packages/autogen/agentchat/contrib/retrieve_user_proxy_agent.py", line 631, in message_generator
    doc_contents = sender._get_context(sender._results)
  File "/root/anaconda3/envs/qqy/lib/python3.10/site-packages/autogen/agentchat/contrib/retrieve_user_proxy_agent.py", line 426, in _get_context
    _doc_tokens = self.custom_token_count_function(doc["content"], self._model)
  File "/root/anaconda3/envs/qqy/lib/python3.10/site-packages/autogen/token_count_utils.py", line 65, in count_token
    return _num_token_from_text(input, model=model)
  File "/root/anaconda3/envs/qqy/lib/python3.10/site-packages/autogen/token_count_utils.py", line 75, in _num_token_from_text
    encoding = tiktoken.encoding_for_model(model)
  File "/root/anaconda3/envs/qqy/lib/python3.10/site-packages/tiktoken/model.py", line 101, in encoding_for_model
    return get_encoding(encoding_name_for_model(model_name))
  File "/root/anaconda3/envs/qqy/lib/python3.10/site-packages/tiktoken/model.py", line 77, in encoding_name_for_model
    if model_name in MODEL_TO_ENCODING:
TypeError: unhashable type: 'dict'

thinkall commented 2 weeks ago

Thank you for reporting, @capella-ben! @Knucklessg1, could you help take a look?

Knucklessg1 commented 1 week ago

I will look into this.