microsoft / autogen

A programming framework for agentic AI 🤖
https://microsoft.github.io/autogen/

Attempting RAG with .csv files. #829

Closed · pranavvr-lumiq closed this issue 7 months ago

pranavvr-lumiq commented 12 months ago

I am trying to modify the following bit of code to read through a set of .csv files and generate an output.

```python
boss_aid = RetrieveUserProxyAgent(
    name="Boss_Assistant",
    is_termination_msg=termination_msg,
    system_message="Assistant who has extra content retrieval power for solving difficult problems.",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=3,
    retrieve_config={
        "task": "code",
        "docs_path": "https://raw.githubusercontent.com/microsoft/FLAML/main/website/docs/Examples/Integrate%20-%20Spark.md",
        "chunk_token_size": 1000,
        "model": config_list[0]["model"],
        "client": chromadb.PersistentClient(path="/tmp/chromadb"),
        "collection_name": "groupchat",
        "get_or_create": True,
    },
    code_execution_config=False,  # we don't want to execute code in this case
)
```

pranavvr-lumiq commented 12 months ago

Here is my attempt to do so:

```python
underwriter_files = ['file1.csv', 'file2.csv', ...]  # etc.
```

```python
boss_aid = RetrieveUserProxyAgent(
    name="Boss_Assistant",
    is_termination_msg=termination_msg,
    system_message="Assistant who has extra content retrieval power for solving difficult problems.",
    human_input_mode="TERMINATE",
    max_consecutive_auto_reply=3,
    retrieve_config={
        "task": "code",
        "docs_path": underwriter_files,
        "chunk_token_size": 1000,
        "model": config_list,
        "client": chromadb.PersistentClient(path="/tmp/chromadb"),
        "collection_name": "groupchat",
        "get_or_create": True,
    },
    code_execution_config=False,  # we don't want to execute code in this case
)
```

pranavvr-lumiq commented 12 months ago

However, I am getting the following error:

```
Trying to create collection.
max_tokens is too small to fit a single line of text. Breaking this line:
_id,workItemID ...
Failed to split docs with must_break_at_empty_line being True, set to False.
max_tokens is too small to fit a single line of text. Breaking this line:
POL_ID,AUREOUS_RISK_SCORE1,AUREOUS_RISK_BAND1,AUREOUS_RISK_SCORE2,AUREOUS_RISK_BAND2,AUREOUS_RISK_SC ...
Failed to split docs with must_break_at_empty_line being True, set to False.
max_tokens is too small to fit a single line of text. Breaking this line:
"POL_ID","FCRR_RATING" ...
Failed to split docs with must_break_at_empty_line being True, set to False.
max_tokens is too small to fit a single line of text. Breaking this line:
"POL_ID","IIB_QUEST_IS_NEGATIVE" ...
Failed to split docs with must_break_at_empty_line being True, set to False.
max_tokens is too small to fit a single line of text. Breaking this line:
"POL_ID","IIB_SCORE" ...
Failed to split docs with must_break_at_empty_line being True, set to False.
max_tokens is too small to fit a single line of text. Breaking this line:
"CLI_ID","POL_ID","LA_EXST_CLI_IND","CLI_BTH_DT","AGE_PROOF_TYP_CD","CLI_SEX_CD","CLI_MARIT_STAT_CD" ...
Failed to split docs with must_break_at_empty_line being True, set to False.
max_tokens is too small to fit a single line of text. Breaking this line:
"POL_ID","BNFY1_REL_INSRD_CD","BNFY2_REL_INSRD_CD","BNFY3_REL_INSRD_CD" ...
Failed to split docs with must_break_at_empty_line being True, set to False.
max_tokens is too small to fit a single line of text. Breaking this line:
POL_ID,POL_BILL_MODE_CD,PLAN_ID,POL_MPREM_AMT,CVG_FACE_AMT,POLICY_TERM,PPT,PREMIUM_FREQUENCY ...
Failed to split docs with must_break_at_empty_line being True, set to False.
max_tokens is too small to fit a single line of text. Breaking this line:
"POL_ID","PROPOSER_RELATIONSHIP","PROPOSER_EARN_INCM_AMT" ...
Failed to split docs with must_break_at_empty_line being True, set to False.
max_tokens is too small to fit a single line of text. Breaking this line:
"POL_ID","TRC_PROPOSAL" ...
Failed to split docs with must_break_at_empty_line being True, set to False.
max_tokens is too small to fit a single line of text. Breaking this line:
"POL_ID","UW_DECISION" ...
Failed to split docs with must_break_at_empty_line being True, set to False.
doc_ids: [['doc_1675', 'doc_1766', 'doc_1767']]

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[12], line 1
----> 1 rag_chat()

Cell In[8], line 9, in rag_chat()
      6 manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)
      8 # Start chatting with boss_aid as this is the user proxy agent.
----> 9 boss_aid.initiate_chat(
     10     manager,
     11     problem=PROBLEM,
     12     n_results=3,
     13 )

File ~/anaconda3/envs/pyautogen/lib/python3.10/site-packages/autogen/agentchat/conversable_agent.py:550, in ConversableAgent.initiate_chat(self, recipient, clear_history, silent, **context)
    536 """Initiate a chat with the recipient agent.
    537
    538 Reset the consecutive auto reply counter.
   (...)
    547 "message" needs to be provided if the generate_init_message method is not overridden.
    548 """
    549 self._prepare_chat(recipient, clear_history)
--> 550 self.send(self.generate_init_message(**context), recipient, silent=silent)

File ~/anaconda3/envs/pyautogen/lib/python3.10/site-packages/autogen/agentchat/contrib/retrieve_user_proxy_agent.py:420, in RetrieveUserProxyAgent.generate_init_message(self, problem, n_results, search_string)
    418 self.problem = problem
    419 self.n_results = n_results
--> 420 doc_contents = self._get_context(self._results)
    421 message = self._generate_message(doc_contents, self._task)
    422 return message

File ~/anaconda3/envs/pyautogen/lib/python3.10/site-packages/autogen/agentchat/contrib/retrieve_user_proxy_agent.py:252, in RetrieveUserProxyAgent._get_context(self, results)
    250 if results["ids"][0][idx] in self._doc_ids:
    251     continue
--> 252 _doc_tokens = self.custom_token_count_function(doc, self._model)
    253 if _doc_tokens > self._context_max_tokens:
    254     func_print = f"Skip doc_id {results['ids'][0][idx]} as it is too long to fit in the context."

File ~/anaconda3/envs/pyautogen/lib/python3.10/site-packages/autogen/token_count_utils.py:57, in count_token(input, model)
     48 """Count number of tokens used by an OpenAI model.
     49 Args:
     50     input: (str, list, dict): Input to the model.
   (...)
     54     int: Number of tokens from the input.
     55 """
     56 if isinstance(input, str):
---> 57     return _num_token_from_text(input, model=model)
     58 elif isinstance(input, list) or isinstance(input, dict):
     59     return _num_token_from_messages(input, model=model)

File ~/anaconda3/envs/pyautogen/lib/python3.10/site-packages/autogen/token_count_utils.py:67, in _num_token_from_text(text, model)
     65 """Return the number of tokens used by a string."""
     66 try:
---> 67     encoding = tiktoken.encoding_for_model(model)
     68 except KeyError:
     69     logger.warning(f"Model {model} not found. Using cl100k_base encoding.")

File ~/anaconda3/envs/pyautogen/lib/python3.10/site-packages/tiktoken/model.py:97, in encoding_for_model(model_name)
     92 def encoding_for_model(model_name: str) -> Encoding:
     93     """Returns the encoding used by a model.
     94
     95     Raises a KeyError if the model name is not recognised.
     96     """
---> 97     return get_encoding(encoding_name_for_model(model_name))

File ~/anaconda3/envs/pyautogen/lib/python3.10/site-packages/tiktoken/model.py:73, in encoding_name_for_model(model_name)
     68 """Returns the name of the encoding used by a model.
     69
     70 Raises a KeyError if the model name is not recognised.
     71 """
     72 encoding_name = None
---> 73 if model_name in MODEL_TO_ENCODING:
     74     encoding_name = MODEL_TO_ENCODING[model_name]
     75 else:
     76     # Check if the model matches a known prefix
     77     # Prefix matching avoids needing library updates for every model version release
     78     # Note that this can match on non-existent models (e.g., gpt-3.5-turbo-FAKE)

TypeError: unhashable type: 'list'
```
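The traceback points at the immediate cause: `tiktoken.encoding_for_model(model)` is handed the entire `config_list` (a list, hence "unhashable type") because the second snippet sets `"model": config_list` where the first one used `config_list[0]["model"]`. The earlier `max_tokens is too small` warnings are a separate, more benign issue: each CSV header row is a single line longer than `chunk_token_size`, so raising that value would likely quiet them. A minimal sketch of the corrected config, keeping everything else from the snippet above (the larger chunk size is illustrative):

```python
retrieve_config = {
    "task": "code",
    "docs_path": underwriter_files,
    "chunk_token_size": 2000,          # larger so long CSV header rows fit in one chunk
    "model": config_list[0]["model"],  # pass the model *name* string, not the whole config list
    "client": chromadb.PersistentClient(path="/tmp/chromadb"),
    "collection_name": "groupchat",
    "get_or_create": True,
}
```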

joshkyh commented 12 months ago
```
max_tokens is too small to fit a single line of text. Breaking this line:
_id,workItemID ...
Failed to split docs with must_break_at_empty_line being True, set to False.
max_tokens is too small to fit a single line of text. Breaking this line:
POL_ID,AUREOUS_RISK_SCORE1,AUREOUS_RISK_BAND1,AUREOUS_RISK_SCORE2,AUREOUS_RISK_BAND2,AUREOUS_RISK_SC ...
```

It looks like the program is trying to create embeddings for structured data.

A SQL database might be a more natural fit for structured CSV data than a RAG vector store. I'm not sure if you have seen https://github.com/microsoft/autogen/blob/main/notebook/agentchat_langchain.ipynb; it might be helpful.
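For illustration, a minimal sketch of that alternative, assuming pandas and the standard-library sqlite3 module are available; the database, table, and file names are hypothetical, and the query reuses the `POL_ID` column visible in the error log:

```python
import sqlite3
import pandas as pd

# Load each CSV into its own table in a local SQLite database,
# then query with plain SQL instead of embedding the rows.
conn = sqlite3.connect("underwriting.db")   # hypothetical database name
for path in ["file1.csv", "file2.csv"]:     # hypothetical file names
    table = path.rsplit(".", 1)[0]          # "file1.csv" -> table "file1"
    pd.read_csv(path).to_sql(table, conn, if_exists="replace", index=False)

# Example query against one of the loaded tables.
rows = conn.execute("SELECT COUNT(*) FROM file1 WHERE POL_ID IS NOT NULL").fetchall()
print(rows)
conn.close()
```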

pranavvr-lumiq commented 11 months ago

Also getting the following error:

```
TypeError: stat: path should be string, bytes, os.PathLike or integer, not list
```

whoami02 commented 11 months ago

> Also getting the following error: TypeError: stat: path should be string, bytes, os.PathLike or integer, not list

Your .csv files are given as a list. Either iterate over the list (see the sketch below) or just provide a single file name directly.
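A minimal sketch of the iteration approach, reusing `termination_msg`, `config_list`, and `underwriter_files` from the snippets above; building one agent and one collection per file is an assumption, not the library's prescribed pattern:

```python
import chromadb
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent

def make_rag_agent(csv_path: str, collection: str) -> RetrieveUserProxyAgent:
    """Build one retrieval agent per CSV file instead of passing a list."""
    return RetrieveUserProxyAgent(
        name=f"Boss_Assistant_{collection}",
        is_termination_msg=termination_msg,
        human_input_mode="NEVER",
        max_consecutive_auto_reply=3,
        retrieve_config={
            "task": "code",
            "docs_path": csv_path,                 # a single path, not a list
            "chunk_token_size": 1000,
            "model": config_list[0]["model"],      # a model name string
            "client": chromadb.PersistentClient(path="/tmp/chromadb"),
            "collection_name": collection,         # one collection per file
            "get_or_create": True,
        },
        code_execution_config=False,
    )

agents = [
    make_rag_agent(path, f"groupchat_{i}")
    for i, path in enumerate(underwriter_files)
]
```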

thinkall commented 7 months ago

Closing as there has been no further feedback from the issue's original author.