joaomdmoura / crewAI

Framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks.
https://crewai.com

integration with big data corpus #773

Open phdykd opened 2 weeks ago

phdykd commented 2 weeks ago

Hi, I have tried to integrate a big data corpus (all of them txt files, and they are very large documents). Each time I ask a question, it starts to search among all of the files in the corpus and cannot go directly to the relevant txt file where it can dig and find the answer, and I am getting this error. I have tried semantic reranking, splitting documents, and several other things, and I still get the error.

# Function to split large content into chunks
def split_content(text_content, max_length=10000):
    return [text_content[i:i + max_length] for i in range(0, len(text_content), max_length)]

# Chat with agents
def chat_with_agents(query, processed_texts):
    tasks = []
    agent, task_description = determine_agent(query)

    for file_path, text_content in processed_texts:
        # Split the content into manageable chunks
        content_chunks = split_content(text_content)
        for chunk in content_chunks:
            if len(chunk) > 1048576:
                continue  # Ensure the chunk length is within allowable limits
            tasks.append(create_task_with_content(
                agent,
                task_description,
                chunk,
                file_path  # Pass the file path as the source
            ))

    chat_crew = Crew(
        agents=[agent],
        tasks=tasks,
        process=Process.sequential
    )

    result = chat_crew.kickoff(inputs={'query': query})
    return result

# Example usage
if __name__ == "__main__":
    user_query = "What is the projected growth trajectory for PD-L1 therapy sales from 2022 to 2027, and what percentage of total oncology sales is it expected to represent in 2027?"
    ranked_texts = search_and_rank(user_query, output_directory)
    chat_result = chat_with_agents(user_query, ranked_texts)
    print(chat_result)

BadRequestError: Error code: 400 - {'error': {'message': "Invalid 'messages[0].content': string too long. Expected a string with maximum length 1048576, but got a string with length 1163812 instead.", 'type': 'invalid_request_error', 'param': 'messages[0].content', 'code': 'string_above_max_length'}}

greg80303 commented 2 weeks ago

This doesn't seem like the correct way to use CrewAI. However, I don't see the implementation of create_task_with_content. To be sure, can you post all of your code used to create the tasks?

If you want to do semantic searching or Q/A against a large data corpus, you should be putting that data into a vector store and then having a Tool that does a search of that corpus for chunks relevant to your query. Those search results are then inserted into the prompt as context and sent to the LLM. You should not be passing your raw corpus text into the prompts (by creating a task for each chunk).
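To make that concrete, here is a minimal sketch of such a tool, assuming the crewai_tools BaseTool interface; CorpusSearchTool, the vector_store field, and the k=5 setting are illustrative placeholders, not anything from your code:

from typing import Any
from crewai_tools import BaseTool

class CorpusSearchTool(BaseTool):
    name: str = "Corpus Search"
    description: str = (
        "Searches the embedded document corpus and returns the chunks "
        "most relevant to the query."
    )
    vector_store: Any = None  # any store exposing similarity_search(query, k)

    def _run(self, query: str) -> str:
        # Only the top-k relevant chunks are returned, not the raw corpus,
        # so the prompt stays well under the 1,048,576-character limit.
        docs = self.vector_store.similarity_search(query, k=5)
        return "\n\n".join(doc.page_content for doc in docs)

The agent is then created with tools=[CorpusSearchTool(vector_store=your_store)], and the task description only contains the query; the tool's output is what gets injected into the prompt as context.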

phdykd commented 2 weeks ago

Thanks for the reply. I have an Azure AI Search index. Do you think it would make sense to use CrewAI agents as mentioned above, integrating with the Azure AI Search index/vector store? Thanks again.

greg80303 commented 2 weeks ago

You'll need to define a custom tool. Either create your own, or maybe check LangChain to see if they have a tool already defined that will use your vector store. CrewAI supports LangChain tools natively.
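Since you mentioned an Azure AI Search index, a rough sketch of that route, assuming the langchain-community AzureSearch vector store and LangChain's create_retriever_tool helper (the endpoint, key, index name, and agent fields below are placeholders):

from crewai import Agent
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain.tools.retriever import create_retriever_tool

embeddings = OpenAIEmbeddings()

# Connect to the existing Azure AI Search index (placeholder values).
vector_store = AzureSearch(
    azure_search_endpoint="https://<your-service>.search.windows.net",
    azure_search_key="<your-admin-key>",
    index_name="<your-index-name>",
    embedding_function=embeddings.embed_query,
)

# Wrap the retriever as a LangChain tool; CrewAI accepts LangChain tools natively.
azure_search_tool = create_retriever_tool(
    vector_store.as_retriever(search_kwargs={"k": 5}),
    name="azure_corpus_search",
    description="Searches the Azure AI Search index for passages relevant to a query.",
)

analyst = Agent(
    role="Clinical Data Analyst",
    goal="Answer questions using the indexed document corpus.",
    backstory="Expert at digging through large clinical and market documents.",
    tools=[azure_search_tool],
)

At run time the agent calls the tool itself, so only the retrieved passages end up in the prompt rather than the raw corpus.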

greg80303 commented 2 weeks ago

I'm not sure your use case is complex enough for a crew. You actually don't need multiple tasks in your crew, because each of your query types only generates one task. Since each query falls into only one category, you should only be adding one task to your crew -- not all of the tasks. So your logic should be updated such that you select your task based on the query, then instantiate the crew with only that single task. So rather than multiple tasks, you actually have multiple crews, and you kick off the correct crew based on the query type.
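In code, that selection logic could look roughly like this, reusing your determine_agent helper; the expected_output value is a made-up example, and recent CrewAI versions require it on every Task:

from crewai import Crew, Process, Task

def run_query(query):
    # Pick exactly one agent and one task description based on the query type.
    agent, task_description = determine_agent(query)

    task = Task(
        description=task_description.format(query=query),
        expected_output="A concise answer to the query, citing the source documents.",
        agent=agent,
    )

    # One crew per query, containing only the task that matches the query type.
    crew = Crew(
        agents=[agent],
        tasks=[task],
        process=Process.sequential,
    )
    return crew.kickoff()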

While I think this will work, it is sort of underutilizing CrewAI. Typical crews have multiple agents and multiple tasks that need to be executed. Maybe this is just a first revision of your project and you plan to add more functionality later.

phdykd commented 2 weeks ago

Despite what you said, I should have multiple tasks. I am expecting these tasks to be executed...

agent_task_descriptions = {
    'clinical_data_analyst': "Analyze the provided clinical trial data to identify key outcomes and trends. Prepare a comprehensive report highlighting significant findings and their implications. Search Query: {query}",
    'regulatory_affairs_specialist': "Monitor recent regulatory approvals and summarize the key points. Identify any potential impact on ongoing or upcoming trials. Search Query: {query}",
    'clinical_trial_strategist': "Develop and propose optimizations for the current clinical trial strategy. Focus on improving patient recruitment, retention, and overall trial efficiency. Search Query: {query}",
    'pharmaceutical_market_analyst': "Track and analyze market trends related to new treatments. Predict future opportunities and provide a market forecast. Search Query: {query}"
}

greg80303 commented 2 weeks ago

From the code you posted earlier, here is how I read it:

Therefore, for every query, you create a new Crew instance with only a single agent and single task. You may be using all of your tasks across multiple Crew instantiations, but again, each Crew instance has only one task. This is where I was highlighting the fact that you are not using some of the more powerful features of CrewAI -- that is, to run a Crew with multiple agents in a multi-task workflow.

All that being said, I think the code that you showed me should work. You are just assigning a single specific agent and a single specific task description based on the contents of the query. You run a new Crew for each query -- that all makes sense. Are you still getting an error?
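And if at some point you do want all four roles to weigh in on a single query in one run, a multi-agent, multi-task crew would look roughly like this; the four Agent instances are assumed to already exist, agent_task_descriptions is the dict posted above, and user_query is the query string from your example usage:

from crewai import Crew, Process, Task

# Assumes clinical_data_analyst, regulatory_affairs_specialist,
# clinical_trial_strategist and pharmaceutical_market_analyst are Agent
# instances defined elsewhere in your code.
agents_by_role = {
    'clinical_data_analyst': clinical_data_analyst,
    'regulatory_affairs_specialist': regulatory_affairs_specialist,
    'clinical_trial_strategist': clinical_trial_strategist,
    'pharmaceutical_market_analyst': pharmaceutical_market_analyst,
}

# One task per role, built from the agent_task_descriptions dict.
tasks = [
    Task(
        description=description.format(query=user_query),
        expected_output="A short report addressing this role's task.",
        agent=agents_by_role[role],
    )
    for role, description in agent_task_descriptions.items()
]

full_crew = Crew(
    agents=list(agents_by_role.values()),
    tasks=tasks,
    process=Process.sequential,
)
result = full_crew.kickoff()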

iukea1 commented 2 weeks ago

Despite what you said, I should have multiple tasks. I am expecting these tasks to be executed... [agent_task_descriptions repeated from the earlier comment]

I don't mean to butt in or be rude, but your use case is complex enough that you need to use LangGraph and traditional LangChain.

The difficulty curve here is much greater, but the rewards are worth it. I'm currently going through 450 GB of raw structured and unstructured data per day.

Recommendation

Then use that embedded data to help you code your solution.

Otherwise, the skill set for LangGraph just isn't there yet in publicly available material (i.e., blogs).

But we now have an 18-agent system, and proper intra-agent communication and flow is being done with little loss of context between all agent steps.

In total, we are going through 40 million tokens per day.