run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
36.23k stars 5.16k forks

Help with graph structure, any advice is appreciated! #1239

Closed batou069 closed 8 months ago

batou069 commented 1 year ago

Hi all!

I'll start by apologizing for this long thread; I'm not good at getting to the point, as you can see... I'm knee-deep into this and have so many questions, and very few people I can ask (nobody, really, lol). I really want to make this happen, since I honestly believe in the potential and value it can bring.

Intro & Context

I'm working on a hackathon project for work: a Slack bot that answers questions about our product, so far created in Google Colab as a POC. I'm a Product Operations person and have no real programming background (I know SQL and taught myself QBasic as a kid, but that's it), and I really have a hard time understanding the "reference" part of the docs and translating the info into actual functions. So bear with me; I'm really proud that I got this far (thanks, GPT-4, and thanks for all the example notebooks!)

I did read the WHOLE documentation, multiple times; believe me, it's not so clear for someone like me (not an engineer, but a technically inclined person: I do have a Proxmox server and play around with Docker, but that's it, lol). I'm good at finding answers online and applying them to my cases.

Our product documentation is dispersed; information lives across 4 different platforms (3 KBs & Slack channels) and we have many knowledge gaps. Some info is on point and relevant, but other articles can be outdated or wrong, and some topics are just missing. Slack in our org is super active and many questions only get answered in specific Slack channels, but those are hard to navigate and find. In general, I'd say that any employee searching for answers has a hard time finding them, and not all employees have access to all knowledge bases...

My idea was to have centralized knowledge access via Slack, no matter the source. By logging (and analyzing) all questions, answers & emoji reactions to Snowflake, we would be able to identify areas that need updated or new content (no answer, or a negative reaction to an answer). That would replace a knowledge-related project I really didn't want to do (or believe in) with a smart, AI-based tool, bringing demand-driven knowledge improvement.
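As an illustration, the kind of record I'd log for that analysis could look like this (a minimal sketch; the field names are purely illustrative, not an existing schema):

```python
from datetime import datetime, timezone

def build_feedback_record(question, answer, reactions):
    """Assemble one analytics row (illustrative schema, not an existing one)."""
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
        # emoji name -> count, e.g. {"+1": 3, "-1": 1}
        "reactions": reactions,
        # crude "content gap" signal: more negative than positive reactions
        "needs_review": reactions.get("-1", 0) > reactions.get("+1", 0),
    }

record = build_feedback_record("How do I reset X?", "Settings > X.", {"+1": 0, "-1": 2})
print(record["needs_review"])  # True
```

Rows like this would go to Snowflake; the `needs_review` flag is just one naive way to surface answers people reacted badly to.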

I thought of "simply" indexing each source and building a graph of those indices that can be queried.

I do have a working POC, but I'm not confident in how I structured it.

Any hint or advice is appreciated !

Let me share my questions first, followed by code and some info:

Questions:

1) High-level structure I created an index for each source, added a broad explanation of each as a summary, and combined those into a graph. I wonder if this is the right approach. A graph needs summaries, and it doesn't make much sense to summarize each source as a whole (summarize Confluence...?). Wouldn't it make more sense to first index each article separately, then create one graph per source, and then create one graph made out of those 4 graphs? The docs talk about being able to stack indices over indices indefinitely.

2) Vector vs. Tree vs. ... I created SimpleVector indices for each source and combined them as a TreeIndex graph. The example notebooks I saw used either List or Knowledge Graph indices for the graph. I thought a list would force the query to go over all the content for every query, which I wanted to avoid, and a knowledge graph wouldn't be a good fit since my data is not structured the way a knowledge graph should be. Is my gut feeling wrong? Or, if I want to make sure to get combined answers from multiple sources, do I then need to use a ListIndex for the graph?

Also, if I stack multiple indices on top of each other (as suggested in question 1), what index type would make sense in such a case? Tree over Tree over Vector? Tree over Keyword over Vector? I saw the page about Routing vs. Synthesis over Heterogeneous Data, and I get the impression it would make sense to index each article with either a vector index or a keyword index, have each article summarized, create a tree graph for each of the 3 knowledge bases, and then 1 ListIndex for the final graph. Slack is different, since it's one big .txt file that contains question, answer, summary. Maybe this could be used differently, I'm not sure...

I have a hard time understanding the differences between indices, since from my understanding all indices turn text into a vector full of numbers, not only the vector one, so it's a bit confusing...

3) query_configs I did not define num_children or child_branch_factor, and I'm not sure if I defined index_struct_type, query_mode, similarity_top_k or response_mode correctly. Did I miss a parameter, did I choose some parameters wrongly? Does something in my code scream "this will make the query slow!"?

4) Text Chunking I did not specify chunking; I assume it defaults to some chunk size? Articles on all the platforms vary from very short to very long, with no clear pattern.

5) Playground I think I could answer some of my questions if I understood how to use the Playground with such a complex structure. It would be nice if someone could explain how to test not only different indices for the docs, but also different models, different indices for the graph, different query_configs, etc.

6) Summaries Like I already mentioned, I understand the summaries I wrote make no sense, and from those summaries the LLM won't know which node to route a query to... Are those summaries really necessary? Can't I just take the whole content of all 4 sources and index everything? I'm sure I can, but would it be more efficient?

7) Making it future-proof The short-term goal of this bot is to identify areas of improvement, give feedback to product owners, and have content created. This means articles can be updated, removed, and new ones created. I of course want the most up-to-date version of the content in my index. What needs to change to pull/load/index everything again while skipping what didn't change, indexing only the delta and removing whatever was removed?
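For question 7, a sketch of the delta logic I have in mind (plain Python, names are my own invention; the changed IDs would then go through the index's insert/delete methods rather than a full rebuild):

```python
import hashlib

def content_hash(text):
    """Fingerprint an article's text so unchanged articles can be skipped."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_articles(previous, current):
    """previous/current map article_id -> article text.
    Returns (ids to (re)index, ids to delete)."""
    prev = {k: content_hash(v) for k, v in previous.items()}
    curr = {k: content_hash(v) for k, v in current.items()}
    to_index = [k for k, h in curr.items() if prev.get(k) != h]  # new or changed
    to_delete = [k for k in prev if k not in curr]               # removed at the source
    return to_index, to_delete

old = {"a": "v1", "b": "stays", "d": "gone"}
new = {"a": "v2", "b": "stays", "c": "brand new"}
print(diff_articles(old, new))  # (['a', 'c'], ['d'])
```

The hash map would be persisted alongside the index between runs, so each sync only touches the delta.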

Info & Code

What I did: I'm dealing with 4 sources: 3 knowledge bases and Slack. KBs: a client-facing help center, Confluence, and Bloomfire.

I could not make the Confluence connector nor the official "Confluence by atlassian-python-api" work (I need to basic-auth with a base64-encoded "username:apitoken" combo, but both solutions ask for username and password separately; I couldn't figure out a solution), so I grabbed the articles from the relevant spaces with a simple get_data loop in a script. Each article is one CSV file containing the columns title, content, URL. All CSVs are in the same folder. For the other 2 KBs I got 1 JSON file per KB containing all articles.
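For what it's worth, as far as I understand, HTTP basic auth is just the base64 of "username:apitoken" in an Authorization header, so building it by hand should be equivalent to passing the two values separately (credentials below are made up):

```python
import base64

def basic_auth_header(username, api_token):
    """Build the same Authorization header that username/password basic auth produces."""
    creds = base64.b64encode(f"{username}:{api_token}".encode("utf-8")).decode("ascii")
    return {"Authorization": "Basic " + creds}

# made-up credentials, just to show the shape of the header
print(basic_auth_header("me@example.com", "my-api-token"))
```

If that's right, a connector that asks for username and password separately should simply accept the username plus the API token as the password, since it produces exactly this header under the hood.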

I'm not copying the whole thing, only the most relevant parts I have questions about or a lack of confidence in:

Loading content & Indexing

reader = JSONReader()

# LLMPredictor (gpt-4)
llm_predictor_gpt4 = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-4"))

service_context_gpt4 = ServiceContext.from_defaults(llm_predictor=llm_predictor_gpt4)

# Contentful
docs1 = reader.load_data("contentful_articles.json")
index1 = GPTSimpleVectorIndex.from_documents(docs1, service_context=service_context_gpt4)
index1.save_to_disk('contentful_index.json')
index1 = GPTSimpleVectorIndex.load_from_disk('contentful_index.json')

# Confluence
docs2 = SimpleDirectoryReader('docs').load_data()
index2 = GPTSimpleVectorIndex.from_documents(docs2, service_context=service_context_gpt4)
index2.save_to_disk('confluence_index.json')
index2 = GPTSimpleVectorIndex.load_from_disk('confluence_index.json')

Contentful and Bloomfire were handled the same way: loading with JSONReader and indexing with GPTSimpleVectorIndex.

Slack channels were different: due to PII issues I haven't yet been given access to production Slack, so I created a new Slack workspace, invited 2 colleagues, and we reproduced some discussions from the real Slack in which questions were answered. I then created a script that, with OpenAI's help, keeps only those threads where the LLM decided a question was answered:

import openai

def process_messages(messages):
    chatgpt_prompt = (
        "If the following thread contains a question that you deem answered by the responses, "
        "please summarize it as Question: question | Answer: answer | Summary: summary. "
        "If the question is not answered, skip it completely. "
        "The Summary should be <=20 words. I will take your response and create an Index from it.")
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": chatgpt_prompt + "\n\n" + "\n".join(messages)}],
    )
    return response["choices"][0]["message"]["content"]

That way I created a .txt file containing only the answered questions from the channels of interest, plus a summary. I loaded this one file with SimpleDirectoryReader and used the same GPTSimpleVectorIndex.
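A small parser for the model's pipe-delimited replies could look like this (a sketch, assuming the model sticks to the requested format):

```python
def parse_thread_summary(line):
    """Parse 'Question: q | Answer: a | Summary: s' into a dict, or None if malformed."""
    fields = {}
    for part in line.split("|"):
        key, sep, value = part.strip().partition(":")
        if not sep:
            # a segment without 'key: value' shape means the model ignored the format
            return None
        fields[key.strip().lower()] = value.strip()
    # require all three expected fields before accepting the line
    if {"question", "answer", "summary"} <= fields.keys():
        return fields
    return None

print(parse_thread_summary("Question: How to reset X? | Answer: Settings > X. | Summary: Reset via settings."))
```

Dropping malformed lines instead of indexing them keeps the occasional off-format LLM reply out of the index.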

Graph

Summaries

index1_summary = "A client-facing knowledgebase"
index2_summary = "Those are articles from our Confluence knowledgebase for teams A, B, C, D, E, F and G. Here you can learn about the teams, the product and internal information"
index3_summary = "This is a collection of articles with very detailed information about the technical side of X. Very relevant for {employee type Y}"
index4_summary = "A collection of answered questions from multiple company Slack chat channels"

(replaced some too specific info with generic stuff)

Building Graph

all_indices = [index1, index2, index3, index4]
index_summaries=[index1_summary, index2_summary, index3_summary, index4_summary]

graph = ComposableGraph.from_indices(GPTTreeIndex, all_indices, index_summaries=index_summaries)    
graph.save_to_disk("graph.json")

Querying

# set query config
query_configs = [
    {
        "index_struct_type": "simple_dict",
        "query_mode": "default",
        "query_kwargs": {
            "similarity_top_k": 3,
            "response_mode": "tree_summarize"
        }
    },
]

response = graph.query(text, query_configs=query_configs)

print(f"Your question was {text}: \nThe answer is: {str(response)}")

And this is sent to the user in Slack.

smyja commented 1 year ago
daxeel commented 1 year ago

I am using a graph over 2 indices, but querying the graph takes 20 seconds while an individual index takes 5-6 seconds. How do I make graph queries faster?

batou069 commented 1 year ago
  • You shouldn't have separate files for each article; all articles can be in one JSON file or a dictionary, with the title, links etc. as keys.

For 2 of the KBs I do have 1 JSON per source, each containing all articles. But for Confluence I can pull at most 50 articles per request, and we have over 500 articles. I could pull all articles and join them into one JSON, but I feel there would be no difference between that and using the SimpleDirectoryReader.

  • You shouldn't continue indexing if you've indexed once already. Just load from disk. Only index when it's really needed/data source has been updated with something vital.

I just showed how I indexed because it may be part of the problem; when I run the bot, the indexing is commented out and I only load from disk.

manuel-84 commented 1 year ago

let us know if you manage to make this faster

DanBruckner commented 1 year ago

We are having the same issue with the graph taking forever (around the same 20 seconds). Has anyone had any luck with their graph working quickly? If so, which version of llama-index are you on and would you mind sharing your code? Thanks!

dosubot[bot] commented 1 year ago

Hi, @batou069! I'm Dosu, and I'm here to help the LlamaIndex team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you are seeking advice on the structure and optimization of a graph for your Slack bot project. You have questions about the high-level structure, index types, query configurations, text chunking, and making the system future-proof. There have been some suggestions from other users, such as using a single JSON file or dictionary for all articles and loading from disk instead of re-indexing. Additionally, there was a question about making graph queries faster.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LlamaIndex repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding, and we look forward to hearing from you soon!

DanBruckner commented 1 year ago

We are still having the same issue with the slow graph response times. Even with just a couple indexes, it still takes 30 - 40 seconds to get a response from a graph query. Any hope that this is a priority to make the graph query responses quicker?

dosubot[bot] commented 1 year ago

@logan-markewich Could you please help @DanBruckner with the slow graph response times? They are still experiencing the issue even with a couple of indexes, and it takes 30-40 seconds to get a response from a graph query. Thank you!

DanBruckner commented 1 year ago

Yes, @logan-markewich - any help would be much appreciated!

logan-markewich commented 1 year ago

@DanBruckner for more context, can you share a bit more of your setup? What are you running exactly?

DanBruckner commented 1 year ago

Hey @logan-markewich,

Thanks for your help with this. I've passed the link to this thread to our lead AI developer Yuvan Sharma and asked him to respond with the technical details so that you know our setup.

He should be responding here shortly.

yuvansharma commented 1 year ago

@logan-markewich We've been creating indices for a few documents, and creating a simple keyword graph over even 2-3 of these indices leads to a long response time. As a result, we've resorted to using RouterQueryEngine as of now, but our use case is probably more suitable for a graph.

yuvansharma commented 1 year ago

For instance, I just made a graph over only 2 indices for two different phones. A simple comparison query took around 630 seconds to return a response.

logan-markewich commented 1 year ago

@yuvansharma that seems pretty extreme. Would be nice to have a reproducible case.

Tbh though, the composable graph is pretty much deprecated/unmaintained. I would point to other features instead (sub question engine, router engine, retriever router, agents)

DanBruckner commented 1 year ago

Hi @logan-markewich - Thanks for your thoughts. We do have a reproducible case if you'd like to jump on a quick meeting sometime. That one is longer than what we've seen in the past, but they've all been a couple of minutes at least...

That's a bummer to hear that the composable graph is unmaintained. We'd definitely have a use for it if it worked efficiently and quickly. We'll take a look at the sub question engine, retriever router and agents, though.

If you do start to work on the graph, please let @yuvansharma and myself know.

Thanks!

dosubot[bot] commented 10 months ago

Hi, @batou069,

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. From what I understand, you are seeking advice on structuring a graph for a Slack bot that answers product-related questions from multiple sources. There have been discussions and suggestions around high-level structure, index types, query configurations, text chunking, and making the bot future-proof. Additionally, there are discussions about slow graph response times and the potential deprecation of the composable graph feature.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and cooperation. If you have any further questions or need assistance, feel free to reach out.