run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Subdividing nodes for data ingestion #11990

Closed GEN200J closed 5 months ago

GEN200J commented 8 months ago


Question

I want to subdivide nodes after creating them. Here is the code:

```python
# Write the API keys to a .env file
dotenv_path = (
    "/home/panther2401/AI_phonecalls/gpt4-pdf-chatbot-langchain/jurisprudencias/.env"
    # Google Colab will not let you open a .env, but you can set one this way
)

with open(dotenv_path, "w") as f:
    f.write('PINECONE_API_KEY="x"\n')
    f.write('PINECONE_ENVIRONMENT="gcp-starter"\n')
    f.write('OPENAI_API_KEY="x"\n')

# Read and print the contents of the .env file
with open(dotenv_path, "r") as f:
    for line in f:
        print(line.strip())

import os

from dotenv import load_dotenv

load_dotenv(dotenv_path=dotenv_path)

from pinecone import Pinecone, PodSpec

api_key = "x"
environment = os.environ["PINECONE_ENVIRONMENT"]
pc = Pinecone(api_key=api_key, environment=environment)

index_name = "llamaindex-rag-fs"

pc.create_index(
    index_name,
    dimension=1024,
    metric="cosine",
    spec=PodSpec(environment="gcp-starter", pod_type="p1.x1", pods=1),
)

pinecone_index = pc.Index(index_name)

from llama_index.vector_stores.pinecone import PineconeVectorStore

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

# Open the source PDF with PyMuPDF
import fitz

file_path = "x"
doc = fitz.open(file_path)

from llama_index.core.node_parser import SentenceSplitter

text_parser = SentenceSplitter(
    chunk_size=1024,
    separator=" ",
)

text_chunks = []
# maintain relationship with source doc index, to help inject doc metadata in (3)
doc_idxs = []
for doc_idx, page in enumerate(doc):
    page_text = page.get_text("text")
    cur_text_chunks = text_parser.split_text(page_text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))

# Wrap each chunk in a TextNode, keeping track of its source page
from llama_index.core.schema import TextNode

nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(text=text_chunk)
    src_doc_idx = doc_idxs[idx]
    src_page = doc[src_doc_idx]
    nodes.append(node)
```

The code was copy-pasted from Google Colab, so it isn't very neat, as you can see. What I wanted to do is subdivide those nodes so that the following happens:

> **Subdivide into Sub-chunks**
>
> Further Segmentation: Divide these nodes into even smaller sub-chunks. This step increases granularity, enabling more precise matching between the user's query and the document content. The finer the segmentation, the higher the potential for accurately identifying the most relevant information.

How can I do this so that I can then finish the data ingestion to Pinecone? Is there a llama-index docs page that can help?

Thanks for taking the time to read this.

dosubot[bot] commented 8 months ago

Hey there, @GEN200J! Nice to see you diving into some interesting challenges again. 😊

To subdivide nodes into smaller sub-chunks for more precise matching during data ingestion to Pinecone with the LlamaIndex framework, you can use the text splitting utilities in llama_index/core/node_parser/text/utils.py to build custom text splitters that match your granularity requirements. Here's how:

  1. Define a Custom Text Splitter: Depending on the granularity you need, choose an appropriate method to split your text. For instance, if you need to split text into sentences or phrases for more precise matching, you can use split_by_sentence_tokenizer or split_by_phrase_regex respectively.

  2. Implement the Custom Splitter in Your Ingestion Process: Once you have defined your custom text splitter, integrate it into your data ingestion process. This involves applying the splitter to your text data before sending it to Pinecone.

  3. Adjust Chunk Size if Necessary: If you're dealing with very large texts or documents, you might also need to truncate or further subdivide the text into smaller chunks that fit within Pinecone's size limits. You can use the truncate_text function for this purpose (see the sketch just after this list), modifying it as necessary to accommodate your chunk size requirements.
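For step 3, here is a minimal sketch of truncation, assuming `truncate_text` takes the text plus a text splitter and keeps only what fits in the splitter's first chunk (the 512-token chunk size is just an illustrative value):

```python
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.node_parser.text.utils import truncate_text

# Stand-in for an oversized page or chunk from your pipeline
long_text = "This sentence repeats to simulate a very long document. " * 500

# Everything past the first 512-token chunk boundary is dropped
splitter = SentenceSplitter(chunk_size=512)
short_text = truncate_text(long_text, splitter)
```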

Here's an example of how you might implement a custom splitter using the split_by_sentence_tokenizer for more granular text matching:

```python
from llama_index.core.node_parser.text.utils import split_by_sentence_tokenizer

# Define your custom text splitter
custom_text_splitter = split_by_sentence_tokenizer()

# Example text
text = "This is the first sentence. Here's the second sentence. And this is the third."

# Split the text using the custom splitter
sub_chunks = custom_text_splitter(text)

# Now, sub_chunks contains each sentence as a separate element:
# ['This is the first sentence.', "Here's the second sentence.", 'And this is the third.']

# You can now ingest these sub_chunks into Pinecone for more precise matching
```

This approach allows you to tailor the granularity of your text splitting to the specific needs of your application, improving the precision of matching in Pinecone.
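Building on that, here is a rough sketch of step 2, tying the sub-chunking into the ingestion script from the question. It assumes the `nodes` and `vector_store` variables defined above and swaps in OpenAI embeddings (via `OpenAIEmbedding`) as the embedding model; treat it as a starting point, not the one canonical way to do this:

```python
from llama_index.core.node_parser.text.utils import split_by_sentence_tokenizer
from llama_index.core.schema import TextNode
from llama_index.embeddings.openai import OpenAIEmbedding

split_fn = split_by_sentence_tokenizer()
embed_model = OpenAIEmbedding()

# Subdivide each existing node's text into sentence-level sub-nodes,
# carrying the parent node's metadata over to each sub-node
sub_nodes = []
for node in nodes:  # `nodes` from the question's ingestion script
    for sub_text in split_fn(node.get_content()):
        sub_nodes.append(TextNode(text=sub_text, metadata=dict(node.metadata)))

# Embed each sub-node, then add the batch to the Pinecone-backed store
for sub_node in sub_nodes:
    sub_node.embedding = embed_model.get_text_embedding(
        sub_node.get_content(metadata_mode="all")
    )
vector_store.add(sub_nodes)
```

One thing to watch: the index above was created with `dimension=1024`, and Pinecone rejects vectors whose dimension does not match the index, so whatever embedding model you use here must produce vectors of exactly that size.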

For more information, you can refer to the llama-index-core/llama_index/core/node_parser/text/utils.py file in the LlamaIndex repository.

Regarding the maximum size limit for data ingestion in Pinecone using the LlamaIndex framework, I wasn't able to find an answer within the repository. You might want to check Pinecone's official documentation or reach out to their support for this information.

The SentenceSplitter class in the LlamaIndex framework is designed to split text into smaller chunks or sentences based on the parameters you set at initialization. It can be customized for more granular splitting by lowering those parameters or by supplying a tokenizer with more sophisticated splitting logic; see the sketch below. The llama-index-core/tests/text_splitter/test_sentence_splitter.py file has more detail on how to use and customize this class.
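For example, here is a minimal two-pass sketch that reuses `SentenceSplitter` with a smaller chunk size to subdivide the 1024-token chunks from the question into finer sub-chunks (the 256/20 values are illustrative, not a recommendation):

```python
from llama_index.core.node_parser import SentenceSplitter

# Stand-in for the 1024-token chunks produced by the question's first pass
text_chunks = ["This is one sentence. Here is another one. " * 100]

# Second-pass splitter with a smaller chunk size for finer granularity
sub_splitter = SentenceSplitter(chunk_size=256, chunk_overlap=20)

sub_chunks = []
for chunk in text_chunks:
    sub_chunks.extend(sub_splitter.split_text(chunk))

print(f"{len(text_chunks)} chunk(s) -> {len(sub_chunks)} sub-chunk(s)")
```

Each sub-chunk can then be wrapped in a `TextNode` and ingested exactly as in the sketch above.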

I hope this helps! If you have any more questions or need further clarification, feel free to ask.

