run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: Benchmarking time it takes to chunk , embed documents , insert into vector database #14421

Open pulkitmehtaworkmetacube opened 1 week ago

pulkitmehtaworkmetacube commented 1 week ago

Question Validation

Question

Hi everyone, I am benchmarking the time it takes to chunk and embed documents and then insert them into a vector database.

The following code does everything in one shot. How can I split it into separate components? Please guide.

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)

dosubot[bot] commented 1 week ago

To split the process of chunking, embedding documents, and inserting them into a vector database into separate components, you can follow these steps:

  1. Load the Data: Use a reader to load your data into Document objects.

    from llama_index.core import SimpleDirectoryReader
    
    loader = SimpleDirectoryReader(input_dir="./data")
    documents = loader.load_data()
  2. Chunk the Data: Use a text splitter to chunk the documents into smaller pieces (Nodes).

    from llama_index.core.node_parser import SentenceSplitter
    
    text_splitter = SentenceSplitter()
    nodes = text_splitter.get_nodes_from_documents(documents)
  3. Embed the Data: Apply embeddings to the nodes.

    from llama_index.embeddings.openai import OpenAIEmbedding
    
    embedding_model = OpenAIEmbedding()
    for node in nodes:
        node.embedding = embedding_model.get_text_embedding(node.get_content())
  4. Insert into Vector Database: Create a vector store and add the nodes.

    from llama_index.vector_stores.qdrant import QdrantVectorStore
    import qdrant_client
    
    client = qdrant_client.QdrantClient(location=":memory:")
    vector_store = QdrantVectorStore(client=client, collection_name="test_store")
    
    # add() takes a list of nodes (with embeddings already set) and inserts them in one call
    vector_store.add(nodes)
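
Putting the four steps above together, you can time each stage independently. Here is a minimal sketch, assuming the same ./data directory, an OpenAI API key in the environment, and an in-memory Qdrant instance; it uses Python's time.perf_counter and swaps the per-node embedding loop for the batched get_text_embedding_batch call, which is usually faster:

import time

import qdrant_client
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore

timings = {}

# 1. Load
start = time.perf_counter()
documents = SimpleDirectoryReader(input_dir="./data").load_data()
timings["load"] = time.perf_counter() - start

# 2. Chunk
start = time.perf_counter()
nodes = SentenceSplitter().get_nodes_from_documents(documents)
timings["chunk"] = time.perf_counter() - start

# 3. Embed (batched instead of one call per node)
embed_model = OpenAIEmbedding()
start = time.perf_counter()
embeddings = embed_model.get_text_embedding_batch([n.get_content() for n in nodes])
for node, embedding in zip(nodes, embeddings):
    node.embedding = embedding
timings["embed"] = time.perf_counter() - start

# 4. Insert into the vector store
client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="test_store")
start = time.perf_counter()
vector_store.add(nodes)
timings["insert"] = time.perf_counter() - start

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.2f}s")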

Alternatively, you can streamline the process using the IngestionPipeline class, which applies the specified transformations to the documents and stores the resulting nodes in a vector store:

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="test_store")

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
)
_nodes = pipeline.run(documents=documents, num_workers=4)
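
Once the vector store is populated (by either approach), you can build an index directly on top of it for querying. A small sketch, assuming the vector_store from the snippet above and an OpenAI key configured for the default LLM and embedding model:

from llama_index.core import VectorStoreIndex

# Wrap the already-populated vector store in an index; no re-ingestion happens here
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

query_engine = index.as_query_engine()
response = query_engine.query("What topics do these documents cover?")
print(response)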

By following these steps, you can modularize the process into separate components for loading, chunking, embedding, and indexing, or use the IngestionPipeline class as a streamlined alternative that is easier to manage [1][2][3][4].

To continue talking to Dosu, mention @dosu.