run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: Benchmarking time it takes to chunk , embed documents , insert into vector database #14421

Open pulkitmehtaworkmetacube opened 1 week ago

pulkitmehtaworkmetacube commented 1 week ago

Question Validation

Question

Hi everyone, I am benchmarking the time it takes to chunk and embed documents and then insert them into a vector database.

The following code does everything in one shot. How can I split it into separate components? Please guide.

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)

dosubot[bot] commented 1 week ago

To split the process of chunking, embedding documents, and inserting them into a vector database into separate components, you can follow these steps:

  1. Load the Data: Use a reader to load your data into Document objects.

    from llama_index.core import SimpleDirectoryReader
    
    loader = SimpleDirectoryReader(input_dir="./data")
    documents = loader.load_data()
  2. Chunk the Data: Use a text splitter to chunk the documents into smaller pieces (Nodes).

    from llama_index.core.node_parser import SentenceSplitter
    
    text_splitter = SentenceSplitter()
    nodes = text_splitter.get_nodes_from_documents(documents)
  3. Embed the Data: Apply embeddings to the nodes.

    from llama_index.embeddings.openai import OpenAIEmbedding
    
    embedding_model = OpenAIEmbedding()
    for node in nodes:
        node.embedding = embedding_model.get_text_embedding(node.get_content())
  4. Insert into Vector Database: Create a vector store and add the nodes.

    from llama_index.vector_stores.qdrant import QdrantVectorStore
    import qdrant_client
    
    client = qdrant_client.QdrantClient(location=":memory:")
    vector_store = QdrantVectorStore(client=client, collection_name="test_store")
    
    # add() takes a list of nodes (with embeddings already set) and inserts them in one call
    vector_store.add(nodes)
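
Putting the four steps above together, you can time each stage independently. Here is a minimal sketch, assuming the same ./data directory, an OpenAI API key in the environment, and an in-memory Qdrant instance; it uses Python's time.perf_counter and swaps the per-node embedding loop for the batched get_text_embedding_batch call, which is usually faster:

import time

import qdrant_client
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore

timings = {}

# 1. Load
start = time.perf_counter()
documents = SimpleDirectoryReader(input_dir="./data").load_data()
timings["load"] = time.perf_counter() - start

# 2. Chunk
start = time.perf_counter()
nodes = SentenceSplitter().get_nodes_from_documents(documents)
timings["chunk"] = time.perf_counter() - start

# 3. Embed (batched instead of one call per node)
embed_model = OpenAIEmbedding()
start = time.perf_counter()
embeddings = embed_model.get_text_embedding_batch([n.get_content() for n in nodes])
for node, embedding in zip(nodes, embeddings):
    node.embedding = embedding
timings["embed"] = time.perf_counter() - start

# 4. Insert into the vector store
client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="test_store")
start = time.perf_counter()
vector_store.add(nodes)
timings["insert"] = time.perf_counter() - start

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.2f}s")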

Alternatively, you can streamline the process using the IngestionPipeline class, which applies the specified transformations to the documents and stores the resulting nodes in a vector store:

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="test_store")

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
)
_nodes = pipeline.run(documents=documents, num_workers=4)
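
Once the vector store is populated (by either approach), you can build an index directly on top of it for querying. A small sketch, assuming the vector_store from the snippet above and an OpenAI key configured for the default LLM and embedding model:

from llama_index.core import VectorStoreIndex

# Wrap the already-populated vector store in an index; no re-ingestion happens here
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

query_engine = index.as_query_engine()
response = query_engine.query("What topics do these documents cover?")
print(response)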

By following these steps, you can modularize the process into separate components for loading, chunking, embedding, and indexing, or use the IngestionPipeline class as a streamlined alternative that is easier to manage [1][2][3][4].

To continue talking to Dosu, mention @dosu.