pulkitmehtaworkmetacube opened 1 week ago
To split the process of chunking documents, embedding them, and inserting them into a vector database into separate components, you can follow these steps:
**Load the Data:** Use a reader to load your data into `Document` objects.

```python
from llama_index.core import SimpleDirectoryReader

loader = SimpleDirectoryReader(input_dir="./data")
documents = loader.load_data()
```
**Chunk the Data:** Use a text splitter to chunk the documents into smaller pieces (Nodes).

```python
from llama_index.core.node_parser import SentenceSplitter

text_splitter = SentenceSplitter()
nodes = text_splitter.get_nodes_from_documents(documents)
```
**Embed the Data:** Apply embeddings to the nodes.

```python
from llama_index.embeddings.openai import OpenAIEmbedding

embedding_model = OpenAIEmbedding()
for node in nodes:
    node.embedding = embedding_model.get_text_embedding(node.text)
```
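Embedding one node per request can dominate the total time; batching texts into fewer, larger requests is usually much faster. As an illustration of the batching logic in plain Python (the batch size of 10 here is an arbitrary example, not a recommended value):

```python
from typing import List, Sequence

def batched(items: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split a sequence of texts into consecutive batches of at most batch_size."""
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]

# Illustration: 25 node texts grouped into batches of 10.
texts = [f"chunk {i}" for i in range(25)]
batches = batched(texts, 10)
print([len(b) for b in batches])  # → [10, 10, 5]
```

llama_index embedding models also expose a batch method, `get_text_embedding_batch`, which you could apply to `[n.text for n in nodes]` instead of looping one node at a time.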
**Insert into Vector Database:** Create a vector store and insert the nodes. Note that `QdrantVectorStore.add` accepts a list of nodes, so no loop is needed.

```python
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="test_store")
vector_store.add(nodes)
```
Alternatively, you can streamline the process using the `IngestionPipeline` class, which applies the specified transformations to the documents and stores the results in a vector store:

```python
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
import qdrant_client

client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="test_store")

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
)

_nodes = pipeline.run(documents=documents, num_workers=4)
```
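Since the original goal is benchmarking, once the stages are split out you can time each one independently. A minimal, library-agnostic sketch using `time.perf_counter` (the lambda stages below are stand-ins; substitute your real load/chunk/embed/insert calls):

```python
import time
from typing import Callable, Dict

def time_stage(name: str, fn: Callable[[], object], timings: Dict[str, float]):
    """Run one pipeline stage and record its wall-clock duration in seconds."""
    start = time.perf_counter()
    result = fn()
    timings[name] = time.perf_counter() - start
    return result

timings: Dict[str, float] = {}
# Stand-in stages; replace each lambda with the real call, e.g.
# lambda: text_splitter.get_nodes_from_documents(documents)
docs = time_stage("load", lambda: ["doc"] * 3, timings)
nodes = time_stage("chunk", lambda: [d + "-chunk" for d in docs], timings)
_ = time_stage("embed", lambda: [[0.0] * 4 for _ in nodes], timings)

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.4f}s")
```

`perf_counter` is a monotonic clock, so each recorded duration is a valid per-stage measurement even if the system clock changes mid-run.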
By following these steps, you can modularize the process into separate components for loading, chunking, embedding, and indexing. Alternatively, the `IngestionPipeline` class streamlines the same flow, making it more efficient and easier to manage [1][2][3][4].
Question
Hi everyone, I am benchmarking the time it takes to chunk and embed documents and then insert them into a vector database.
The following code does everything in one shot. How can I split it into separate components? Please guide.

```python
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)
```