Open Lu-Chengyu opened 1 month ago
Hey @Lu-Chengyu! I'm here to help you with any coding issues or questions you have. If you need assistance, feel free to ask!
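To process the images and JSON so that each image is associated with its relevant context, and to generate a comprehensive index with the create_piazza_index function, follow these steps:
1. Download and prepare the images.
2. Extract structured outputs from the images.
3. Convert the structured outputs to TextNode objects.
4. Index the nodes in a vector store.
5. Define an auto-retriever.
Here is the code to achieve this: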
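# Step 1: Download and Prepare Images
# Note: get_image_files, aprocess_image_files, and get_nodes_from_objs are not imported or defined in this snippet; define them before running it.
image_files = get_image_files("path/to/your/images", sample=100)
# Step 2: Extract Structured Outputs
# Top-level await works in a notebook; in a script, call this from an async function run with asyncio.run().
outputs = await aprocess_image_files(image_files)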
# Step 3: Convert Structured Outputs to Nodes
nodes = get_nodes_from_objs(outputs, image_files)
# Step 4: Index Nodes in Vector Stores
import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.core import Settings
# Create a local Qdrant vector store
client = qdrant_client.QdrantClient(path="qdrant_gemini")
vector_store = QdrantVectorStore(client=client, collection_name="collection")
# Global settings
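# GOOGLE_API_KEY must be set before this point
Settings.embed_model = GeminiEmbedding(
    model_name="models/embedding-001", api_key=GOOGLE_API_KEY
)
# Assign the LLM instance itself, not a one-element tuple
Settings.llm = Gemini(api_key=GOOGLE_API_KEY)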
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
)
# Step 5: Define Auto-Retriever
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo
vector_store_info = VectorStoreInfo(
    content_info="Receipts",
    metadata_info=[
        MetadataInfo(
            name="company",
            description="The name of the store",
            type="string",
        ),
        MetadataInfo(
            name="address",
            description="The address of the store",
            type="string",
        ),
        MetadataInfo(
            name="date",
            description="The date of the purchase (in DD/MM/YYYY format)",
            type="string",
        ),
        MetadataInfo(
            name="total",
            description="The final amount",
            type="float",
        ),
        MetadataInfo(
            name="currency",
            description="The currency of the country the purchase was made (abbreviation)",
            type="string",
        ),
    ],
)
from llama_index.core.retrievers import VectorIndexAutoRetriever
retriever = VectorIndexAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    similarity_top_k=2,
    empty_query_top_k=10,  # if only metadata filters are specified, this is the limit
    verbose=True,
)
This code will process the images, extract structured outputs, convert them to nodes, index them, and set up an auto-retriever for querying the indexed data [1].
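Step 3 relies on a get_nodes_from_objs helper that the snippet does not define. Below is a minimal sketch of the pairing logic such a helper would need, assuming each structured output is a dict and that the outputs come back in the same order as the image files. It builds plain dictionaries so it runs without llama_index; the function name pair_outputs_with_images and the img_path metadata key are hypothetical, not part of any library.

```python
from pathlib import Path
from typing import Any


def pair_outputs_with_images(
    outputs: list[dict[str, Any]], image_files: list[Path]
) -> list[dict[str, Any]]:
    """Pair each structured output with the image it was extracted from.

    Assumes outputs[i] was produced from image_files[i].
    """
    if len(outputs) != len(image_files):
        raise ValueError("outputs and image_files must have the same length")

    records = []
    for image_file, output in zip(image_files, outputs):
        metadata = dict(output)  # copy so the caller's dict is not mutated
        metadata["img_path"] = str(image_file)  # hypothetical key name
        records.append(
            {"text": f"File Name: {image_file.name}", "metadata": metadata}
        )
    return records


records = pair_outputs_with_images(
    [{"company": "ACME", "total": 12.5}],
    [Path("receipts/0001.jpg")],
)
```

Each record's text and metadata could then be passed to a TextNode, which is the form the index in Step 4 expects.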
Question
def create_piazza_index(json_file_path, index_folder, levels_back=None, collapse_length=None, is_jsonl=False, clean_json=True):
    # Initialize the JSONReader with the desired parameters
I need to create a Q&A bot, and I have some previous Q&A data that includes both images and text. Previously, I only extracted the text and stored it in JSON format, using JSONReader to generate the document. Now, I want to include the images, so I have downloaded the images locally and added the image paths to the corresponding Q&A positions in the JSON. How should I process the images and JSON to associate the images with their relevant context in the JSON and generate a comprehensive index?
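One way to keep each image tied to its Q&A context is to carry the image paths as metadata on the record built from that Q&A entry. The sketch below assumes a hypothetical JSON layout: a top-level list in which every entry has question and answer fields and an optional image_paths list. The real field names in the exported data may differ. It uses only the standard library and stops short of building the index.

```python
import json
from pathlib import Path


def load_qa_records(json_file_path: str, image_root: str = ".") -> list[dict]:
    """Read Q&A entries and attach the image paths that exist on disk.

    The "question", "answer", and "image_paths" field names are assumptions
    about the JSON layout, not something the original data guarantees.
    """
    with open(json_file_path, encoding="utf-8") as f:
        entries = json.load(f)

    records = []
    for i, entry in enumerate(entries):
        candidates = [Path(image_root) / p for p in entry.get("image_paths", [])]
        # Keep only paths that point at real files, so broken links are dropped
        existing = [str(p) for p in candidates if p.is_file()]
        text = f"Q: {entry['question']}\nA: {entry['answer']}"
        records.append(
            {"text": text, "metadata": {"qa_id": i, "image_paths": existing}}
        )
    return records
```

Because the image paths travel in the same record as the question and answer text, anything retrieved for a query still knows which local images belong to it.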