Open terilias opened 3 days ago
The issue you're encountering with inserting documents with metadata into the Weaviate Vector Store likely stems from how metadata is handled during the insertion process. Here are some steps and suggestions to help debug and resolve this issue:
Check Metadata Conversion:
Ensure that the metadata is correctly converted and structured before insertion. The node_to_metadata_dict function is responsible for this conversion. If the metadata structure is invalid, it could cause the insertion to fail silently.
def node_to_metadata_dict(
    node: BaseNode,
    remove_text: bool = False,
    text_field: str = DEFAULT_TEXT_KEY,
    flat_metadata: bool = False,
) -> Dict[str, Any]:
    """Common logic for saving Node data into metadata dict."""
    node_dict = node.dict()
    metadata: Dict[str, Any] = node_dict.get("metadata", {})
    if flat_metadata:
        _validate_is_flat_dict(metadata)
    # store entire node as json string - some minor text duplication
    if remove_text:
        node_dict[text_field] = ""
    # remove embedding from node_dict
    node_dict["embedding"] = None
    # dump remainder of node_dict to json string
    metadata["_node_content"] = json.dumps(node_dict)
    metadata["_node_type"] = node.class_name()
    # store ref doc id at top level to allow metadata filtering
    metadata["document_id"] = node.ref_doc_id or "None"  # for Chroma
    metadata["doc_id"] = node.ref_doc_id or "None"  # for Pinecone, Qdrant, Redis
    metadata["ref_doc_id"] = node.ref_doc_id or "None"  # for Weaviate
    return metadata
Add Logging for Debugging: Add logging to check the contents of the metadata before insertion. This will help identify any issues with the metadata that could be causing the silent failure.
import logging

def add_node(
    client: "WeaviateClient",
    node: BaseNode,
    class_name: str,
    batch: Optional[Any] = None,
    text_key: str = DEFAULT_TEXT_KEY,
) -> None:
    """Add node."""
    metadata = {}
    metadata[text_key] = node.get_content(metadata_mode=MetadataMode.NONE) or ""
    additional_metadata = node_to_metadata_dict(
        node, remove_text=True, flat_metadata=False
    )
    metadata.update(additional_metadata)
    vector = node.get_embedding()
    id = node.node_id
    # Log metadata for debugging
    logging.debug(f"Inserting node with metadata: {metadata}")
    # if batch object is provided (via a context manager), use that instead
    if batch is not None:
        batch.add_object(
            properties=metadata, collection=class_name, uuid=id, vector=vector
        )
    else:
        client.collections.get(class_name).data.insert(
            properties=metadata, uuid=id, vector=vector
        )
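Note that logging.debug output is dropped unless the DEBUG level is enabled, since the default threshold is WARNING. A minimal sketch of enabling it, assuming a standalone script or notebook; the logger name weaviate_debug and the sample metadata are hypothetical and only serve the illustration:

```python
import io
import logging

# Hypothetical logger name; any name works.
logger = logging.getLogger("weaviate_debug")
# Without this, debug() calls are dropped (the effective default is WARNING).
logger.setLevel(logging.DEBUG)

# Write to an in-memory stream so the output can be inspected afterwards.
# logging.StreamHandler() with no argument would write to stderr instead.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)

# Sample metadata, for illustration only.
metadata = {"title of parental document": "Regression analysis"}
logger.debug("Inserting node with metadata: %s", metadata)

log_output = stream.getvalue()
```

In a notebook, calling logging.basicConfig(level=logging.DEBUG) before the insertion is usually the shortest way to get the same effect for the root logger.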
Ensure Metadata Conforms to Weaviate's Requirements: The metadata dictionary must be flat, with keys as strings and values as one of the types: string, integer, float, or None. Ensure that your metadata conforms to these requirements.
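A minimal sketch of such a check, assuming the flat-dictionary rule exactly as stated in this step. find_non_flat_metadata is a hypothetical helper, not part of llama-index or the Weaviate client:

```python
from typing import Any, Dict, List

# Value types allowed by the rule stated above (assumption: nothing else is accepted).
ALLOWED_TYPES = (str, int, float, type(None))

def find_non_flat_metadata(metadata: Dict[Any, Any]) -> List[str]:
    """Return a description of every entry that breaks the flat-metadata rule."""
    problems: List[str] = []
    for key, value in metadata.items():
        if not isinstance(key, str):
            problems.append(f"key {key!r} is not a string")
        if not isinstance(value, ALLOWED_TYPES):
            problems.append(
                f"value for {key!r} has unsupported type {type(value).__name__}"
            )
    return problems
```

An empty list means the dictionary passed the check. Note that bool values pass as well, because bool is a subclass of int in Python.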
Check for Schema Existence: Ensure that the schema for the class exists before inserting documents. If the schema does not exist, it needs to be created.
def class_schema_exists(client: Any, class_name: str) -> bool:
    """Check if class schema exists."""
    validate_client(client)
    return client.collections.exists(class_name)

def create_default_schema(client: Any, class_name: str) -> None:
    """Create default schema."""
    validate_client(client)
    class_schema = {
        "class": class_name,
        "description": f"Class for {class_name}",
        "properties": NODE_SCHEMA,
    }
    client.collections.create_from_dict(class_schema)
By following these steps, you should be able to identify and resolve the issue with inserting documents with metadata into the Weaviate Vector Store. If the problem persists, further investigation into the specific metadata structure and Weaviate's handling of metadata may be necessary [1][2].
The issue here is the spaces in the property name:
"title of parental document"
This will work:
llama_index_doc = Document(
    doc_id='12345',
    text=text,
    metadata={
        "title_of_parental_document": "Regression analysis",
        "source": "https://en.wikipedia.org/wiki/Regression_analysis",
    },
)
The issue with that code is the name of one of the metadata keys. It does not adhere to what Weaviate expects for property names: https://weaviate.io/developers/weaviate/config-refs/schema#properties
Property names must match this pattern: /[_A-Za-z][_0-9A-Za-z]*/
I believe we could add something here that normalizes the property name so that it conforms to what Weaviate supports.
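One possible shape for such a normalization, as a sketch only. normalize_property_name is a hypothetical helper, not an existing llama-index function, and it assumes the pattern given above is the complete rule:

```python
import re

# Pattern taken from the Weaviate schema reference linked above.
VALID_PROPERTY_NAME = re.compile(r"[_A-Za-z][_0-9A-Za-z]*")

def normalize_property_name(name: str) -> str:
    """Replace unsupported characters with underscores (hypothetical helper)."""
    normalized = re.sub(r"[^_0-9A-Za-z]", "_", name)
    # A valid name may not be empty and may not start with a digit.
    if not normalized or normalized[0].isdigit():
        normalized = "_" + normalized
    return normalized

print(normalize_property_name("title of parental document"))
```

One caveat with this design: two different keys, such as "a b" and "a_b", collapse to the same property name, so silent normalization can overwrite metadata unless collisions are detected.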
Thank you so much @dudanogueira! Now it works perfectly! My initial thought was to just raise an exception when a name does not follow Weaviate's rules, similar to the implementation that checks the collection's name (I am referring to this code). I think that would be enough, since it would make clear to the user what the name must look like. But of course you know what is better here.
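The exception-based alternative could look roughly like this. validate_property_names is a hypothetical name, and the sketch assumes the property-name pattern from the Weaviate schema reference is the complete rule:

```python
import re
from typing import Any, Dict

# Assumed rule for Weaviate property names.
PROPERTY_NAME_PATTERN = re.compile(r"[_A-Za-z][_0-9A-Za-z]*")

def validate_property_names(metadata: Dict[str, Any]) -> None:
    """Raise ValueError if any metadata key is not a valid property name."""
    invalid = [
        key for key in metadata
        if not PROPERTY_NAME_PATTERN.fullmatch(str(key))
    ]
    if invalid:
        raise ValueError(
            f"Invalid Weaviate property name(s): {invalid}. "
            f"Names must match /{PROPERTY_NAME_PATTERN.pattern}/."
        )
```

Failing loudly before insertion keeps the user's keys unchanged and replaces the silent failure with an actionable error message.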
Bug Description
Hello, while trying to use the Weaviate Vector Store, I found that a Document with metadata is not actually inserted into the vector store. No exception or warning is raised, so you can detect the failure only by printing the contents of the vector store or by using a retriever.
If the Document does not contain metadata, the insertion completes and the retriever can search the document's chunks. We talked with @logan-markewich on Discord, and the issue is probably connected with Weaviate issue #5202.
Version
0.10.51 (llama-index), 1.0.0 (llama-index-vector-stores-weaviate), 4.6.5 (weaviate-client)
Steps to Reproduce
Create a Weaviate vector store index and then try to insert a document with metadata and one without metadata. Then use a retriever to retrieve the nodes or use the Weaviate method for listing the collection's contents and check if the ones from the document with metadata are included in the results. The following Python code is extracted from a Jupyter Notebook to showcase the steps to reproduce.