prabhuperumalofficial opened 1 month ago
To resolve the "PropertyValueTooLarge" error when trying to index large files using Azure Table Storage, you need to split the large property value into multiple parts. Here is a function, `split_large_property_value`, that can help with this:
```python
# STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE and STORAGE_PART_KEY_DELIMITER
# come from llama_index.utils.azure.table.
def split_large_property_value(num_parts: int, bytes_val: bytes, key: str) -> dict:
    """Split a large property value into multiple parts."""
    parts = {}
    for i in range(num_parts):
        start_index = i * STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE
        end_index = start_index + STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE
        # Convert back from UTF-16 bytes to str after slicing safely on character boundaries
        serialized_part = bytes_val[start_index:end_index].decode(
            "utf-16", errors="ignore"
        )
        parts[f"{key}{STORAGE_PART_KEY_DELIMITER}{i + 1}"] = serialized_part
    return parts
```
To use this function, pair it with the serialization helpers that produce its inputs. Here is an example of how you can do this:
```python
from typing import Any

def serialize_and_split_large_property(key: str, value: Any) -> dict:
    # serialize_and_encode and compute_table_property_part_count are helpers
    # from llama_index.utils.azure.table (shown later in this thread).
    serialized_val, bytes_val, val_length = serialize_and_encode(value)
    num_parts = compute_table_property_part_count(val_length)
    return split_large_property_value(num_parts, bytes_val, key)

# Example usage
large_value = "..."  # Your large value here
key = "largeProperty"
split_parts = serialize_and_split_large_property(key, large_value)
# Now you can store `split_parts` in Azure Table Storage
```
Additionally, if you are using the `AzureDocumentStore` from the LlamaIndex framework, it already splits nodes across multiple properties to stay within Azure Table Storage limits. This is particularly useful for large documents, as it lets you store data up to the 1 MiB entity ("Item") limit by spreading the data across multiple properties [1][2].
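For intuition, here is a minimal, self-contained sketch of that split-and-rejoin idea. The tiny `MAX_PART` size and the `_part_` delimiter are illustrative stand-ins, not the library's actual constants:

```python
MAX_PART = 8  # tiny on purpose, so the example is readable
DELIMITER = "_part_"

def split(key: str, value: str) -> dict:
    # Chunk the value and store each chunk under a numbered property name.
    return {
        f"{key}{DELIMITER}{i + 1}": value[start : start + MAX_PART]
        for i, start in enumerate(range(0, len(value), MAX_PART))
    }

def rejoin(key: str, entity: dict) -> str:
    # Collect the numbered parts and concatenate them in order.
    part_keys = sorted(
        (k for k in entity if k.startswith(f"{key}{DELIMITER}")),
        key=lambda k: int(k.rsplit(DELIMITER, 1)[1]),
    )
    return "".join(entity[k] for k in part_keys)

entity = split("largeProperty", "0123456789abcdefghij")
assert rejoin("largeProperty", entity) == "0123456789abcdefghij"
print(entity)  # {'largeProperty_part_1': '01234567', ...}
```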
@dosu
As in the original message, we are using AzureDocumentStore from the llama_index.storage.docstore.azure package:
```python
storage_context = StorageContext.from_defaults(
    docstore=AzureDocumentStore.from_connection_string(
        constants.STORAGE_ACCOUNT_CONNECTION_STRING,
        service_mode=ServiceMode.STORAGE,
        partition_key=service_name,
    ),
    index_store=AzureIndexStore.from_connection_string(
        constants.STORAGE_ACCOUNT_CONNECTION_STRING,
        service_mode=ServiceMode.STORAGE,
        partition_key=service_name,
    ),
)
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(nodes, storage_context=storage_context, show_progress=True)
index.storage_context.persist()
```
Getting the error on this line: `index = VectorStoreIndex(nodes, storage_context=storage_context, show_progress=True)`
The `AzureDocumentStore` class in the `llama_index.storage.docstore.azure` package is designed to handle large documents by splitting them across multiple properties to stay within Azure Table Storage limits. This should help avoid the "PropertyValueTooLarge" error.
Here is a relevant snippet from the `AzureDocumentStore` class that shows how it handles large documents:
```python
async def async_add_documents(
    self,
    nodes: Sequence[BaseNode],
    allow_update: bool = True,
    batch_size: Optional[int] = None,
    store_text: bool = True,
) -> None:
    """Add documents to the store."""
    batch_size = batch_size or self._batch_size
    (
        node_kv_pairs,
        metadata_kv_pairs,
        ref_doc_kv_pairs,
    ) = await super()._async_prepare_kv_pairs(nodes, allow_update, store_text)

    # Change ref_doc_kv_pairs
    ref_doc_kv_pairs = self._extract_doc_metadatas(ref_doc_kv_pairs)

    await asyncio.gather(
        self._kvstore.aput_all(
            node_kv_pairs,
            collection=self._node_collection,
            batch_size=batch_size,
        ),
        self._kvstore.aput_all(
            metadata_kv_pairs,
            collection=self._metadata_collection,
            batch_size=batch_size,
        ),
        self._kvstore.aput_all(
            ref_doc_kv_pairs,
            collection=self._ref_doc_collection,
            batch_size=batch_size,
        ),
    )
```
This method writes documents through the underlying key-value store, which splits large values into multiple properties during serialization.
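For reference, a hedged sketch of driving this async path directly; `docstore` and `nodes` are assumed to be the `AzureDocumentStore` instance and node sequence set up elsewhere in this thread:

```python
import asyncio

# Hedged usage sketch; `docstore` and `nodes` are assumed to be the
# AzureDocumentStore instance and node sequence from the surrounding thread.
async def main() -> None:
    await docstore.async_add_documents(nodes, batch_size=10)

asyncio.run(main())
```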
Additionally, ensure that the `serialize` function in the `llama_index.utils.azure.table` module is correctly handling the serialization and splitting of large property values:
```python
def serialize(service_mode: ServiceMode, value: dict) -> dict:
    """
    Serialize all values in a dictionary to JSON strings to ensure compatibility
    with Azure Table Storage. The Azure Table Storage API does not support
    complex data types like dictionaries or nested objects directly as values in
    entity properties; they need to be serialized to JSON strings.
    """
    item = {}
    num_properties = len(value)
    size_properties = 0
    for key, val in value.items():
        # Serialize all values for the sake of size calculation
        serialized_val, bytes_val, val_length = serialize_and_encode(val)

        size_properties += val_length
        validate_table_total_property_size(service_mode, size_properties)

        # Skips serialization for non-enums and non-serializable types
        if not isinstance(val, Enum) and isinstance(val, NON_SERIALIZABLE_TYPES):
            item[key] = val
            continue

        # Unlike Azure Table Storage, Cosmos DB does not have per-property limits
        if service_mode != ServiceMode.STORAGE:
            continue

        # No need to split the property into parts
        if val_length < STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE:
            item[key] = serialized_val
            continue

        num_parts = compute_table_property_part_count(val_length)
        num_properties += num_parts
        validate_table_property_count(num_properties)

        parts = split_large_property_value(num_parts, bytes_val, key)
        item.update(parts)

    return item
```
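For intuition about the part-count arithmetic, here is a small worked example, assuming the 64 KiB per-property limit (i.e. `STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE = 65536`; the real constant lives in `llama_index.utils.azure.table`):

```python
STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE = 65536  # assumed value: 64 KiB per property

def compute_table_property_part_count(val_length: int) -> int:
    # Ceiling division: one extra part for any remainder below the limit.
    return val_length // STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE + (
        1 if val_length % STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE else 0
    )

print(compute_table_property_part_count(150_000))  # 2 full parts + remainder -> 3
print(compute_table_property_part_count(65_536))   # exactly one full part -> 1
```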
Ensure that your `AzureDocumentStore` and `AzureIndexStore` are correctly configured to handle large documents by splitting them into smaller parts. This should help resolve the "PropertyValueTooLarge" error [1][2].
@dosu
How do we solve this issue with AzureDocumentStore? Are there any additional properties to be passed to handle it automatically?
To handle the "PropertyValueTooLarge" error automatically with `AzureDocumentStore`, ensure that it is initialized with `ServiceMode.STORAGE`. This mode splits large documents into smaller parts to stay within Azure Table Storage limits.
Here is how you can initialize the `AzureDocumentStore` with `ServiceMode.STORAGE`:
```python
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.storage.docstore.azure import AzureDocumentStore
from llama_index.storage.index_store.azure import AzureIndexStore
from llama_index.storage.kvstore.azure.base import ServiceMode

connection_string = constants.STORAGE_ACCOUNT_CONNECTION_STRING
namespace = service_name

azure_document_store = AzureDocumentStore.from_connection_string(
    connection_string=connection_string,
    namespace=namespace,
    service_mode=ServiceMode.STORAGE,
)

storage_context = StorageContext.from_defaults(
    docstore=azure_document_store,
    index_store=AzureIndexStore.from_connection_string(
        connection_string=connection_string,
        service_mode=ServiceMode.STORAGE,
        partition_key=service_name,
    ),
)

storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(nodes, storage_context=storage_context, show_progress=True)
index.storage_context.persist()
```
By setting `service_mode=ServiceMode.STORAGE`, the `AzureDocumentStore` will split the data across multiple properties, thus avoiding the `PropertyValueTooLarge` error [1][2].
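If you want to confirm the splitting actually happened, one hedged way is to inspect the property names stored in the table directly. The table name below is a placeholder, so check your storage account for the table the docstore created:

```python
# Hedged verification sketch: list one entity's property names and look for
# numbered part suffixes on large values.
from azure.data.tables import TableClient

table_client = TableClient.from_connection_string(
    constants.STORAGE_ACCOUNT_CONNECTION_STRING,
    table_name="<your docstore table>",  # hypothetical; look it up in the account
)
entity = next(iter(table_client.list_entities()))
print(sorted(entity.keys()))  # split values should show up as key_part_1, key_part_2, ...
```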
@dosu
Even after using the above code, I get the same results.
Can someone else help me here?
Code:
```python
storage_context = StorageContext.from_defaults(
    docstore=AzureDocumentStore.from_connection_string(
        constants.STORAGE_ACCOUNT_CONNECTION_STRING,
        service_mode=ServiceMode.STORAGE,
        namespace=service_name,
    ),
    index_store=AzureIndexStore.from_connection_string(
        constants.STORAGE_ACCOUNT_CONNECTION_STRING,
        service_mode=ServiceMode.STORAGE,
        partition_key=service_name,
    ),
)
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(nodes, storage_context=storage_context, show_progress=True)
index.storage_context.persist()
```
Stacktrace:
```
2024-08-09 23:43:00,806 - ERROR - The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.
RequestId:d97da7d5-3002-0041-1f87-ea3f6f000000
Time:2024-08-09T18:13:00.7082413Z
ErrorCode:PropertyValueTooLarge
Content: {"odata.error":{"code":"PropertyValueTooLarge","message":{"lang":"en-US","value":"The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.\nRequestId:d97da7d5-3002-0041-1f87-ea3f6f000000\nTime:2024-08-09T18:13:00.7082413Z"}}}
Traceback (most recent call last):
  File "D:\llamaindex\venv\Lib\site-packages\azure\data\tables\_table_client.py", line 658, in upsert_entity
    self._client.table.update_entity(
  File "D:\llamaindex\venv\Lib\site-packages\azure\core\tracing\decorator.py", line 78, in wrapper_use_tracer
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\llamaindex\venv\Lib\site-packages\azure\data\tables\_generated\operations\_operations.py", line 1280, in update_entity
    raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: Operation returned an invalid status 'Bad Request'
Content: {"odata.error":{"code":"PropertyValueTooLarge","message":{"lang":"en-US","value":"The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.\nRequestId:d97da7d5-3002-0041-1f87-ea3f6f000000\nTime:2024-08-09T18:13:00.7082413Z"}}}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\llamaindex\app\utils\helper.py", line 305, in persist_index_into_docstore
    index = VectorStoreIndex(nodes, storage_context=storage_context, show_progress=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\llamaindex\venv\Lib\site-packages\llama_index\core\indices\vector_store\base.py", line 78, in __init__
    super().__init__(
  File "D:\llamaindex\venv\Lib\site-packages\llama_index\core\indices\base.py", line 98, in __init__
    self._storage_context.index_store.add_index_struct(self._index_struct)
  File "D:\llamaindex\venv\Lib\site-packages\llama_index\core\storage\index_store\keyval_index_store.py", line 46, in add_index_struct
    self._kvstore.put(key, data, collection=self._collection)
  File "D:\llamaindex\venv\Lib\site-packages\llama_index\storage\kvstore\azure\base.py", line 189, in put
    table_client.upsert_entity(
  File "D:\llamaindex\venv\Lib\site-packages\azure\core\tracing\decorator.py", line 78, in wrapper_use_tracer
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\llamaindex\venv\Lib\site-packages\azure\data\tables\_table_client.py", line 672, in upsert_entity
    _process_table_error(error, table_name=self.table_name)
  File "D:\llamaindex\venv\Lib\site-packages\azure\data\tables\_error.py", line 210, in _process_table_error
    raise decoded_error from storage_error
azure.core.exceptions.HttpResponseError: The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.
RequestId:d97da7d5-3002-0041-1f87-ea3f6f000000
Time:2024-08-09T18:13:00.7082413Z
ErrorCode:PropertyValueTooLarge
Content: {"odata.error":{"code":"PropertyValueTooLarge","message":{"lang":"en-US","value":"The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.\nRequestId:d97da7d5-3002-0041-1f87-ea3f6f000000\nTime:2024-08-09T18:13:00.7082413Z"}}}
```
This is a tough nut to crack, could really use your help on this, @logan-markewich.
@logan-markewich any updates on this issue?
@dosu any help on this ticket would be appreciated
To resolve the "PropertyValueTooLarge" error when using `AzureDocumentStore` from the `llama_index.storage.docstore.azure` package, you need to ensure that large property values are split into smaller parts. Here is the relevant code that handles this:
```python
import json
from typing import Any, Tuple

# The constants referenced below (STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE,
# STORAGE_PART_KEY_DELIMITER, STORAGE_MAX_TOTAL_PROPERTIES_SIZE_BYTES,
# COSMOS_MAX_TOTAL_PROPERTIES_SIZE_BYTES) and ServiceMode come from
# llama_index.utils.azure.table and the Azure KV store module.

def serialize_and_encode(value: Any) -> Tuple[str, bytes, int]:
    serialized_val = json.dumps(value)
    bytes_val = serialized_val.encode("utf-16", errors="ignore")
    val_length = len(bytes_val)
    return serialized_val, bytes_val, val_length

def compute_table_property_part_count(val_length: int) -> int:
    return val_length // STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE + (
        1 if val_length % STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE else 0
    )

def split_large_property_value(num_parts: int, bytes_val: bytes, key: str) -> dict:
    parts = {}
    for i in range(num_parts):
        start_index = i * STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE
        end_index = start_index + STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE
        serialized_part = bytes_val[start_index:end_index].decode("utf-16", errors="ignore")
        parts[f"{key}{STORAGE_PART_KEY_DELIMITER}{i + 1}"] = serialized_part
    return parts

def validate_table_total_property_size(service_mode: ServiceMode, current_size: int) -> None:
    if (
        service_mode == ServiceMode.STORAGE
        and current_size > STORAGE_MAX_TOTAL_PROPERTIES_SIZE_BYTES
    ):
        raise ValueError(
            f"The total size of all properties in an Azure Table Storage Item "
            f"cannot exceed {STORAGE_MAX_TOTAL_PROPERTIES_SIZE_BYTES / 1048576}MiB.\n"
            "Consider splitting documents into smaller parts."
        )
    elif (
        service_mode == ServiceMode.COSMOS
        and current_size > COSMOS_MAX_TOTAL_PROPERTIES_SIZE_BYTES
    ):
        raise ValueError(
            f"The total size of all properties in an Azure Cosmos DB Item "
            f"cannot exceed {COSMOS_MAX_TOTAL_PROPERTIES_SIZE_BYTES / 1000000}MB.\n"
            "Consider splitting documents into smaller parts."
        )
```
By using these functions, you can split large property values into smaller parts to avoid the "PropertyValueTooLarge" error [1][2].
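To see these helpers work end to end, here is a hedged round-trip sketch; the constant values below are illustrative assumptions rather than the library's actual definitions:

```python
# Illustrative stand-ins for the library constants.
STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE = 65536  # 64 KiB per property
STORAGE_PART_KEY_DELIMITER = "_part_"         # assumed delimiter

large_value = {"text": "x" * 100_000}  # ~200 KB once UTF-16 encoded

serialized_val, bytes_val, val_length = serialize_and_encode(large_value)
num_parts = compute_table_property_part_count(val_length)
parts = split_large_property_value(num_parts, bytes_val, "largeProperty")

print(num_parts, sorted(parts))  # 4 properties: largeProperty_part_1 .. largeProperty_part_4
```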
Additionally, ensure that your `AzureDocumentStore` and `AzureIndexStore` are correctly configured to handle large documents by splitting them into smaller parts. This should help resolve the "PropertyValueTooLarge" error.
Question
We are trying to use Azure Table Storage to store indexes as below, but it throws the following error when trying to index a larger file:
```
2024-08-09 12:24:55,996 - ERROR - The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.
RequestId:5897b37c-a002-000e-5257-ea4e3b000000
Time:2024-08-09T12:24:55.9923447Z
ErrorCode:PropertyValueTooLarge
Content: {"odata.error":{"code":"PropertyValueTooLarge","message":{"lang":"en-US","value":"The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.\nRequestId:5897b37c-a002-000e-5257-ea4e3b000000\nTime:2024-08-09T12:24:55.9923447Z"}}}
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/azure/data/tables/_table_client.py", line 658, in upsert_entity
    self._client.table.update_entity(
  File "/usr/local/lib/python3.12/site-packages/azure/core/tracing/decorator.py", line 78, in wrapper_use_tracer
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/azure/data/tables/_generated/operations/_operations.py", line 1280, in update_entity
    raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: Operation returned an invalid status 'Bad Request'
Content: {"odata.error":{"code":"PropertyValueTooLarge","message":{"lang":"en-US","value":"The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.\nRequestId:5897b37c-a002-000e-5257-ea4e3b000000\nTime:2024-08-09T12:24:55.9923447Z"}}}
```