run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: How to use Azure Table Storage for indexing large files #15258

Open prabhuperumalofficial opened 1 month ago

prabhuperumalofficial commented 1 month ago

Question

We are trying to use Azure Table Storage to store indexes, as shown below:

        storage_context = StorageContext.from_defaults(
            docstore=AzureDocumentStore.from_connection_string(
                constants.STORAGE_ACCOUNT_CONNECTION_STRING,
                service_mode=ServiceMode.STORAGE,
                partition_key=service_name
            ),
            index_store=AzureIndexStore.from_connection_string(
                constants.STORAGE_ACCOUNT_CONNECTION_STRING,
                service_mode=ServiceMode.STORAGE,
                partition_key=service_name
            )
        )
        storage_context.docstore.add_documents(nodes)

        index = VectorStoreIndex(nodes, storage_context=storage_context, show_progress=True)
        index.storage_context.persist()

But it throws the following error when trying to index a large file:

2024-08-09 12:24:55,996 - ERROR - The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.
RequestId:5897b37c-a002-000e-5257-ea4e3b000000
Time:2024-08-09T12:24:55.9923447Z
ErrorCode:PropertyValueTooLarge
Content: {"odata.error":{"code":"PropertyValueTooLarge","message":{"lang":"en-US","value":"The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.\nRequestId:5897b37c-a002-000e-5257-ea4e3b000000\nTime:2024-08-09T12:24:55.9923447Z"}}}
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/azure/data/tables/_table_client.py", line 658, in upsert_entity
    self._client.table.update_entity(
  File "/usr/local/lib/python3.12/site-packages/azure/core/tracing/decorator.py", line 78, in wrapper_use_tracer
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/azure/data/tables/_generated/operations/_operations.py", line 1280, in update_entity
    raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: Operation returned an invalid status 'Bad Request'
Content: {"odata.error":{"code":"PropertyValueTooLarge","message":{"lang":"en-US","value":"The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.\nRequestId:5897b37c-a002-000e-5257-ea4e3b000000\nTime:2024-08-09T12:24:55.9923447Z"}}}

dosubot[bot] commented 1 month ago

To resolve the "PropertyValueTooLarge" error when trying to index large files using Azure Table Storage, you need to split the large property value into multiple parts. Here is a function split_large_property_value that can help with this:

def split_large_property_value(num_parts: int, bytes_val: bytes, key: str) -> dict:
    """Split a large property value into multiple parts."""
    parts = {}
    for i in range(num_parts):
        start_index = i * STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE
        end_index = start_index + STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE
        # Decode each UTF-16 byte slice back to a str; errors="ignore" drops any code unit split at a slice boundary
        serialized_part = bytes_val[start_index:end_index].decode(
            "utf-16", errors="ignore"
        )
        parts[f"{key}{STORAGE_PART_KEY_DELIMITER}{i + 1}"] = serialized_part
    return parts

To use this function, follow these steps:

  1. Serialize the value to a JSON string and encode it to UTF-16 bytes.
  2. Compute the number of parts required to split the value.
  3. Split the value into parts using the split_large_property_value function.
  4. Store the parts in Azure Table Storage.

Here is an example of how you can do this:

def serialize_and_split_large_property(key: str, value: Any) -> dict:
    serialized_val, bytes_val, val_length = serialize_and_encode(value)
    num_parts = compute_table_property_part_count(val_length)
    return split_large_property_value(num_parts, bytes_val, key)

# Example usage
large_value = "..."  # Your large value here
key = "largeProperty"
split_parts = serialize_and_split_large_property(key, large_value)
# Now you can store `split_parts` in Azure Table Storage

Additionally, if you are using the AzureDocumentStore from the LlamaIndex framework, it handles splitting nodes across multiple properties to stay within Azure Table Storage limits. This can be particularly useful for large documents, as it allows you to store data up to the 1MiB "Item" limit by splitting the data across multiple properties [1][2].
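For a rough sense of the numbers involved, here is a small illustrative sketch (not library code) of that part-count arithmetic, assuming the 64KB per-property and 1MiB per-entity limits quoted above; the constant names simply mirror the snippets in this thread:

# Illustrative only: assumed limit values, named after the constants used in
# the snippets in this thread.
STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE = 64 * 1024         # 64KB per property
STORAGE_MAX_TOTAL_PROPERTIES_SIZE_BYTES = 1024 * 1024    # 1MiB per entity

def parts_needed(val_length: int) -> int:
    """Ceiling division: how many 64KB part-properties a value of this byte length needs."""
    return -(-val_length // STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE)

# A ~300KB serialized node would be spread across 5 part-properties, and an
# entity can hold at most 16 such parts before hitting the 1MiB limit.
print(parts_needed(300_000))  # 5
print(STORAGE_MAX_TOTAL_PROPERTIES_SIZE_BYTES // STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE)  # 16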

To continue talking to Dosu, mention @dosu.

prabhuperumalofficial commented 1 month ago

@dosu

As noted in the original message, we are using AzureDocumentStore from the llama_index.storage.docstore.azure package:

        storage_context = StorageContext.from_defaults(
            docstore=AzureDocumentStore.from_connection_string(
                constants.STORAGE_ACCOUNT_CONNECTION_STRING,
                service_mode=ServiceMode.STORAGE,
                partition_key=service_name
            ),
            index_store=AzureIndexStore.from_connection_string(
                constants.STORAGE_ACCOUNT_CONNECTION_STRING,
                service_mode=ServiceMode.STORAGE,
                partition_key=service_name
            )
        )

        storage_context.docstore.add_documents(nodes)

        index = VectorStoreIndex(nodes, storage_context=storage_context, show_progress=True)
        index.storage_context.persist()

The error is raised on this line: index = VectorStoreIndex(nodes, storage_context=storage_context, show_progress=True)

dosubot[bot] commented 1 month ago

The AzureDocumentStore class in the llama_index.storage.docstore.azure package is designed to handle large documents by splitting them across multiple properties to stay within Azure Table Storage limits. This should help avoid the "PropertyValueTooLarge" error.

Here is a relevant snippet from the AzureDocumentStore class that shows how it handles large documents:

async def async_add_documents(
    self,
    nodes: Sequence[BaseNode],
    allow_update: bool = True,
    batch_size: Optional[int] = None,
    store_text: bool = True,
) -> None:
    """Add documents to the store."""
    batch_size = batch_size or self._batch_size

    (
        node_kv_pairs,
        metadata_kv_pairs,
        ref_doc_kv_pairs,
    ) = await super()._async_prepare_kv_pairs(nodes, allow_update, store_text)

    # Change ref_doc_kv_pairs
    ref_doc_kv_pairs = self._extract_doc_metadatas(ref_doc_kv_pairs)

    await asyncio.gather(
        self._kvstore.aput_all(
            node_kv_pairs,
            collection=self._node_collection,
            batch_size=batch_size,
        ),
        self._kvstore.aput_all(
            metadata_kv_pairs,
            collection=self._metadata_collection,
            batch_size=batch_size,
        ),
        self._kvstore.aput_all(
            ref_doc_kv_pairs,
            collection=self._ref_doc_collection,
            batch_size=batch_size,
        ),
    )

This method batches the writes; the actual splitting of oversized values happens when the underlying Azure key-value store serializes each entity, as shown below.

Additionally, ensure that the serialize function in the llama_index.utils.azure.table module is correctly handling the serialization and splitting of large property values:

def serialize(service_mode: ServiceMode, value: dict) -> dict:
    """
    Serialize all values in a dictionary to JSON strings to ensure compatibility
    with Azure Table Storage. The Azure Table Storage API does not support
    complex data types like dictionaries or nested objects directly as values in
    entity properties; they need to be serialized to JSON strings.
    """
    item = {}
    num_properties = len(value)
    size_properties = 0
    for key, val in value.items():
        # Serialize all values for the sake of size calculation
        serialized_val, bytes_val, val_length = serialize_and_encode(val)

        size_properties += val_length
        validate_table_total_property_size(service_mode, size_properties)

        # Skips serialization for non-enums and non-serializable types
        if not isinstance(val, Enum) and isinstance(val, NON_SERIALIZABLE_TYPES):
            item[key] = val
            continue

        # Unlike Azure Table Storage, Cosmos DB does not have per-property limits
        if service_mode != ServiceMode.STORAGE:
            continue

        # No need to split the property into parts
        if val_length < STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE:
            item[key] = serialized_val
            continue

        num_parts = compute_table_property_part_count(val_length)
        num_properties += num_parts

        validate_table_property_count(num_properties)

        parts = split_large_property_value(num_parts, bytes_val, key)
        item.update(parts)

    return item

Ensure that your AzureDocumentStore and AzureIndexStore are correctly configured to handle large documents by splitting them into smaller parts. This should help resolve the "PropertyValueTooLarge" error [1][2].

To continue talking to Dosu, mention @dosu.

prabhuperumalofficial commented 1 month ago

@dosu

How do we solve this issue with AzureDocumentStore? Are there any additional properties to be passed to handle it automatically?

dosubot[bot] commented 1 month ago

To handle the "PropertyValueTooLarge" error automatically with AzureDocumentStore, ensure that it is initialized with ServiceMode.STORAGE. In this mode, large property values are split into smaller parts to stay within Azure Table Storage limits.

Here is how you can initialize the AzureDocumentStore with the ServiceMode.STORAGE:

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.storage.kvstore.azure.base import ServiceMode
from llama_index.storage.docstore.azure import AzureDocumentStore
from llama_index.storage.index_store.azure import AzureIndexStore

connection_string = constants.STORAGE_ACCOUNT_CONNECTION_STRING
namespace = service_name

azure_document_store = AzureDocumentStore.from_connection_string(
    connection_string=connection_string,
    namespace=namespace,
    service_mode=ServiceMode.STORAGE
)

storage_context = StorageContext.from_defaults(
    docstore=azure_document_store,
    index_store=AzureIndexStore.from_connection_string(
        connection_string=connection_string,
        service_mode=ServiceMode.STORAGE,
        partition_key=service_name
    )
)

storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(nodes, storage_context=storage_context, show_progress=True)
index.storage_context.persist()

By setting service_mode=ServiceMode.STORAGE, the AzureDocumentStore will handle splitting the data across multiple properties, thus avoiding the PropertyValueTooLarge error [1][2].
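If the error persists even with ServiceMode.STORAGE, it may be worth confirming that the installed Azure storage packages actually contain the splitting helpers discussed above. A minimal check, assuming the llama_index.utils.azure.table module path mentioned earlier in this thread:

# Assumption: the helpers live in llama_index.utils.azure.table, as referenced
# earlier in this thread. If this import fails, the installed llama-index
# Azure packages may predate the property-splitting support.
from llama_index.utils.azure.table import serialize, split_large_property_value

print(serialize.__module__)
print(split_large_property_value.__module__)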

To continue talking to Dosu, mention @dosu.

prabhuperumalofficial commented 1 month ago

@dosu

I get the same result even after using the above code.

Can someone else help me here?

Code:

        storage_context = StorageContext.from_defaults(
            docstore=AzureDocumentStore.from_connection_string(
                constants.STORAGE_ACCOUNT_CONNECTION_STRING,
                service_mode=ServiceMode.STORAGE,
                namespace=service_name
            ),
            index_store=AzureIndexStore.from_connection_string(
                constants.STORAGE_ACCOUNT_CONNECTION_STRING,
                service_mode=ServiceMode.STORAGE,
                partition_key=service_name
            )
        )

        storage_context.docstore.add_documents(nodes)

        index = VectorStoreIndex(nodes, storage_context=storage_context, show_progress=True)
        index.storage_context.persist()

Stacktrace:

2024-08-09 23:43:00,806  - ERROR  -  The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.
RequestId:d97da7d5-3002-0041-1f87-ea3f6f000000
Time:2024-08-09T18:13:00.7082413Z
ErrorCode:PropertyValueTooLarge
Content: {"odata.error":{"code":"PropertyValueTooLarge","message":{"lang":"en-US","value":"The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.\nRequestId:d97da7d5-3002-0041-1f87-ea3f6f000000\nTime:2024-08-09T18:13:00.7082413Z"}}}
Traceback (most recent call last):
  File "D:\llamaindex\venv\Lib\site-packages\azure\data\tables\_table_client.py", line 658, in upsert_entity
    self._client.table.update_entity(
  File "D:\llamaindex\venv\Lib\site-packages\azure\core\tracing\decorator.py", line 78, in wrapper_use_tracer
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\llamaindex\venv\Lib\site-packages\azure\data\tables\_generated\operations\_operations.py", line 1280, in update_entity
    raise HttpResponseError(response=response, model=error)
azure.core.exceptions.HttpResponseError: Operation returned an invalid status 'Bad Request'
Content: {"odata.error":{"code":"PropertyValueTooLarge","message":{"lang":"en-US","value":"The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.\nRequestId:d97da7d5-3002-0041-1f87-ea3f6f000000\nTime:2024-08-09T18:13:00.7082413Z"}}}

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\llamaindex\app\utils\helper.py", line 305, in persist_index_into_docstore
    index = VectorStoreIndex(nodes, storage_context=storage_context, show_progress=True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\llamaindex\venv\Lib\site-packages\llama_index\core\indices\vector_store\base.py", line 78, in __init__
    super().__init__(
  File "D:\llamaindex\venv\Lib\site-packages\llama_index\core\indices\base.py", line 98, in __init__
    self._storage_context.index_store.add_index_struct(self._index_struct)
  File "D:\llamaindex\venv\Lib\site-packages\llama_index\core\storage\index_store\keyval_index_store.py", line 46, in add_index_struct
    self._kvstore.put(key, data, collection=self._collection)
  File "D:\llamaindex\venv\Lib\site-packages\llama_index\storage\kvstore\azure\base.py", line 189, in put
    table_client.upsert_entity(
  File "D:\llamaindex\venv\Lib\site-packages\azure\core\tracing\decorator.py", line 78, in wrapper_use_tracer
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\llamaindex\venv\Lib\site-packages\azure\data\tables\_table_client.py", line 672, in upsert_entity
    _process_table_error(error, table_name=self.table_name)
  File "D:\llamaindex\venv\Lib\site-packages\azure\data\tables\_error.py", line 210, in _process_table_error
    raise decoded_error from storage_error
azure.core.exceptions.HttpResponseError: The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.
RequestId:d97da7d5-3002-0041-1f87-ea3f6f000000
Time:2024-08-09T18:13:00.7082413Z
ErrorCode:PropertyValueTooLarge
Content: {"odata.error":{"code":"PropertyValueTooLarge","message":{"lang":"en-US","value":"The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.\nRequestId:d97da7d5-3002-0041-1f87-ea3f6f000000\nTime:2024-08-09T18:13:00.7082413Z"}}}

dosubot[bot] commented 1 month ago

This is a tough nut to crack, could really use your help on this, @logan-markewich.

To continue talking to Dosu, mention @dosu.

prabhuperumalofficial commented 3 weeks ago

@logan-markewich any updates on this issue?

prabhuperumalofficial commented 2 weeks ago

@dosu any help on this ticket would be appreciated

dosubot[bot] commented 2 weeks ago

To resolve the "PropertyValueTooLarge" error when using AzureDocumentStore from the llama_index.storage.docstore.azure package, you need to ensure that large property values are split into smaller parts. Here are the steps and relevant code to handle this:

  1. Serialize the value: Convert the value to a JSON string and encode it to UTF-16 bytes.
  2. Compute the number of parts: Determine how many parts the value needs to be split into based on the maximum property value size.
  3. Split the value: Divide the large property value into multiple parts.
  4. Validate the total size: Ensure the total size of all properties does not exceed the service limits.

Here is the relevant code to handle these steps:

import json
from typing import Any, Tuple

def serialize_and_encode(value: Any) -> Tuple[str, bytes, int]:
    serialized_val = json.dumps(value)
    bytes_val = serialized_val.encode("utf-16", errors="ignore")
    val_length = len(bytes_val)
    return serialized_val, bytes_val, val_length

def compute_table_property_part_count(val_length: int) -> int:
    return val_length // STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE + (
        1 if val_length % STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE else 0
    )

def split_large_property_value(num_parts: int, bytes_val: bytes, key: str) -> dict:
    parts = {}
    for i in range(num_parts):
        start_index = i * STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE
        end_index = start_index + STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE
        serialized_part = bytes_val[start_index:end_index].decode("utf-16", errors="ignore")
        parts[f"{key}{STORAGE_PART_KEY_DELIMITER}{i + 1}"] = serialized_part
    return parts

def validate_table_total_property_size(service_mode: ServiceMode, current_size: int) -> None:
    if service_mode == ServiceMode.STORAGE and current_size > STORAGE_MAX_TOTAL_PROPERTIES_SIZE_BYTES:
        raise ValueError(
            f"The total size of all properties in an Azure Table Storage Item "
            f"cannot exceed {STORAGE_MAX_TOTAL_PROPERTIES_SIZE_BYTES / 1048576}MiB.\n"
            "Consider splitting documents into smaller parts."
        )
    elif service_mode == ServiceMode.COSMOS and current_size > COSMOS_MAX_TOTAL_PROPERTIES_SIZE_BYTES:
        raise ValueError(
            f"The total size of all properties in an Azure Cosmos DB Item "
            f"cannot exceed {COSMOS_MAX_TOTAL_PROPERTIES_SIZE_BYTES / 1000000}MB.\n"
            "Consider splitting documents into smaller parts."
        )

By using these functions, you can split large property values into smaller parts to avoid the "PropertyValueTooLarge" error [1][2].
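As an illustration only (not code from the library), here is how those helpers could be exercised end to end. It assumes the functions above are in scope and uses placeholder values for the STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE and STORAGE_PART_KEY_DELIMITER constants they reference; the library's real values may differ:

# Placeholder values for the constants the helpers above refer to; the
# library's actual values are an assumption here.
STORAGE_MAX_ITEM_PROPERTY_VALUE_SIZE = 64 * 1024
STORAGE_PART_KEY_DELIMITER = "_part_"

large_value = {"text": "x" * 100_000}  # hypothetical oversized node payload

serialized_val, bytes_val, val_length = serialize_and_encode(large_value)
num_parts = compute_table_property_part_count(val_length)
parts = split_large_property_value(num_parts, bytes_val, "data")

print(val_length)     # ~200KB once UTF-16 encoded
print(num_parts)      # 4 with the assumed 64KB limit
print(sorted(parts))  # ['data_part_1', 'data_part_2', 'data_part_3', 'data_part_4']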

Additionally, ensure that your AzureDocumentStore and AzureIndexStore are correctly configured to handle large documents by splitting them into smaller parts. This should help resolve the "PropertyValueTooLarge" error.

To continue talking to Dosu, mention @dosu.