PulkitMishra commented 1 month ago

Improve Error Handling in Indexify Python SDK

Issue Description

The current implementation of the Indexify Python SDK lacks robust error handling and reporting mechanisms.

Specific Examples

In indexify/remote_client.py, the _request method:

def _request(self, method: str, **kwargs) -> httpx.Response:
    try:
        response = self._client.request(method, timeout=self._timeout, **kwargs)
        status_code = str(response.status_code)
        if status_code.startswith("4"):
            raise ApiException(
                "status code: " + status_code + " request args: " + str(kwargs)
            )
        if status_code.startswith("5"):
            raise ApiException(response.text)
    except httpx.ConnectError:
        message = (
            f"Make sure the server is running and accesible at {self._service_url}"
        )
        ex = ApiException(status="ConnectionError", message=message)
        print(ex)
        raise ex
    return response

Issues:

Only handles httpx.ConnectError, ignoring other potential exceptions.
Prints the exception before raising it, which may not be appropriate for all use cases.
Doesn't provide detailed context about the failed operation.

In indexify/executor/function_worker.py, the async_submit method:

async def async_submit(
    self,
    namespace: str,
    graph_name: str,
    fn_name: str,
    input: IndexifyData,
    code_path: str,
    version: int,
    init_value: Optional[IndexifyData] = None,
) -> FunctionWorkerOutput:
    try:
        result = await asyncio.get_running_loop().run_in_executor(
            self._executor,
            _run_function,
            namespace,
            graph_name,
            fn_name,
            input,
            code_path,
            version,
            init_value,
        )
    except BrokenProcessPool as mp:
        self._executor.shutdown(wait=True, cancel_futures=True)
        traceback.print_exc()
        raise mp
    except FunctionRunException as e:
        print(e)
        print(traceback.format_exc())
        return FunctionWorkerOutput(
            exception=str(e),
            stdout=e.stdout,
            stderr=e.stderr,
            reducer=e.is_reducer,
            success=False,
        )

Issues:

Prints exception information directly, which may not be appropriate for all environments.
Doesn't provide a way to customize error handling or logging.
Doesn't capture or report the full context of the error (e.g., input data, function details).

In indexify/executor/agent.py, the task_completion_reporter method:

async def task_completion_reporter(self):
    console.print(Text("Starting task completion reporter", style="bold cyan"))
    url = f"{self._protocol}://{self._server_addr}/write_content"
    while True:
        outcomes = await self._task_store.task_outcomes()
        for task_outcome in outcomes:
            # ... (omitted for brevity)
            try:
                self._task_reporter.report_task_outcome(completed_task=task_outcome)
            except Exception as e:
                console.print(
                    Panel(
                        f"Failed to report task {task_outcome.task.id}\n"
                        f"Exception: {e}\nRetrying...",
                        title="Reporting Error",
                        border_style="error",
                    )
                )
                await asyncio.sleep(5)
                continue

Issues:

Uses a generic Exception catch, which may mask specific errors.
Doesn't provide a mechanism for custom error handling or reporting.
The retry mechanism is simplistic and may lead to infinite retries for persistent errors.

Proposed Solution

Create a custom exception hierarchy:
- Implement a base IndexifyException class.
- Create specific exception subclasses for different types of errors (e.g., NetworkError, ExecutionError, ConfigurationError).
Implement a centralized error handling and logging mechanism:
- Create an ErrorHandler class that can be configured with custom logging and reporting options.
- Use this ErrorHandler consistently throughout the SDK.
Enhance error context:
- Modify exception classes to include more context (e.g., function name, input data summary, graph details).
- Implement a method to safely serialize error context, avoiding potential issues with unpicklable objects.
Improve retry mechanisms:
- Implement an exponential backoff strategy for retries.
- Allow configuration of retry attempts and conditions.
Add error callback support:
- Allow users to register custom error callbacks for specific types of errors.

Implementation Plan

Define the exception hierarchy in a new file indexify/exceptions.py.
Implement the ErrorHandler class in indexify/error_handling.py.
Modify existing code to use the new exception classes and ErrorHandler:
- Update remote_client.py to use specific exceptions and the ErrorHandler.
- Refactor function_worker.py to provide more context in errors and use the ErrorHandler.
- Enhance agent.py with improved error handling and retry logic.
Add configuration options for error handling in the client initialization.
Update documentation to reflect the new error handling capabilities.
Add unit tests for the new error handling mechanisms.

stangirala commented 1 month ago

There's a few different types of errors. I guess this will surface when you make the changes.

For the worker and related code (anything that ends up touching the user, ie. graph, code), does it make sense to capture only the task Id rather than a lot of data such as inputs? With task id we might be able to use the server blog store to retrieve inputs/outputs.

PulkitMishra commented 1 month ago

@stangirala ah ok i get the context that you were talking about now wrt #900 . few thoughts

Error Categorization: I think as we implement the new error handling system, we'll create a flexible hierarchy of error types. This should allow us to categorize errors more precisely as they surface during development and testing.
Error Context: For worker-related errors, you're right that including full input data might be excessive. What we can do is : a) Include only essential information in the initial error report (task ID, error type, brief message). b) Provide a method to fetch detailed error context on demand, using the task ID to retrieve data from the server's blob store. This will reduce the amount of potentially sensitive data in error logs, minimize performance impact of error reporting, allow for more detailed debugging when necessary. as we all know how some that some workflow orchestrators that shall remain unnamed are the bane of folks doing mle simply because of garbage logging and not providing logs of intermediate stages in a nice tidy way
Configurable Error Detail Level: We can obviously add configuration options allowing users to set the level of detail included in error reports. This could range from minimal (just task ID and error type) to comprehensive (including sanitized input data summaries).

PulkitMishra commented 1 month ago

related #909

tensorlakeai / indexify