tensorlakeai / indexify

A realtime serving engine for Data-Intensive Generative AI Applications
https://docs.tensorlake.ai
Apache License 2.0

Implement Comprehensive Monitoring in Python SDK #892

Open PulkitMishra opened 1 month ago

PulkitMishra commented 1 month ago

Implement Comprehensive Monitoring for Long-running Workflows

Problem Description

Indexify currently lacks a comprehensive built-in solution for monitoring long-running workflows. This makes it difficult for users to track the progress, performance, and resource usage of their pipelines, especially in production environments.

Current Limitations

  1. Limited Progress Tracking: In remote_client.py, the invoke_graph_with_object method provides basic event information:

    print(f"[bold green]{event.event_name}[/bold green]: {event.payload}")

    However, this doesn't give a clear picture of overall progress or estimated completion time.

  2. No Performance Metrics: The FunctionWorker class in function_worker.py doesn't collect or report any performance metrics:

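    import concurrent.futures

    class FunctionWorker:
        def __init__(self, workers: int = 1) -> None:
            # Only the process pool is created; nothing here records timing or resource usage.
            self._executor: concurrent.futures.ProcessPoolExecutor = (
                concurrent.futures.ProcessPoolExecutor(max_workers=workers)
            )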

    There's no tracking of execution time, memory usage, or CPU utilization.

  3. Lack of Centralized Logging: The current logging is scattered and inconsistent. For example, in agent.py:

    console.print(f"[bold]task-reporter[/bold] uploading output of size: {len(completed_task.outputs or [])}")

    This approach doesn't provide a centralized, queryable log of system events and errors.

  4. No Real-time Monitoring Interface: There's no built-in way for users to view the current state of their workflows in real-time.
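As a rough illustration of the gap described in limitation 2, the sketch below wraps a callable and records wall time, CPU time, and peak Python-level memory. The names `measured_call` and `CallMetrics` are hypothetical and do not exist in the Indexify SDK. `tracemalloc` only sees allocations made through Python's allocator, so the memory figure is an approximation, not process RSS.

```python
import time
import tracemalloc
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class CallMetrics:
    """Hypothetical container for one call's measurements (not an SDK type)."""
    wall_seconds: float
    cpu_seconds: float
    peak_bytes: int


def measured_call(
    fn: Callable[..., Any], *args: Any, **kwargs: Any
) -> Tuple[Any, CallMetrics]:
    """Run fn and return its result together with basic resource measurements."""
    tracemalloc.start()
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        result = fn(*args, **kwargs)
    finally:
        # Take the readings even if fn raised, then stop tracing.
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, CallMetrics(wall, cpu, peak)


if __name__ == "__main__":
    value, metrics = measured_call(sum, range(1_000_000))
    print(value, metrics)
```

Because `measured_call` is a module-level function, it could in principle be passed to `ProcessPoolExecutor.submit` in place of the bare function, so the measurement happens inside the worker process. Whether that fits how `FunctionWorker` actually dispatches work is an assumption.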

Benefits of Implementing Monitoring

  1. Improved Observability: Users will be able to track the progress of their workflows, identify bottlenecks, and estimate completion times.
  2. Performance Optimization: Collected metrics will help users optimize their workflows and resource allocation.
  3. Easier Debugging: Comprehensive logging and error reporting will make it easier to identify and fix issues in complex workflows.
  4. Resource Management: Monitoring resource usage will help prevent out-of-memory errors and optimize cloud resource allocation.

Proposed Solution

Implement a comprehensive monitoring system with the following components:

  1. Metrics Collection:

    • Add a Metrics class to collect and aggregate performance data.
    • Instrument key methods in FunctionWorker, Graph, and RemoteClient to collect metrics.
  2. Centralized Logging:

    • Implement a Logger class that provides structured logging with different severity levels.
    • Replace print statements with calls to the logger.
    • Add context information (e.g., graph name, function name) to log messages.
  3. Progress Tracking:

    • Extend the Graph class to include progress information for each node.
    • Implement a progress calculation algorithm that considers the graph structure.
    • Modify RemoteClient to report progress updates.
  4. Real-time Monitoring Interface:

    • Create a Monitor class that aggregates metrics, logs, and progress information.
    • Implement a simple web interface using Flask or FastAPI to display real-time monitoring data.
    • Create visualizations for metrics and progress (e.g., using Plotly).
  5. Alerting System:

    • Add configurable alerts for specific events or metric thresholds.
    • Implement notification mechanisms (e.g., email, Slack) for alerts.
  6. Testing:

    • Write unit tests for new classes and methods.
    • Update existing tests to work with the new monitoring system.
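To make component 1 more concrete, here is a minimal sketch of what a `Metrics` class might look like. The method names (`record`, `summary`, `exceeds`) and the metric name used in the example are assumptions for illustration, not an existing Indexify API. It aggregates samples in-process only, so it would not by itself cover metrics from remote executors.

```python
import threading
from collections import defaultdict
from typing import Dict, List


class Metrics:
    """Hypothetical in-process aggregator; the shape is illustrative only."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._samples: Dict[str, List[float]] = defaultdict(list)

    def record(self, name: str, value: float) -> None:
        """Append one sample for the named metric."""
        with self._lock:
            self._samples[name].append(value)

    def summary(self, name: str) -> Dict[str, float]:
        """Return count, mean, and max for the named metric."""
        with self._lock:
            values = list(self._samples.get(name, []))
        if not values:
            return {"count": 0}
        return {
            "count": len(values),
            "mean": sum(values) / len(values),
            "max": max(values),
        }

    def exceeds(self, name: str, threshold: float) -> bool:
        """True when the latest sample is above threshold (a basis for alerts)."""
        with self._lock:
            values = self._samples.get(name)
            return bool(values) and values[-1] > threshold


if __name__ == "__main__":
    metrics = Metrics()
    metrics.record("extract_text.seconds", 0.8)  # hypothetical metric name
    metrics.record("extract_text.seconds", 1.2)
    print(metrics.summary("extract_text.seconds"))
    print(metrics.exceeds("extract_text.seconds", 1.0))
```

The `exceeds` check is the simplest possible hook for component 5. A real alerting rule would more likely look at a window of samples than at the latest one.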

diptanu commented 1 month ago

Most of this wouldn’t work by just instrumenting the Graph object, and the same goes for memory and CPU usage. Overall progress has to be measured on the server and, for local runs, in the run method of the Graph.

For logging, the first priority is to isolate user code errors from SDK errors. Using the logger interface instead of prints is fine.
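One possible reading of that logging priority, as a sketch only: route failures raised inside user functions to a dedicated logger and re-raise them under a marker exception, so they can be told apart from SDK failures. The logger names, `UserCodeError`, and `run_user_function` are all hypothetical and are not part of the Indexify SDK.

```python
import logging
from typing import Any, Callable

# Hypothetical logger names; the SDK may organize its loggers differently.
sdk_log = logging.getLogger("indexify.sdk")
user_log = logging.getLogger("indexify.user_code")


class UserCodeError(Exception):
    """Hypothetical marker for a failure raised inside a user function."""


def run_user_function(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call fn; log and re-wrap anything it raises as a user-code failure."""
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        name = getattr(fn, "__name__", repr(fn))
        # logging.exception records the traceback on the user-code logger.
        user_log.exception("user function %s failed", name)
        raise UserCodeError(str(exc)) from exc


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    def bad_extractor(text: str) -> int:
        return int(text)  # raises ValueError for non-numeric input

    try:
        run_user_function(bad_extractor, "not a number")
    except UserCodeError as err:
        sdk_log.info("task failed in user code: %s", err)
```

The original exception stays attached as `__cause__`, so the user's traceback is still available to whoever handles `UserCodeError`.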