Open PulkitMishra opened 1 month ago
Most of this wouldn't work by just instrumenting the Graph object, and the same goes for memory and CPU usage. Overall progress has to be measured on the server or, for a local run, in the run method of the Graph.
For logging, the first priority is to isolate user code errors from SDK errors. Using the logger interface instead of prints is fine.
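One way to picture the separation of user code errors from SDK errors is the rough sketch below. It uses only the standard logging module. The logger names, UserCodeError, and call_user_function are hypothetical and are not part of the Indexify SDK.

```python
import logging

# Hypothetical logger names; the SDK's real logger hierarchy may differ.
sdk_log = logging.getLogger("indexify.sdk")
user_log = logging.getLogger("indexify.user_code")


class UserCodeError(Exception):
    """Raised when a user-supplied function fails (hypothetical name)."""


def call_user_function(fn, *args, **kwargs):
    """Run user code and tag any failure as a user error."""
    name = getattr(fn, "__name__", repr(fn))
    sdk_log.debug("invoking user function %s", name)
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        # Logger.exception logs at ERROR level and attaches the traceback.
        user_log.exception("user function %s failed", name)
        # Chain the original exception so the user's traceback is preserved.
        raise UserCodeError(str(exc)) from exc
```

With this shape, anything raised outside call_user_function can be treated as an SDK error, and the two loggers can be routed or filtered independently.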
Implement Comprehensive Monitoring for Long-running Workflows
Problem Description
Indexify currently lacks a comprehensive built-in solution for monitoring long-running workflows. This makes it difficult for users to track the progress, performance, and resource usage of their pipelines, especially in production environments.
Current Limitations
Limited Progress Tracking: In remote_client.py, the invoke_graph_with_object method provides basic event information. However, this doesn't give a clear picture of overall progress or estimated completion time.
No Performance Metrics: The FunctionWorker class in function_worker.py doesn't collect or report any performance metrics. There's no tracking of execution time, memory usage, or CPU utilization.
Lack of Centralized Logging: The current logging is scattered and inconsistent, for example in agent.py. This approach doesn't provide a centralized, queryable log of system events and errors.
No Real-time Monitoring Interface: There's no built-in way for users to view the current state of their workflows in real-time.
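To make the missing performance metrics concrete, here is a minimal sketch of measuring one function call with the standard library only. CallStats and measure_call are hypothetical names, not existing SDK code. tracemalloc only sees Python-level allocations, so native memory used by extensions would need a different tool.

```python
import time
import tracemalloc
from dataclasses import dataclass


@dataclass
class CallStats:
    wall_seconds: float  # elapsed wall-clock time
    cpu_seconds: float   # CPU time of this process (user + system)
    peak_bytes: int      # peak traced Python allocations during the call


def measure_call(fn, *args, **kwargs):
    """Return (result, CallStats) for a single call of fn."""
    already_tracing = tracemalloc.is_tracing()
    if not already_tracing:
        tracemalloc.start()
    tracemalloc.reset_peak()  # requires Python 3.9+
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    try:
        result = fn(*args, **kwargs)
    finally:
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        _, peak = tracemalloc.get_traced_memory()
        if not already_tracing:
            tracemalloc.stop()
    return result, CallStats(wall, cpu, peak)
```

A worker could wrap each function invocation this way and forward the CallStats to whatever collects metrics. Tracing allocations adds overhead, so it would probably need to be opt-in.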
Benefits of Implementing Monitoring
Proposed Solution
Implement a comprehensive monitoring system with the following components:
Metrics Collection: A Metrics class to collect and aggregate performance data, with FunctionWorker, Graph, and RemoteClient updated to collect metrics.
Centralized Logging: A Logger class that provides structured logging with different severity levels.
Progress Tracking: The Graph class extended to include progress information for each node, and RemoteClient updated to report progress updates.
Real-time Monitoring Interface: A Monitor class that aggregates metrics, logs, and progress information.
Alerting System:
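As a rough illustration of how the proposed pieces could fit together, the sketch below shows a small in-memory version. The method names (record, summary, mark, fraction_done, snapshot) and the status values are assumptions made for this example. None of this exists in the SDK today, and real progress for remote runs would have to come from the server.

```python
from collections import defaultdict
from statistics import mean


class Metrics:
    """Aggregates named duration samples (hypothetical interface)."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, name, seconds):
        self._samples[name].append(seconds)

    def summary(self):
        return {
            name: {"count": len(vals), "mean": mean(vals), "max": max(vals)}
            for name, vals in self._samples.items()
        }


class ProgressTracker:
    """Tracks a status per node for one graph invocation."""

    STATUSES = {"pending", "running", "done", "failed"}

    def __init__(self, node_names):
        self._status = {name: "pending" for name in node_names}

    def mark(self, node, status):
        if node not in self._status:
            raise KeyError(f"unknown node: {node}")
        if status not in self.STATUSES:
            raise ValueError(f"unknown status: {status}")
        self._status[node] = status

    def fraction_done(self):
        if not self._status:
            return 1.0
        done = sum(1 for s in self._status.values() if s == "done")
        return done / len(self._status)


class Monitor:
    """Combines metrics and progress into a single snapshot."""

    def __init__(self, metrics, progress):
        self.metrics = metrics
        self.progress = progress

    def snapshot(self):
        return {
            "progress": self.progress.fraction_done(),
            "metrics": self.metrics.summary(),
        }
```

For example, after recording two samples for one node and marking it done in a two-node graph, snapshot() would report a progress of 0.5 alongside the per-node timing summary.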
Testing
Related Issues
#891: Improve error handling in Python SDK