Centralize and standardize retry and exception handling across services using a retry utility.

The existing program has a high-level design flaw in how its service functions handle errors and retries:

Issue

The services/llm module retries API requests when they raise exceptions. However, this retry logic is hardcoded within the module and applied in isolation, without regard for potentially conflicting retry strategies in other parts of the application.

Proposed Solution

Centralize Exception Handling - Create a centralized retry and exception handling utility that can be imported and used across different modules. This will allow for consistent retry logic and easier maintenance.
Configuration Driven - Externalize the retry configuration to a settings file or environment variables, so that retry parameters can be tuned without modifying the codebase.
Logging Enhancements - Improve the logging to provide more context about retries and failures.
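
As a rough illustration of the "Configuration Driven" point, the retry parameters could be read from environment variables with sensible defaults. This is only a sketch: the `RetryConfig` class, the `load_retry_config` helper, and the `RETRY_TRIES` / `RETRY_DELAY` / `RETRY_BACKOFF` variable names are hypothetical and do not exist in the repository.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryConfig:
    """Hypothetical container for retry settings (not part of the codebase)."""
    tries: int = 4
    delay: float = 3.0
    backoff: float = 2.0


def load_retry_config(env=None, prefix="RETRY_"):
    """Build a RetryConfig from a mapping of environment variables.

    Missing variables fall back to the RetryConfig defaults. The variable
    names (RETRY_TRIES, RETRY_DELAY, RETRY_BACKOFF) are assumptions.
    """
    env = os.environ if env is None else env
    defaults = RetryConfig()
    return RetryConfig(
        tries=int(env.get(prefix + "TRIES", defaults.tries)),
        delay=float(env.get(prefix + "DELAY", defaults.delay)),
        backoff=float(env.get(prefix + "BACKOFF", defaults.backoff)),
    )
```

The loaded values could then be passed to the decorator, for example `retry_on_exception(SomeError, tries=cfg.tries, delay=cfg.delay, backoff=cfg.backoff)`.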

Example Code Snippet

# utils/retry_util.py

import functools
import logging
from time import sleep

def retry_on_exception(exception_to_check, tries=4, delay=3, backoff=2, logger=None):
    def decorator_retry(func):
        @functools.wraps(func)
        def func_retry(*args, **kwargs):
            _logger = logger or logging.getLogger(__name__)
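            mtries, mdelay = tries, delay

            # Retry until only one attempt remains, so that `tries` is the
            # total number of calls rather than the number of retries.
            while mtries > 1:
                try:
                    return func(*args, **kwargs)
                except exception_to_check as e:
                    _logger.warning(f"{e}, retrying in {mdelay} seconds...")
                    sleep(mdelay)
                    mtries -= 1
                    mdelay *= backoff
            # Final attempt: any exception raised here propagates to the caller.
            return func(*args, **kwargs)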
        return func_retry
    return decorator_retry

Then, apply this utility across the necessary modules like services/llm:

# In services/llm/__init__.py

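from typing import Dict, List

import openai

from utils.retry_util import retry_on_exception
# llm_exceptions is assumed to be imported elsewhere in this module.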

@retry_on_exception(llm_exceptions.LLMException, tries=3, delay=2, backoff=2)
def generate_text(
    messages: List[Dict[str, str]],
    openai_client: openai.OpenAI,
) -> str:
    # Existing implementation
    ...
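
The "Logging Enhancements" point could be addressed by having the utility log the function name and attempt count, and by logging an error once all attempts are exhausted. The variant below is a minimal sketch of that idea; the name `retry_with_context` and the log message formats are hypothetical, and it assumes `tries` is at least 1.

```python
import functools
import logging
import time


def retry_with_context(exceptions, tries=3, delay=1.0, backoff=2.0, logger=None):
    """Retry decorator that logs which function failed and on which attempt.

    Hypothetical sketch; assumes tries >= 1.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            log = logger or logging.getLogger(func.__module__)
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as exc:
                    if attempt == tries:
                        # Out of attempts: record the failure and re-raise.
                        log.error("%s failed after %d attempts: %r",
                                  func.__name__, tries, exc)
                        raise
                    log.warning("%s attempt %d/%d failed with %r; retrying in %.1fs",
                                func.__name__, attempt, tries, exc, wait)
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator
```

Using lazy `%`-style arguments rather than f-strings lets the logging module skip message formatting when the log level is disabled.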

Implementing these suggestions would give the application more robust error handling and retry logic, improving reliability and making retry behavior easier to manage and configure.

tawada / grass-grower