feat(ai): start corpora_ai abstract interface and corpora_ai_openai implementation

skyl commented 2 weeks ago

PR Type

Enhancement, Tests, Documentation

Description

Introduced an abstract interface LLMBaseInterface for LLM providers with methods for text completion and embedding generation.
Implemented dynamic loading of LLM providers with load_llm_provider, supporting OpenAI and handling errors for missing API keys or unsupported providers.
Developed OpenAIClient class to interact with OpenAI's API for text completion and embedding, including error handling for empty inputs.
Added comprehensive unit tests for provider loading and OpenAI client functionalities.
Documented the usage and features of corpora_ai and corpora_ai_openai, including setup instructions and API usage examples.
Updated project documentation to reflect the new directory structure and dependencies.
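
For readers without the diff open, the abstraction described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the dataclass field names (`role`, `text`), the method signatures, and the `EchoProvider` class are assumptions made for the example. Only the names `ChatCompletionTextMessage`, `LLMBaseInterface`, `get_text_completion`, and `generate_embedding` come from the PR itself.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class ChatCompletionTextMessage:
    # Field names are assumed for illustration; the PR only names the dataclass.
    role: str  # e.g. "system", "user", or "assistant"
    text: str


class LLMBaseInterface(ABC):
    """Abstract contract that every LLM provider is expected to satisfy."""

    @abstractmethod
    def get_text_completion(self, messages: List[ChatCompletionTextMessage]) -> str:
        """Return the model's reply to a list of chat messages."""

    @abstractmethod
    def generate_embedding(self, text: str) -> List[float]:
        """Return an embedding vector for the given text."""


class EchoProvider(LLMBaseInterface):
    """Hypothetical stand-in provider; it makes no network calls."""

    def get_text_completion(self, messages: List[ChatCompletionTextMessage]) -> str:
        if not messages:
            raise ValueError("Input messages must not be empty.")
        # Echo the last message back instead of calling a model.
        return messages[-1].text

    def generate_embedding(self, text: str) -> List[float]:
        if not text:
            raise ValueError("Input text must not be empty.")
        # Toy two-dimensional "embedding", not a real model output.
        return [float(len(text)), float(text.count(" "))]
```

Because callers depend only on `LLMBaseInterface`, a stand-in such as `EchoProvider` can replace a real client in tests without touching calling code.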

Changes walkthrough 📝

Relevant files

Enhancement

3 files

llm_interface.py (+45/-0): Define abstract interface for LLM providers
py/packages/corpora_ai/llm_interface.py
- Introduced `ChatCompletionTextMessage` dataclass for message representation.
- Defined `LLMBaseInterface` abstract class for LLM providers.
- Added abstract methods for text completion and embedding generation.

provider_loader.py (+34/-0): Implement dynamic LLM provider loading mechanism
py/packages/corpora_ai/provider_loader.py
- Implemented dynamic loading of LLM providers.
- Included OpenAI client with environment variable checks.
- Added error handling for missing API keys or unsupported providers.

llm_client.py (+32/-0): Implement OpenAI client for LLM interactions
py/packages/corpora_ai_openai/llm_client.py
- Implemented `OpenAIClient` class for OpenAI API interaction.
- Provided methods for text completion and embedding generation.
- Included error handling for empty inputs.

Tests

2 files

test_provider_loader.py (+65/-0): Add unit tests for LLM provider loader
py/packages/corpora_ai/test_provider_loader.py
- Added unit tests for `load_llm_provider` function.
- Tested scenarios for successful loading, missing API keys, and invalid providers.

test_llm_client.py (+75/-0): Add unit tests for OpenAI client
py/packages/corpora_ai_openai/test_llm_client.py
- Added unit tests for `OpenAIClient` methods.
- Tested text completion and embedding generation.
- Included tests for error handling on empty inputs.

Documentation

3 files

README.md (+41/-0): Document corpora_ai abstraction and usage
py/packages/corpora_ai/README.md
- Documented `corpora_ai` abstraction layer and usage.
- Explained provider loading and API usage for text completion and embedding.

README.md (+44/-0): Document OpenAI implementation and usage
py/packages/corpora_ai_openai/README.md
- Documented `corpora_ai_openai` features and usage.
- Provided instructions for initializing and using OpenAI client.

about-structure.md (+17/-127): Update directory structure documentation
md/prompts/corpora/about-structure.md
- Updated directory structure documentation.

Dependencies

1 file

requirements.txt (+1/-0): Add OpenAI package dependency
py/packages/corpora_ai_openai/requirements.txt
- Added OpenAI Python package dependency.


github-actions[bot] commented 2 weeks ago

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪

🧪 PR contains tests

🔒 No security concerns identified

⚡ Recommended focus areas for review

Error Handling
The `load_llm_provider` function raises a `ValueError` if the `OPENAI_API_KEY` is not set or if no valid LLM provider is found. Consider whether this is the best way to handle these errors or if a more user-friendly error message or logging might be beneficial.

Input Validation
The `OpenAIClient` class raises a `ValueError` for empty input in `get_text_completion` and `generate_embedding` methods. Ensure that this is the desired behavior and consider if additional input validation is necessary.
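
One way to act on the error-handling note is sketched below. It keeps `ValueError` but makes each message say what to fix, and logs before raising. This is an illustration under stated assumptions, not the `load_llm_provider` in this PR: the function name `load_provider_from_env`, the `LLM_PROVIDER` variable, the `PROVIDERS` registry, and `StubClient` are all hypothetical. Only `OPENAI_API_KEY` and the use of `ValueError` come from the review text.

```python
import logging
from typing import Callable, Dict, Mapping, Tuple

logger = logging.getLogger("corpora_ai.sketch")


class StubClient:
    """Hypothetical placeholder for a real provider client."""

    def __init__(self, api_key: str):
        self.api_key = api_key


# Provider name -> (required environment variable, client factory).
# Only "openai" is named in the PR; the registry shape is an assumption.
PROVIDERS: Dict[str, Tuple[str, Callable[[str], object]]] = {
    "openai": ("OPENAI_API_KEY", StubClient),
}


def load_provider_from_env(env: Mapping[str, str]):
    """Build a provider client from an environment-like mapping."""
    # LLM_PROVIDER is an assumed variable name, defaulting to "openai".
    name = env.get("LLM_PROVIDER", "openai")
    if name not in PROVIDERS:
        msg = f"Unsupported LLM provider {name!r}; choose one of {sorted(PROVIDERS)}."
        logger.error(msg)
        raise ValueError(msg)

    key_var, factory = PROVIDERS[name]
    api_key = env.get(key_var)
    if not api_key:
        msg = f"{key_var} must be set to use the {name!r} provider."
        logger.error(msg)
        raise ValueError(msg)
    return factory(api_key)
```

Passing the environment as a mapping (for example `os.environ`) rather than reading it inside the function keeps the loader easy to test with a plain dictionary.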

github-actions[bot] commented 2 weeks ago

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Best practice (score: 8)

Suggestion: Implement error handling for API calls to manage exceptions effectively

Add error handling for the OpenAI API calls to manage potential exceptions such as network issues or invalid API responses.

[py/packages/corpora_ai_openai/llm_client.py [23-25]](https://github.com/skyl/corpora/pull/14/files#diff-de8da8414122015059375c328610dcc9a2d9550504ca03bfaa97ec5eed468407R23-R25)

```diff
-response = self.client.chat.completions.create(
-    model=self.completion_model, messages=message_dicts
-)
+try:
+    response = self.client.chat.completions.create(
+        model=self.completion_model, messages=message_dicts
+    )
+except Exception as e:
+    raise RuntimeError("Failed to get text completion from OpenAI API") from e
```

Suggestion importance[1-10]: 8

Why: Adding error handling for API calls is crucial for robustness, as it prevents the application from crashing due to network issues or invalid responses. This suggestion significantly enhances the reliability of the code.

Category: Possible issue (score: 7)

Suggestion: Add validation for the API key to prevent initialization with an empty value

Validate the `api_key` parameter in the `OpenAIClient` constructor to ensure it is not empty or invalid.

[py/packages/corpora_ai_openai/llm_client.py [14]](https://github.com/skyl/corpora/pull/14/files#diff-de8da8414122015059375c328610dcc9a2d9550504ca03bfaa97ec5eed468407R14-R14)

```diff
+if not api_key:
+    raise ValueError("API key must not be empty.")
 self.client = OpenAI(api_key=api_key)
```

Suggestion importance[1-10]: 7

Why: Validating the API key ensures that the client is not initialized with an invalid or empty key, which is essential for preventing runtime errors and ensuring proper API usage.

Category: Possible bug (score: 6)

Suggestion: Verify the length of the embedding vector to ensure it matches expected dimensions

Ensure that the `generate_embedding` method checks the length of the returned embedding vector to confirm it meets expected dimensions.

[py/packages/corpora_ai_openai/llm_client.py [31]](https://github.com/skyl/corpora/pull/14/files#diff-de8da8414122015059375c328610dcc9a2d9550504ca03bfaa97ec5eed468407R31-R31)

```diff
-return response.data[0].embedding
+embedding = response.data[0].embedding
+if len(embedding) != expected_length:
+    raise ValueError("Unexpected embedding vector length.")
+return embedding
```

Suggestion importance[1-10]: 6

Why: Checking the length of the embedding vector can help catch unexpected API behavior or changes in the model's output, thus maintaining the integrity of the data processing pipeline. However, the suggestion lacks context on what the expected length should be, which limits its immediate applicability.
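
The last suggestion leaves `expected_length` undefined. One possible way to address that is to make it an explicit, optional parameter that the caller supplies for the model in use. The sketch below is hypothetical: `checked_embedding` and its `fetch` callable are names invented for illustration, and `fetch` stands in for whatever performs the real API call. It also folds in the first suggestion by wrapping call failures in a `RuntimeError`.

```python
from typing import Callable, List, Optional, Sequence


def checked_embedding(
    fetch: Callable[[str], Sequence[float]],
    text: str,
    expected_length: Optional[int] = None,
) -> List[float]:
    """Fetch an embedding and optionally verify its dimensionality.

    `fetch` is a hypothetical callable standing in for the real API call.
    """
    if not text:
        raise ValueError("Input text must not be empty.")
    try:
        embedding = list(fetch(text))
    except Exception as e:
        # Wrap transport or API failures, keeping the original as the cause.
        raise RuntimeError("Failed to generate embedding") from e
    # Only check the length when the caller states what to expect.
    if expected_length is not None and len(embedding) != expected_length:
        raise ValueError(
            f"Expected {expected_length} dimensions, got {len(embedding)}."
        )
    return embedding
```

Leaving `expected_length` as `None` by default preserves the current behavior, so the check only applies where the caller knows the model's output size.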
