skyl / corpora

Corpora is a self-building corpus that can help build other arbitrary corpora
GNU Affero General Public License v3.0

feat(plan): work on issue example get_data_completion #20

Closed · skyl closed this 1 week ago

skyl commented 1 week ago

PR Type

enhancement, other


Description


Changes walkthrough 📝

Relevant files

Enhancement

models.py
Add method to retrieve relevant text splits using embeddings
py/packages/corpora/models.py
  • Added get_relevant_splits method to the Corpus model (sketched below this table).
  • Utilizes CosineDistance for similarity calculation.
  • Integrates with load_llm_provider for embedding generation.
  +19/-1

router.py
Integrate plan router into main API router
py/packages/corpora/router.py
  • Added `plan_router` to the API router.
  +2/-0

plan.py
Add plan router with issue creation endpoint
py/packages/corpora/routers/plan.py
  • Introduced new plan router for handling issue creation.
  • Defined IssueSchema for issue data structure.
  • Implemented get_issue endpoint to generate prospective issues.
  +66/-0

llm_interface.py
Extend LLM interface with data completion method
py/packages/corpora_ai/llm_interface.py
  • Added get_data_completion abstract method to LLM interface.
  • Introduced generic type T for schema validation.
  +20/-1

llm_client.py
Implement data completion in OpenAI client
py/packages/corpora_ai_openai/llm_client.py
  • Implemented get_data_completion method in OpenAIClient.
  • Utilizes OpenAI's function calling for structured data generation.
  • Validates output with Pydantic models.
  +57/-2

Other

provider_loader.py
Add provider loader for issue tracking
py/packages/corpora_pm/providers/provider_loader.py
  • Added function to load issue tracker provider based on corpus URL (sketched at the end of this comment).
  • Supports GitHub issue tracker.
  +17/-0
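For context, a minimal sketch of what the `get_relevant_splits` addition might look like. Only `CosineDistance`, `load_llm_provider`, and the method name come from the walkthrough above; the model fields, the embedding method name, and the queryset shape are assumptions.

```python
# Hedged sketch, not the PR's actual code: field names, vector dimensions, and
# the embedding call are assumptions; CosineDistance and load_llm_provider are
# confirmed by the walkthrough above.
from django.db import models
from pgvector.django import CosineDistance, VectorField

from corpora_ai.provider_loader import load_llm_provider  # module path assumed


class Split(models.Model):
    # In the real project splits belong to a file within a corpus; a direct
    # foreign key keeps this sketch self-contained.
    corpus = models.ForeignKey("Corpus", on_delete=models.CASCADE, related_name="splits")
    content = models.TextField()
    vector = VectorField(dimensions=1536, null=True)  # dimensions assumed


class Corpus(models.Model):
    name = models.CharField(max_length=255)

    def get_relevant_splits(self, text: str, limit: int = 10):
        """Return the splits nearest to `text` by cosine distance."""
        llm = load_llm_provider()
        embedding = llm.get_embedding(text)  # method name assumed
        return (
            self.splits.annotate(similarity=CosineDistance("vector", embedding))
            .order_by("similarity")[:limit]
        )
```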

💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information
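Similarly, a hedged sketch of the provider_loader.py idea: pick an issue-tracker client based on the corpus URL. Only the purpose and GitHub support come from this PR; the class and function names here are hypothetical.

```python
# Hypothetical sketch of a URL-based provider loader; only GitHub support is
# described in the PR, and these names are illustrative.
from urllib.parse import urlparse


class GitHubIssueTracker:
    """Stand-in for the real GitHub provider in corpora_pm."""

    def __init__(self, repo_url: str):
        self.repo_url = repo_url


def load_issue_tracker(corpus_url: str):
    """Return an issue-tracker provider matching the corpus URL."""
    host = urlparse(corpus_url).netloc
    if host == "github.com":
        return GitHubIssueTracker(corpus_url)
    raise ValueError(f"No issue tracker provider for host: {host!r}")
```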

github-actions[bot] commented 1 week ago

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Code Smell
The `get_relevant_splits` method in the `Corpus` model directly imports and uses `load_llm_provider` within the method. Consider moving the import to the top of the file for better readability and maintainability.

Code Smell
The `get_issue` function in the `plan` router uses an async call to fetch the `Corpus` object but does not await the `get_data_completion` method. Ensure that all async operations are properly awaited.

Error Handling
The `get_data_completion` method in `OpenAIClient` raises a generic `RuntimeError` for both `OpenAIError` and `json.JSONDecodeError`. Consider using more specific exception handling to provide clearer error messages.
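On the Error Handling point, a hedged sketch of how the two failure modes could be separated. The helper name and call shape are assumptions; only the exception types and the function-call parsing come from this PR.

```python
import json

from openai import OpenAIError


def _complete_with_function_call(client, **request_kwargs):
    # Hedged sketch: separate API failures from malformed model output, as the
    # review suggests; this is not the PR's actual get_data_completion body.
    try:
        response = client.chat.completions.create(**request_kwargs)
    except OpenAIError as exc:
        # API-level failure: auth, rate limit, network, etc.
        raise RuntimeError(f"OpenAI request failed: {exc}") from exc

    raw_args = response.choices[0].message.function_call.arguments
    try:
        return json.loads(raw_args)
    except json.JSONDecodeError as exc:
        # The model returned arguments that are not valid JSON.
        raise RuntimeError(f"Malformed function-call arguments: {exc}") from exc
```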
github-actions[bot] commented 1 week ago

PR Code Suggestions ✨

Explore these optional code suggestions:

Possible bug

Implement error handling for cases where the corpus is not found in the database
Add error handling for the case where Corpus.objects.aget(id=corpus_id) does not find a matching corpus to prevent potential runtime errors.
[py/packages/corpora/routers/plan.py [45-46]](https://github.com/skyl/corpora/pull/20/files#diff-d17ee5521ab86acd2823afbc993ced9ea8afb2f074db451ff29c56b83f4e33aaR45-R46)

```diff
-corpus = await Corpus.objects.aget(id=corpus_id)
+try:
+    corpus = await Corpus.objects.aget(id=corpus_id)
+except Corpus.DoesNotExist:
+    raise ValueError("Corpus not found")
 splits = corpus.get_relevant_splits(text)
```

Suggestion importance[1-10]: 8
Why: Adding error handling for when a corpus is not found prevents potential runtime errors and improves the robustness of the code. This is a significant improvement as it addresses a possible bug.

Add a check for missing function arguments before parsing them to prevent runtime errors
Ensure that the function_args variable is checked for None or empty values before attempting to parse it with json.loads to prevent potential runtime errors.
[py/packages/corpora_ai_openai/llm_client.py [75-76]](https://github.com/skyl/corpora/pull/20/files#diff-de8da8414122015059375c328610dcc9a2d9550504ca03bfaa97ec5eed468407R75-R76)

```diff
 function_args = response.choices[0].message.function_call.arguments
+if not function_args:
+    raise RuntimeError("Function arguments are missing")
 data_dict = json.loads(function_args)
```

Suggestion importance[1-10]: 8
Why: The suggestion adds a necessary check for missing function arguments, which prevents runtime errors and enhances the reliability of the code. This is an important improvement as it addresses a potential bug.

Performance

Optimize string concatenation by using a list comprehension and joining the strings at the end
Consider using a more efficient method to concatenate strings, such as using a list and joining it at the end, to improve performance when building split_context.
[py/packages/corpora/models.py [54-56]](https://github.com/skyl/corpora/pull/20/files#diff-c9de5374aa987d761f770eb99739e686761246600fc6cc0155902d79c9aa3ea5R54-R56)

````diff
-split_context = ""
-for split in splits:
-    split_context += f"{split.file.path}\n```\n{split.content}\n```"
+split_context = "\n".join(f"{split.file.path}\n```\n{split.content}\n```" for split in splits)
````

Suggestion importance[1-10]: 7
Why: The suggestion improves performance by using a more efficient method for string concatenation, which is beneficial when dealing with potentially large datasets. This change enhances the code's efficiency without altering its functionality.
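Taken together, a hedged sketch of what `get_issue` might look like with these suggestions applied. Django Ninja style routing, the HTTP verb, and the IssueSchema fields are assumptions; Corpus.objects.aget, get_relevant_splits, IssueSchema, and plan_router come from this PR.

```python
# Sketch only: routing style and field names are assumptions; the names
# Corpus, get_relevant_splits, IssueSchema, and plan_router come from the PR.
from ninja import Router, Schema

from corpora.models import Corpus  # module path assumed from the walkthrough


class IssueSchema(Schema):
    # Illustrative fields; the PR defines the real IssueSchema in plan.py.
    title: str
    body: str


plan_router = Router()


@plan_router.post("/issue", response=IssueSchema)
async def get_issue(request, corpus_id: str, text: str):
    # Suggestion 1: fail clearly when the corpus does not exist.
    try:
        corpus = await Corpus.objects.aget(id=corpus_id)
    except Corpus.DoesNotExist:
        raise ValueError("Corpus not found")

    # Suggestion 3: build the context string with a single join
    # (the fence markers from the suggested diff are omitted here).
    splits = corpus.get_relevant_splits(text)
    split_context = "\n".join(
        f"{split.file.path}\n{split.content}" for split in splits
    )

    # The PR then asks the LLM for a structured issue via get_data_completion
    # (sketched under the Ask answer below); return a placeholder here.
    return IssueSchema(title="(generated issue title)", body=split_context)
```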
skyl commented 1 week ago

Next: interface to chat/refine issue and then POST to GitHub.

skyl commented 1 week ago

/ask "get_data_completion is freaking hot - complete a conversation by returning a defined pydantic BaseModel schema - can you explain how this is so cool?"

github-actions[bot] commented 1 week ago

Ask❓

get_data_completion is freaking hot - complete a conversation by returning a defined pydantic basemodel schema - can you explain how this is so cool?

Answer:

The `get_data_completion` method introduced in this PR is indeed a notable enhancement. Here's why it's particularly interesting:

1. Structured Output: Unlike traditional text generation methods that return unstructured text, `get_data_completion` returns data that conforms to a specified Pydantic `BaseModel` schema. This ensures that the output is not only syntactically correct but also semantically structured according to predefined rules. This is crucial for applications where the output needs to be further processed or integrated into other systems.

2. Validation and Type Safety: By using Pydantic models, the method leverages Pydantic's validation capabilities. This means that the generated data is automatically validated against the schema, ensuring that it meets all the specified constraints (e.g., types, formats). This reduces the risk of runtime errors due to unexpected data formats.

3. Integration with OpenAI's Function Calling: The method uses OpenAI's function calling feature, which allows the model to generate structured data directly. This is a powerful feature because it combines the flexibility of language models with the precision of structured data generation, making it suitable for tasks like generating API responses, filling out forms, or creating configuration files.

4. Error Handling: The implementation includes error handling for both OpenAI API errors and JSON parsing errors. This makes the method robust and reliable, providing clear feedback in case of failures.

5. Versatility: The method can be used in various contexts where structured data is needed from conversational input. For example, it can be used to generate issue reports, fill out database entries, or even automate parts of a workflow that require specific data formats.

Overall, `get_data_completion` enhances the capability of the system to produce reliable, structured outputs from conversational inputs, bridging the gap between natural language processing and structured data requirements.
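To make the pattern concrete, here is a hedged end-to-end sketch. The generic type T, the Pydantic validation, and the function-calling flow are described in this PR; the class names, method signature, model id, and field names below are assumptions, not the PR's actual code.

```python
import json
from abc import ABC, abstractmethod
from typing import Type, TypeVar

from openai import OpenAI, OpenAIError
from pydantic import BaseModel

T = TypeVar("T", bound=BaseModel)


class LLMBaseInterfaceSketch(ABC):
    # Stand-in for the abstract interface in llm_interface.py; name assumed.
    @abstractmethod
    def get_data_completion(self, messages: list[dict], schema: Type[T]) -> T:
        """Complete a conversation and return a validated instance of `schema`."""


class OpenAIClientSketch(LLMBaseInterfaceSketch):
    def __init__(self, client: OpenAI | None = None):
        self.client = client or OpenAI()

    def get_data_completion(self, messages: list[dict], schema: Type[T]) -> T:
        # Expose the Pydantic model's JSON schema as a callable function so the
        # model must answer with structured arguments rather than free text.
        function_spec = {
            "name": "generate_data",
            "parameters": schema.model_json_schema(),
        }
        try:
            response = self.client.chat.completions.create(
                model="gpt-4o",  # model id is an assumption
                messages=messages,
                functions=[function_spec],
                function_call={"name": "generate_data"},
            )
        except OpenAIError as exc:
            raise RuntimeError(f"OpenAI request failed: {exc}") from exc

        function_args = response.choices[0].message.function_call.arguments
        data_dict = json.loads(function_args)
        # Pydantic enforces types and constraints before anything is returned.
        return schema(**data_dict)


class IssueSchema(BaseModel):
    # Illustrative fields; the PR defines the real IssueSchema in plan.py.
    title: str
    body: str


# Usage: the conversation comes back as a validated IssueSchema, not raw text.
# issue = OpenAIClientSketch().get_data_completion(
#     messages=[{"role": "user", "content": "Draft an issue about flaky tests."}],
#     schema=IssueSchema,
# )
```

The commented-out usage at the bottom is the point of the pattern: the caller hands over a Pydantic class and gets back a validated instance of it rather than free-form text.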