skyl / corpora

Corpora is a self-building corpus that can help build other arbitrary corpora
GNU Affero General Public License v3.0

feat(plan): work on issue example get_data_completion #20

Closed · skyl closed this 1 week ago

skyl commented 1 week ago

PR Type

enhancement, other


Description


Changes walkthrough 📝

Relevant files

Enhancement

models.py
Add method to retrieve relevant text splits using embeddings
py/packages/corpora/models.py
  • Added get_relevant_splits method to the Corpus model (sketched below this table).
  • Utilizes CosineDistance for similarity calculation.
  • Integrates with load_llm_provider for embedding generation.
  +19/-1

router.py
Integrate plan router into main API router
py/packages/corpora/router.py
  • Added `plan_router` to the API router.
  +2/-0

plan.py
Add plan router with issue creation endpoint
py/packages/corpora/routers/plan.py
  • Introduced new plan router for handling issue creation.
  • Defined IssueSchema for issue data structure.
  • Implemented get_issue endpoint to generate prospective issues.
  +66/-0

llm_interface.py
Extend LLM interface with data completion method
py/packages/corpora_ai/llm_interface.py
  • Added get_data_completion abstract method to LLM interface.
  • Introduced generic type T for schema validation.
  +20/-1

llm_client.py
Implement data completion in OpenAI client
py/packages/corpora_ai_openai/llm_client.py
  • Implemented get_data_completion method in OpenAIClient.
  • Utilizes OpenAI's function calling for structured data generation.
  • Validates output with Pydantic models.
  +57/-2

Other

provider_loader.py
Add provider loader for issue tracking
py/packages/corpora_pm/providers/provider_loader.py
  • Added function to load issue tracker provider based on corpus URL (sketched at the end of this comment).
  • Supports GitHub issue tracker.
  +17/-0
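For context, a minimal sketch of what the `get_relevant_splits` addition might look like. Only `CosineDistance`, `load_llm_provider`, and the method name come from the walkthrough above; the model fields, the embedding method name, and the queryset shape are assumptions.

```python
# Hedged sketch, not the PR's actual code: field names, vector dimensions, and
# the embedding call are assumptions; CosineDistance and load_llm_provider are
# confirmed by the walkthrough above.
from django.db import models
from pgvector.django import CosineDistance, VectorField

from corpora_ai.provider_loader import load_llm_provider  # module path assumed


class Split(models.Model):
    # In the real project splits belong to a file within a corpus; a direct
    # foreign key keeps this sketch self-contained.
    corpus = models.ForeignKey("Corpus", on_delete=models.CASCADE, related_name="splits")
    content = models.TextField()
    vector = VectorField(dimensions=1536, null=True)  # dimensions assumed


class Corpus(models.Model):
    name = models.CharField(max_length=255)

    def get_relevant_splits(self, text: str, limit: int = 10):
        """Return the splits nearest to `text` by cosine distance."""
        llm = load_llm_provider()
        embedding = llm.get_embedding(text)  # method name assumed
        return (
            self.splits.annotate(similarity=CosineDistance("vector", embedding))
            .order_by("similarity")[:limit]
        )
```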

💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information
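Similarly, a hedged sketch of the provider_loader.py idea: pick an issue-tracker client based on the corpus URL. Only the purpose and GitHub support come from this PR; the class and function names here are hypothetical.

```python
# Hypothetical sketch of a URL-based provider loader; only GitHub support is
# described in the PR, and these names are illustrative.
from urllib.parse import urlparse


class GitHubIssueTracker:
    """Stand-in for the real GitHub provider in corpora_pm."""

    def __init__(self, repo_url: str):
        self.repo_url = repo_url


def load_issue_tracker(corpus_url: str):
    """Return an issue-tracker provider matching the corpus URL."""
    host = urlparse(corpus_url).netloc
    if host == "github.com":
        return GitHubIssueTracker(corpus_url)
    raise ValueError(f"No issue tracker provider for host: {host!r}")
```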

github-actions[bot] commented 1 week ago

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Code Smell
The `get_relevant_splits` method in the `Corpus` model directly imports and uses `load_llm_provider` within the method. Consider moving the import to the top of the file for better readability and maintainability.

Code Smell
The `get_issue` function in the `plan` router uses an async call to fetch the `Corpus` object but does not await the `get_data_completion` method. Ensure that all async operations are properly awaited.

Error Handling
The `get_data_completion` method in `OpenAIClient` raises a generic `RuntimeError` for both `OpenAIError` and `json.JSONDecodeError`. Consider using more specific exception handling to provide clearer error messages.
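On the Error Handling point, a hedged sketch of how the two failure modes could be separated. The helper name and call shape are assumptions; only the exception types and the function-call parsing come from this PR.

```python
import json

from openai import OpenAIError


def _complete_with_function_call(client, **request_kwargs):
    # Hedged sketch: separate API failures from malformed model output, as the
    # review suggests; this is not the PR's actual get_data_completion body.
    try:
        response = client.chat.completions.create(**request_kwargs)
    except OpenAIError as exc:
        # API-level failure: auth, rate limit, network, etc.
        raise RuntimeError(f"OpenAI request failed: {exc}") from exc

    raw_args = response.choices[0].message.function_call.arguments
    try:
        return json.loads(raw_args)
    except json.JSONDecodeError as exc:
        # The model returned arguments that are not valid JSON.
        raise RuntimeError(f"Malformed function-call arguments: {exc}") from exc
```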
github-actions[bot] commented 1 week ago

PR Code Suggestions ✨

Explore these optional code suggestions:

Possible bug

Implement error handling for cases where the corpus is not found in the database
Add error handling for the case where Corpus.objects.aget(id=corpus_id) does not find a matching corpus to prevent potential runtime errors.
[py/packages/corpora/routers/plan.py [45-46]](https://github.com/skyl/corpora/pull/20/files#diff-d17ee5521ab86acd2823afbc993ced9ea8afb2f074db451ff29c56b83f4e33aaR45-R46)

```diff
-corpus = await Corpus.objects.aget(id=corpus_id)
+try:
+    corpus = await Corpus.objects.aget(id=corpus_id)
+except Corpus.DoesNotExist:
+    raise ValueError("Corpus not found")
 splits = corpus.get_relevant_splits(text)
```

Suggestion importance[1-10]: 8
Why: Adding error handling for when a corpus is not found prevents potential runtime errors and improves the robustness of the code. This is a significant improvement as it addresses a possible bug.

Add a check for missing function arguments before parsing them to prevent runtime errors
Ensure that the function_args variable is checked for None or empty values before attempting to parse it with json.loads to prevent potential runtime errors.
[py/packages/corpora_ai_openai/llm_client.py [75-76]](https://github.com/skyl/corpora/pull/20/files#diff-de8da8414122015059375c328610dcc9a2d9550504ca03bfaa97ec5eed468407R75-R76)

```diff
 function_args = response.choices[0].message.function_call.arguments
+if not function_args:
+    raise RuntimeError("Function arguments are missing")
 data_dict = json.loads(function_args)
```

Suggestion importance[1-10]: 8
Why: The suggestion adds a necessary check for missing function arguments, which prevents runtime errors and enhances the reliability of the code. This is an important improvement as it addresses a potential bug.

Performance

Optimize string concatenation by using a list comprehension and joining the strings at the end
Consider using a more efficient method to concatenate strings, such as using a list and joining it at the end, to improve performance when building split_context.
[py/packages/corpora/models.py [54-56]](https://github.com/skyl/corpora/pull/20/files#diff-c9de5374aa987d761f770eb99739e686761246600fc6cc0155902d79c9aa3ea5R54-R56)

````diff
-split_context = ""
-for split in splits:
-    split_context += f"{split.file.path}\n```\n{split.content}\n```"
+split_context = "\n".join(f"{split.file.path}\n```\n{split.content}\n```" for split in splits)
````

Suggestion importance[1-10]: 7
Why: The suggestion improves performance by using a more efficient method for string concatenation, which is beneficial when dealing with potentially large datasets. This change enhances the code's efficiency without altering its functionality.
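Taken together, a hedged sketch of what `get_issue` might look like with these suggestions applied. Django Ninja style routing, the HTTP verb, and the IssueSchema fields are assumptions; Corpus.objects.aget, get_relevant_splits, IssueSchema, and plan_router come from this PR.

```python
# Sketch only: routing style and field names are assumptions; the names
# Corpus, get_relevant_splits, IssueSchema, and plan_router come from the PR.
from ninja import Router, Schema

from corpora.models import Corpus  # module path assumed from the walkthrough


class IssueSchema(Schema):
    # Illustrative fields; the PR defines the real IssueSchema in plan.py.
    title: str
    body: str


plan_router = Router()


@plan_router.post("/issue", response=IssueSchema)
async def get_issue(request, corpus_id: str, text: str):
    # Suggestion 1: fail clearly when the corpus does not exist.
    try:
        corpus = await Corpus.objects.aget(id=corpus_id)
    except Corpus.DoesNotExist:
        raise ValueError("Corpus not found")

    # Suggestion 3: build the context string with a single join
    # (the fence markers from the suggested diff are omitted here).
    splits = corpus.get_relevant_splits(text)
    split_context = "\n".join(
        f"{split.file.path}\n{split.content}" for split in splits
    )

    # The PR then asks the LLM for a structured issue via get_data_completion
    # (sketched under the Ask answer below); return a placeholder here.
    return IssueSchema(title="(generated issue title)", body=split_context)
```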
skyl commented 1 week ago

Next: interface to chat/refine issue and then POST to GitHub.

skyl commented 1 week ago

/ask "get_data_completion is freaking hot - complete a conversation by returning a defined pydantic BaseModel schema - can you explain how this is so cool?"

github-actions[bot] commented 1 week ago

Ask❓

get_data_completion is freaking hot - complete a conversation by returning a defined pydantic basemodel schema - can you explain how this is so cool?

Answer:

The `get_data_completion` method introduced in this PR is indeed a notable enhancement. Here's why it's particularly interesting:

1. Structured Output: Unlike traditional text generation methods that return unstructured text, `get_data_completion` returns data that conforms to a specified Pydantic `BaseModel` schema. This ensures that the output is not only syntactically correct but also semantically structured according to predefined rules. This is crucial for applications where the output needs to be further processed or integrated into other systems.

2. Validation and Type Safety: By using Pydantic models, the method leverages Pydantic's validation capabilities. This means that the generated data is automatically validated against the schema, ensuring that it meets all the specified constraints (e.g., types, formats). This reduces the risk of runtime errors due to unexpected data formats.

3. Integration with OpenAI's Function Calling: The method uses OpenAI's function calling feature, which allows the model to generate structured data directly. This is a powerful feature because it combines the flexibility of language models with the precision of structured data generation, making it suitable for tasks like generating API responses, filling out forms, or creating configuration files.

4. Error Handling: The implementation includes error handling for both OpenAI API errors and JSON parsing errors. This makes the method robust and reliable, providing clear feedback in case of failures.

5. Versatility: The method can be used in various contexts where structured data is needed from conversational input. For example, it can be used to generate issue reports, fill out database entries, or even automate parts of a workflow that require specific data formats.

Overall, `get_data_completion` enhances the capability of the system to produce reliable, structured outputs from conversational inputs, bridging the gap between natural language processing and structured data requirements.
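To make the pattern concrete, here is a hedged end-to-end sketch. The generic type T, the Pydantic validation, and the function-calling flow are described in this PR; the class names, method signature, model id, and field names below are assumptions, not the PR's actual code.

```python
import json
from abc import ABC, abstractmethod
from typing import Type, TypeVar

from openai import OpenAI, OpenAIError
from pydantic import BaseModel

T = TypeVar("T", bound=BaseModel)


class LLMBaseInterfaceSketch(ABC):
    # Stand-in for the abstract interface in llm_interface.py; name assumed.
    @abstractmethod
    def get_data_completion(self, messages: list[dict], schema: Type[T]) -> T:
        """Complete a conversation and return a validated instance of `schema`."""


class OpenAIClientSketch(LLMBaseInterfaceSketch):
    def __init__(self, client: OpenAI | None = None):
        self.client = client or OpenAI()

    def get_data_completion(self, messages: list[dict], schema: Type[T]) -> T:
        # Expose the Pydantic model's JSON schema as a callable function so the
        # model must answer with structured arguments rather than free text.
        function_spec = {
            "name": "generate_data",
            "parameters": schema.model_json_schema(),
        }
        try:
            response = self.client.chat.completions.create(
                model="gpt-4o",  # model id is an assumption
                messages=messages,
                functions=[function_spec],
                function_call={"name": "generate_data"},
            )
        except OpenAIError as exc:
            raise RuntimeError(f"OpenAI request failed: {exc}") from exc

        function_args = response.choices[0].message.function_call.arguments
        data_dict = json.loads(function_args)
        # Pydantic enforces types and constraints before anything is returned.
        return schema(**data_dict)


class IssueSchema(BaseModel):
    # Illustrative fields; the PR defines the real IssueSchema in plan.py.
    title: str
    body: str


# Usage: the conversation comes back as a validated IssueSchema, not raw text.
# issue = OpenAIClientSketch().get_data_completion(
#     messages=[{"role": "user", "content": "Draft an issue about flaky tests."}],
#     schema=IssueSchema,
# )
```

The commented-out usage at the bottom is the point of the pattern: the caller hands over a Pydantic class and gets back a validated instance of it rather than free-form text.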