skyl / corpora

Corpora is a self-building corpus that can help build other arbitrary corpora
GNU Affero General Public License v3.0

feat(CLI): search splits #19

Closed skyl closed 1 week ago

skyl commented 1 week ago

PR Type

Enhancement, Tests


Description


Changes walkthrough 📝

Relevant files

Enhancement (4 files)

corpustextfile.py: Add endpoint to retrieve files by path in corpus
py/packages/corpora/routers/corpustextfile.py
  • Added a new endpoint to retrieve files by path within a corpus.
  • Introduced query parameter handling for file path retrieval.
  • +27/-9

split.py: Implement CLI commands for split operations
py/packages/corpora_cli/commands/split.py
  • Implemented CLI commands for searching and listing splits.
  • Added error handling for missing arguments and invalid limits.
  • +42/-0

main.py: Register split commands in CLI
py/packages/corpora_cli/main.py
  • Registered new split commands in the CLI application.
  • +2/-1

file_api.py: Add API client methods for file retrieval by path
py/packages/corpora_client/api/file_api.py
  • Added methods to retrieve files by path in the API client.
  • Included serialization and deserialization logic for new endpoint.
  • +261/-1

Tests (3 files)

test_corpustextfile.py: Add tests for file retrieval by path
py/packages/corpora/routers/test_corpustextfile.py
  • Added tests for retrieving files by path.
  • Tested scenarios for non-existent files and missing query parameters.
  • +45/-0

test_split.py: Add tests for CLI split commands
py/packages/corpora_cli/commands/test_split.py
  • Added tests for CLI split search and list commands.
  • Included tests for invalid input scenarios.
  • +139/-0

test_file_api.py: Add test case for file retrieval by path
py/packages/corpora_client/test/test_file_api.py
  • Added a placeholder test case for file retrieval by path.
  • +7/-0

Configuration changes (1 file)

.corpora.yaml: Update project metadata in configuration
.corpora.yaml
  • Updated project ID and name.
  • +3/-4

Documentation (2 files)

README.md: Document new API method for file retrieval by path
py/packages/corpora_client/README.md
  • Documented the new API method for file retrieval by path.
  • +1/-0

FileApi.md: Add documentation for get_file_by_path method
py/packages/corpora_client/docs/FileApi.md
  • Added documentation for the `get_file_by_path` API method.
  • +81/-0
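
The `split.py` entry above mentions error handling for missing arguments and invalid limits. The following is only a rough sketch of what such validation could look like, written with the standard library alone. The function name `validate_search_args`, the default limit, and the upper bound of 100 are hypothetical and are not taken from the PR.

```python
from __future__ import annotations


def validate_search_args(
    text: str | None, limit: int = 10, max_limit: int = 100
) -> tuple[str, int]:
    """Return the cleaned (text, limit) pair or raise ValueError.

    All names and bounds here are illustrative assumptions.
    """
    # Treat None, empty, and whitespace-only input as a missing argument.
    cleaned = text.strip() if text else ""
    if not cleaned:
        raise ValueError("Search text is required.")
    # Reject zero, negative, and oversized limits before calling the API.
    if not 1 <= limit <= max_limit:
        raise ValueError(f"Limit must be between 1 and {max_limit}, got {limit}.")
    return cleaned, limit
```

A CLI command could call a helper like this first and print the `ValueError` message instead of sending a malformed request.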


github-actions[bot] commented 1 week ago

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Code Smell
The `get_file_by_path` function does not handle the case where the file is not found, which might lead to unhandled exceptions.

Code Smell
The `search` command has commented-out code that should be removed or properly implemented.

Code Smell
The `get_file_by_path` methods have a lot of repeated code. Consider refactoring to reduce duplication.

github-actions[bot] commented 1 week ago

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Score
Best practice

Implement error handling for file retrieval to manage potential exceptions

**Add error handling for the `c.file_api.get_file(split.file_id)` call to manage potential exceptions and ensure robustness.**

[py/packages/corpora_cli/commands/split.py [29-30]](https://github.com/skyl/corpora/pull/19/files#diff-05454f1a232d5d3506ccbd1c61a8074269ec8cc48e9d7f8f2f8639f1787708a0R29-R30)

```diff
-phial = c.file_api.get_file(split.file_id)
-c.console.print(f"File: {phial.path}")
+try:
+    phial = c.file_api.get_file(split.file_id)
+    c.console.print(f"File: {phial.path}")
+except Exception as e:
+    c.console.print(f"Error retrieving file: {e}", style="bold red")
```

Suggestion importance [1-10]: 8

Why: Adding error handling for the file retrieval process is crucial for managing exceptions that may occur, such as network issues or invalid file IDs. This suggestion significantly improves the robustness and user feedback of the application.

Score: 8

Add error handling for database retrieval to manage potential exceptions

**Consider adding error handling for the `CorpusTextFile.objects.select_related("corpus").aget` call to manage potential exceptions like database access issues.**

[py/packages/corpora/routers/corpustextfile.py [50-53]](https://github.com/skyl/corpora/pull/19/files#diff-76b598a80f83577973d8fd235d1c9bde05a20e944ac96b5e438fc62f28338146R50-R53)

```diff
-ctf = await CorpusTextFile.objects.select_related("corpus").aget(
-    corpus__id=corpus_id, path=path
-)
-return ctf
+try:
+    ctf = await CorpusTextFile.objects.select_related("corpus").aget(
+        corpus__id=corpus_id, path=path
+    )
+    return ctf
+except Exception as e:
+    raise HttpError(500, f"Error retrieving file: {e}")
```

Suggestion importance [1-10]: 8

Why: The suggestion to add error handling for the database retrieval operation is important for managing exceptions like database access issues. This enhances the application's reliability by providing meaningful error messages to the user in case of failures.

Score: 8
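
The suggestion above maps every failure to a 500, while the reviewer guide flags the not-found case separately. As a hedged illustration only, the sketch below separates the two outcomes using an in-memory dictionary. `FileNotFound`, `HttpFailure`, `lookup`, and the `store` argument are hypothetical stand-ins for the ORM's not-found error, the `HttpError` used in the suggestion, and the database query. None of this is the PR's actual code.

```python
class FileNotFound(Exception):
    """Hypothetical stand-in for the ORM's 'row does not exist' error."""


class HttpFailure(Exception):
    """Hypothetical stand-in for an HTTP error carrying a status code."""

    def __init__(self, status: int, message: str) -> None:
        super().__init__(message)
        self.status = status
        self.message = message


def lookup(store, corpus_id: str, path: str) -> str:
    """Fetch file content keyed by (corpus_id, path) from an in-memory store."""
    try:
        return store[(corpus_id, path)]
    except KeyError:
        raise FileNotFound(path) from None


def get_file_by_path(store, corpus_id: str, path: str) -> str:
    """Map a missing file to 404 and any other failure to 500."""
    try:
        return lookup(store, corpus_id, path)
    except FileNotFound:
        # The client asked for something that is not there: a 404, not a crash.
        raise HttpFailure(404, f"No file at {path!r} in corpus {corpus_id}") from None
    except Exception as exc:
        # Anything else is unexpected; keep the cause chained for server logs.
        raise HttpFailure(500, "Error retrieving file") from exc
```

Catching the specific not-found error first keeps a routine miss from being reported as a server fault, and the 500 message avoids echoing internal exception text back to the caller.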

Possible issue

Handle the case where the search result is empty to prevent iteration issues

**Consider handling the case where `c.split_api.vector_search(query)` returns an empty list to avoid potential issues when iterating over `res`.**

[py/packages/corpora_cli/commands/split.py [26-31]](https://github.com/skyl/corpora/pull/19/files#diff-05454f1a232d5d3506ccbd1c61a8074269ec8cc48e9d7f8f2f8639f1787708a0R26-R31)

```diff
 res = c.split_api.vector_search(query)
+if not res:
+    c.console.print("No splits found.")
 for split in res:
     phial = c.file_api.get_file(split.file_id)
     c.console.print(f"File: {phial.path}")
     c.console.print(f"{split.order} {split.content[:100]}", style="dim")
```

Suggestion importance [1-10]: 7

Why: The suggestion to handle an empty result set from `c.split_api.vector_search(query)` is valuable as it prevents potential runtime errors when iterating over an empty list. This improves the robustness of the code by providing user feedback when no results are found.

Score: 7
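
Iterating over an empty list is harmless in Python, so the practical gain from this suggestion is the user-facing message. A small sketch of the same idea with an early return follows. The `render_results` helper, its `(order, content)` tuples, and the injectable `echo` callback are hypothetical simplifications of the CLI's console and split objects.

```python
def render_results(results, echo=print) -> int:
    """Print one line per (order, content) pair and return the count shown.

    `results` and `echo` are illustrative assumptions, not the CLI's real types.
    """
    if not results:
        # Nothing matched: say so and stop instead of printing nothing at all.
        echo("No splits found.")
        return 0
    for order, content in results:
        # Truncate long content to 100 characters, as in the suggestion's diff.
        echo(f"{order} {content[:100]}")
    return len(results)
```

Passing `echo` in makes the helper easy to check in a test without capturing standard output.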

Possible bug

Validate the corpus ID from the configuration to ensure it is valid before using it in the search query

**Ensure that the `corpus_id` used in `c.config["id"]` is valid and exists to prevent potential errors during the search operation.**

[py/packages/corpora_cli/commands/split.py [20-24]](https://github.com/skyl/corpora/pull/19/files#diff-05454f1a232d5d3506ccbd1c61a8074269ec8cc48e9d7f8f2f8639f1787708a0R20-R24)

```diff
+if "id" not in c.config or not c.config["id"]:
+    raise ValueError("Invalid or missing corpus ID in configuration.")
 query = SplitVectorSearchSchema(
     corpus_id=c.config["id"],
     text=text,
     limit=limit,
 )
```

Suggestion importance [1-10]: 6

Why: Validating the `corpus_id` ensures that the search operation does not proceed with an invalid or missing ID, which could lead to errors. This suggestion enhances the reliability of the code by preventing potential misconfigurations.

Score: 6
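
The check in this suggestion could also be pulled into a small helper so that every command validates the configuration the same way. The sketch below assumes the corpus ID is stored as a string under the `"id"` key; the helper name `require_corpus_id` is hypothetical.

```python
def require_corpus_id(config: dict) -> str:
    """Return the corpus ID from a config mapping or raise ValueError.

    Assumes the ID is a non-empty string under the "id" key.
    """
    corpus_id = config.get("id")
    # Reject a missing key, None, non-string values, and blank strings alike.
    if not isinstance(corpus_id, str) or not corpus_id.strip():
        raise ValueError("Invalid or missing corpus ID in configuration.")
    return corpus_id.strip()
```

This only confirms that an ID is present and well-formed locally. Whether the corpus actually exists still has to be answered by the server.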

skyl commented 1 week ago

/describe

github-actions[bot] commented 1 week ago

PR Description updated to latest commit (https://github.com/skyl/corpora/commit/dc9247ce135c17a822e02606310437c470f4275d)