feat(CLI): search splits

skyl commented 1 week ago

PR Type

Enhancement, Tests

Description

Added a new API endpoint to retrieve files by path within a corpus, enhancing the file retrieval capabilities.
Implemented CLI commands for searching and listing splits, providing users with more flexible data access.
Added tests for the new API endpoint and CLI commands, covering non-existent files, missing query parameters, and invalid CLI input.
Updated project configuration and documentation to reflect the new features and changes.

Changes walkthrough 📝

Relevant files

Enhancement

4 files

corpustextfile.py `Add endpoint to retrieve files by path in corpus` py/packages/corpora/routers/corpustextfile.py Added a new endpoint to retrieve files by path within a corpus. Introduced query parameter handling for file path retrieval.	+27/-9
split.py `Implement CLI commands for split operations` py/packages/corpora_cli/commands/split.py Implemented CLI commands for searching and listing splits. Added error handling for missing arguments and invalid limits.	+42/-0
main.py `Register split commands in CLI` py/packages/corpora_cli/main.py - Registered new split commands in the CLI application.	+2/-1
file_api.py `Add API client methods for file retrieval by path` py/packages/corpora_client/api/file_api.py Added methods to retrieve files by path in the API client. Included serialization and deserialization logic for new endpoint.	+261/-1

Tests

3 files

test_corpustextfile.py `Add tests for file retrieval by path` py/packages/corpora/routers/test_corpustextfile.py Added tests for retrieving files by path. Tested scenarios for non-existent files and missing query parameters.	+45/-0
test_split.py `Add tests for CLI split commands` py/packages/corpora_cli/commands/test_split.py Added tests for CLI split search and list commands. Included tests for invalid input scenarios.	+139/-0
test_file_api.py `Add test case for file retrieval by path` py/packages/corpora_client/test/test_file_api.py - Added a placeholder test case for file retrieval by path.	+7/-0

Configuration changes

1 file

.corpora.yaml `Update project metadata in configuration` .corpora.yaml - Updated project ID and name.	+3/-4

Documentation

2 files

README.md `Document new API method for file retrieval by path` py/packages/corpora_client/README.md - Documented the new API method for file retrieval by path.	+1/-0
FileApi.md `Add documentation for get_file_by_path method` py/packages/corpora_client/docs/FileApi.md - Added documentation for the `get_file_by_path` API method.	+81/-0

github-actions[bot] commented 1 week ago

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪

🧪 PR contains tests

🔒 No security concerns identified

⚡ Recommended focus areas for review

Code Smell
The `get_file_by_path` function does not handle the case where the file is not found, which might lead to unhandled exceptions.

Code Smell
The `search` command has commented-out code that should be removed or properly implemented.

Code Smell
The `get_file_by_path` methods have a lot of repeated code. Consider refactoring to reduce duplication.
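As an illustration of the first focus area, here is a minimal sketch of translating only the not-found case into a 404 while letting other failures propagate. Everything in it is an assumption made for the example: `FileNotFound`, the local `HttpError` class, the `TextFile` record, and the in-memory `FILES` table are hypothetical stand-ins so the snippet runs on its own; they are not the project's models or the framework's real classes.

```python
from dataclasses import dataclass


class FileNotFound(Exception):
    """Hypothetical stand-in for a model's not-found exception."""


class HttpError(Exception):
    """Hypothetical stand-in for the web framework's HTTP error type."""

    def __init__(self, status_code: int, message: str) -> None:
        super().__init__(message)
        self.status_code = status_code
        self.message = message


@dataclass(frozen=True)
class TextFile:
    corpus_id: str
    path: str
    content: str


# In-memory stand-in for the database, keyed by (corpus_id, path).
FILES = {
    ("corpus-1", "README.md"): TextFile("corpus-1", "README.md", "# Hello"),
}


def lookup(corpus_id: str, path: str) -> TextFile:
    """Return the stored file or raise FileNotFound."""
    try:
        return FILES[(corpus_id, path)]
    except KeyError:
        raise FileNotFound(f"{path} not found in corpus {corpus_id}") from None


def get_file_by_path(corpus_id: str, path: str) -> TextFile:
    """Map only the not-found case to a 404; other errors are not swallowed."""
    try:
        return lookup(corpus_id, path)
    except FileNotFound as exc:
        raise HttpError(404, str(exc)) from exc
```

Catching the specific not-found exception, rather than a bare `Exception`, keeps genuine server-side failures distinguishable from a missing file.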

github-actions[bot] commented 1 week ago

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Best practice
Suggestion: Implement error handling for file retrieval to manage potential exceptions
Score (suggestion importance, 1-10): 8

Add error handling for the `c.file_api.get_file(split.file_id)` call to manage potential exceptions and ensure robustness.

[py/packages/corpora_cli/commands/split.py [29-30]](https://github.com/skyl/corpora/pull/19/files#diff-05454f1a232d5d3506ccbd1c61a8074269ec8cc48e9d7f8f2f8639f1787708a0R29-R30)

```diff
-phial = c.file_api.get_file(split.file_id)
-c.console.print(f"File: {phial.path}")
+try:
+    phial = c.file_api.get_file(split.file_id)
+    c.console.print(f"File: {phial.path}")
+except Exception as e:
+    c.console.print(f"Error retrieving file: {e}", style="bold red")
```

Why: Adding error handling for the file retrieval process is crucial for managing exceptions that may occur, such as network issues or invalid file IDs. This suggestion significantly improves the robustness and user feedback of the application.

Category: Best practice
Suggestion: Add error handling for database retrieval to manage potential exceptions
Score (suggestion importance, 1-10): 8

Consider adding error handling for the `CorpusTextFile.objects.select_related("corpus").aget` call to manage potential exceptions like database access issues.

[py/packages/corpora/routers/corpustextfile.py [50-53]](https://github.com/skyl/corpora/pull/19/files#diff-76b598a80f83577973d8fd235d1c9bde05a20e944ac96b5e438fc62f28338146R50-R53)

```diff
-ctf = await CorpusTextFile.objects.select_related("corpus").aget(
-    corpus__id=corpus_id, path=path
-)
-return ctf
+try:
+    ctf = await CorpusTextFile.objects.select_related("corpus").aget(
+        corpus__id=corpus_id, path=path
+    )
+    return ctf
+except Exception as e:
+    raise HttpError(500, f"Error retrieving file: {e}")
```

Why: The suggestion to add error handling for the database retrieval operation is important for managing exceptions like database access issues. This enhances the application's reliability by providing meaningful error messages to the user in case of failures.

Category: Possible issue
Suggestion: Handle the case where the search result is empty to prevent iteration issues
Score (suggestion importance, 1-10): 7

Consider handling the case where `c.split_api.vector_search(query)` returns an empty list to avoid potential issues when iterating over `res`.

[py/packages/corpora_cli/commands/split.py [26-31]](https://github.com/skyl/corpora/pull/19/files#diff-05454f1a232d5d3506ccbd1c61a8074269ec8cc48e9d7f8f2f8639f1787708a0R26-R31)

```diff
 res = c.split_api.vector_search(query)
+if not res:
+    c.console.print("No splits found.")
 for split in res:
     phial = c.file_api.get_file(split.file_id)
     c.console.print(f"File: {phial.path}")
     c.console.print(f"{split.order} {split.content[:100]}", style="dim")
```

Why: The suggestion to handle an empty result set from `c.split_api.vector_search(query)` is valuable as it prevents potential runtime errors when iterating over an empty list. This improves the robustness of the code by providing user feedback when no results are found.

Category: Possible bug
Suggestion: Validate the corpus ID from the configuration to ensure it is valid before using it in the search query
Score (suggestion importance, 1-10): 6

Ensure that the `corpus_id` used in `c.config["id"]` is valid and exists to prevent potential errors during the search operation.

[py/packages/corpora_cli/commands/split.py [20-24]](https://github.com/skyl/corpora/pull/19/files#diff-05454f1a232d5d3506ccbd1c61a8074269ec8cc48e9d7f8f2f8639f1787708a0R20-R24)

```diff
+if "id" not in c.config or not c.config["id"]:
+    raise ValueError("Invalid or missing corpus ID in configuration.")
 query = SplitVectorSearchSchema(
     corpus_id=c.config["id"],
     text=text,
     limit=limit,
 )
```

Why: Validating the `corpus_id` ensures that the search operation does not proceed with an invalid or missing ID, which could lead to errors. This suggestion enhances the reliability of the code by preventing potential misconfigurations.
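For reference, the empty-result and file-retrieval suggestions above could be combined roughly as follows. This is a sketch, not the PR's code: the calls `c.split_api.vector_search`, `c.file_api.get_file`, and `c.console.print` come from the quoted diffs, while `print_search_results`, `RecordingConsole`, and `make_context` are hypothetical helpers added so the example runs without the real API client.

```python
from types import SimpleNamespace


def print_search_results(c, query) -> int:
    """Print each matching split; return how many were printed."""
    res = c.split_api.vector_search(query)
    if not res:
        c.console.print("No splits found.")
        return 0
    shown = 0
    for split in res:
        try:
            phial = c.file_api.get_file(split.file_id)
        except Exception as e:  # broad on purpose, mirroring the suggestion
            c.console.print(f"Error retrieving file: {e}", style="bold red")
            continue  # skip this split but keep listing the rest
        c.console.print(f"File: {phial.path}")
        c.console.print(f"{split.order} {split.content[:100]}", style="dim")
        shown += 1
    return shown


class RecordingConsole:
    """Hypothetical console stub that records printed text for inspection."""

    def __init__(self) -> None:
        self.lines = []

    def print(self, text, style=None) -> None:
        self.lines.append(text)


def make_context(splits, files):
    """Build a hypothetical stand-in for the CLI context object `c`."""

    def get_file(file_id):
        if file_id not in files:
            raise LookupError(f"no file {file_id}")
        return SimpleNamespace(path=files[file_id])

    return SimpleNamespace(
        split_api=SimpleNamespace(vector_search=lambda query: splits),
        file_api=SimpleNamespace(get_file=get_file),
        console=RecordingConsole(),
    )
```

Using `continue` after a failed lookup means one unreadable file does not hide the remaining search results, which is one reasonable reading of the suggestion.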

skyl commented 1 week ago

/describe

github-actions[bot] commented 1 week ago

PR Description updated to latest commit (https://github.com/skyl/corpora/commit/dc9247ce135c17a822e02606310437c470f4275d)
