skyl / corpora

Corpora is a self-building corpus that can help build other arbitrary corpora
GNU Affero General Public License v3.0
2 stars 0 forks source link

feat(sync): work on new corpus flow - sync and init #26

Closed skyl closed 1 week ago

skyl commented 1 week ago

PR Type

enhancement, tests, documentation


Description


Changes walkthrough πŸ“

Relevant files
Enhancement
8 files
models.py
Add methods for file hash retrieval and deletion in Corpus model

py/packages/corpora/models.py
  • Added get_file_hashes method to retrieve file hashes.
  • Added delete_files method to delete files by path.
  • +13/-0   
    corpus.py
    Implement endpoints for updating and retrieving corpus files

    py/packages/corpora/routers/corpus.py
  • Added update_files endpoint to update corpus files.
  • Added get_file_hashes endpoint to retrieve file hashes.
  • +44/-1   
    tasks.py
    Enhance process_tarball to update existing files                 

    py/packages/corpora/tasks.py - Modified `process_tarball` to update existing files.
    +6/-4     
    corpus.py
    Add sync command and enhance init command in CLI                 

    py/packages/corpora_cli/commands/corpus.py
  • Added sync command to synchronize corpus files.
  • Enhanced init command to save corpus ID.
  • +92/-6   
    config.py
    Add config saving and enhance loading with Git defaults   

    py/packages/corpora_cli/config.py
  • Added save_config function to save configuration.
  • Enhanced load_config to infer defaults from Git.
  • +31/-9   
    git.py
    Add utility functions for Git operations                                 

    py/packages/corpora_cli/utils/git.py - Added utility functions for Git operations.
    +58/-0   
    corpus_api.py
    Add methods for file hash retrieval and update in Corpus API

    py/packages/corpora_client/api/corpus_api.py - Added `get_file_hashes` and `update_files` methods.
    +530/-1 
    update_corpus_schema.py
    Add UpdateCorpusSchema model for corpus file updates         

    py/packages/corpora_client/models/update_corpus_schema.py - Added `UpdateCorpusSchema` model for updating corpus files.
    +88/-0   
    Tests
    5 files
    test_corpus.py
    Add test for updating corpus files endpoint                           

    py/packages/corpora/routers/test_corpus.py - Added test for `update_files` endpoint.
    +34/-3   
    test_models_corpus.py
    Add tests for file hash retrieval and deletion in Corpus model

    py/packages/corpora/test_models_corpus.py - Added tests for `get_file_hashes` and `delete_files` methods.
    +90/-0   
    test_corpus.py
    Enhance init command test to verify config saving               

    py/packages/corpora_cli/commands/test_corpus.py - Enhanced `test_init_command` to verify config saving.
    +40/-4   
    test_corpus_api.py
    Add test stubs for file hash retrieval and update in Corpus API

    py/packages/corpora_client/test/test_corpus_api.py - Added test stubs for `get_file_hashes` and `update_files`.
    +14/-0   
    test_update_corpus_schema.py
    Add unit tests for UpdateCorpusSchema model                           

    py/packages/corpora_client/test/test_update_corpus_schema.py - Added unit tests for `UpdateCorpusSchema`.
    +55/-0   
    Documentation
    2 files
    CorpusApi.md
    Document get_file_hashes and update_files API methods       

    py/packages/corpora_client/docs/CorpusApi.md - Documented `get_file_hashes` and `update_files` API methods.
    +161/-0 
    UpdateCorpusSchema.md
    Add documentation for UpdateCorpusSchema model                     

    py/packages/corpora_client/docs/UpdateCorpusSchema.md - Added documentation for `UpdateCorpusSchema`.
    +29/-0   

    πŸ’‘ PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

    github-actions[bot] commented 1 week ago

    PR Reviewer Guide πŸ”

    (Review updated until commit https://github.com/skyl/corpora/commit/d5957a0dabec908f0364f1d64cf3f1322dfda07d)

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 πŸ”΅πŸ”΅πŸ”΅πŸ”΅βšͺ
    πŸ§ͺ PR contains tests
    πŸ”’ No security concerns identified
    ⚑ Recommended focus areas for review

    Code Smell
    The `get_file_hashes` method lacks type annotations for its return value. Consider adding type hints for better code readability and maintenance. Code Smell
    The `update_files` function uses `print` statements for logging. Consider using a logging framework for better control over log levels and outputs. Code Smell
    The `sync` function has a complex logic flow with multiple responsibilities. Consider refactoring to improve readability and maintainability.
    github-actions[bot] commented 1 week ago

    PR Code Suggestions ✨

    Latest suggestions up to d5957a0 Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Score
    Possible bug
    Format the delete_files parameter as a CSV string when appending to form parameters ___ **Ensure that the delete_files parameter is properly formatted as a CSV string when
    appending to _form_params.** [py/packages/corpora_client/api/corpus_api.py [1533-1534]](https://github.com/skyl/corpora/pull/26/files#diff-d022f050737523ecaf2769d11bbb5a916e2317faf95579b98c4f334cd687d08aR1533-R1534) ```diff -if delete_files is not None: - _form_params.append(("delete_files", delete_files)) +if delete_files: + _form_params.append(("delete_files", ','.join(delete_files))) ```
    Suggestion importance[1-10]: 9 Why: Ensuring the `delete_files` parameter is formatted as a CSV string is essential for correct API request formation. This suggestion addresses a potential bug that could lead to incorrect data being sent to the server, thus improving the correctness of the code.
    9
    Enhancement
    Add error handling for API calls to improve robustness ___ **Consider adding error handling for the API call to manage potential exceptions and
    improve robustness.** [py/packages/corpora_client/api/corpus_api.py [872-874]](https://github.com/skyl/corpora/pull/26/files#diff-d022f050737523ecaf2769d11bbb5a916e2317faf95579b98c4f334cd687d08aR872-R874) ```diff -response_data = self.api_client.call_api( - *_param, _request_timeout=_request_timeout -) +try: + response_data = self.api_client.call_api( + *_param, _request_timeout=_request_timeout + ) +except Exception as e: + handle_error(e) ```
    Suggestion importance[1-10]: 8 Why: Adding error handling around API calls is crucial for robustness, as it allows the system to gracefully handle unexpected issues during network operations. This suggestion significantly improves the code's resilience to runtime errors.
    8
    Add a network connectivity check before uploading files in the sync function ___ **Ensure that the sync function checks for network connectivity before attempting to
    upload files to prevent errors during the upload process.** [py/packages/corpora_cli/commands/corpus.py [115-122]](https://github.com/skyl/corpora/pull/26/files#diff-c04094e69a69881444f8b97d9fee7e0d29683f5e062cb4d2cf4530bc08daedf7R115-R122) ```diff -c.console.print("Uploading tarball...") -c.corpus_api.update_files( - corpus_id=corpus_id, - tarball=tarball, - delete_files=[str(file) for file in files_to_delete] or None, -) +if check_network_connectivity(): + c.console.print("Uploading tarball...") + c.corpus_api.update_files( + corpus_id=corpus_id, + tarball=tarball, + delete_files=[str(file) for file in files_to_delete] or None, + ) ```
    Suggestion importance[1-10]: 5 Why: Adding a network connectivity check before uploading files can prevent errors during the upload process, enhancing the function's reliability. However, the suggestion is not directly actionable without defining the `check_network_connectivity` function.
    5
    Possible issue
    Add error handling to manage non-existent file paths in the delete_files method ___ **Consider adding error handling for the delete_files method to manage cases where
    file paths in the list do not exist in the database, to prevent potential runtime
    errors.** [py/packages/corpora/models.py [92]](https://github.com/skyl/corpora/pull/26/files#diff-c9de5374aa987d761f770eb99739e686761246600fc6cc0155902d79c9aa3ea5R92-R92) ```diff -self.files.filter(path__in=files).delete() +if files: + existing_files = self.files.filter(path__in=files) + if existing_files.exists(): + existing_files.delete() ```
    Suggestion importance[1-10]: 7 Why: The suggestion enhances the robustness of the `delete_files` method by adding error handling for non-existent file paths, which can prevent potential runtime errors and improve the method's reliability.
    7
    Validate that the tarball is not empty before proceeding with the update ___ **Validate that tarball is not empty before proceeding with the update to avoid
    unnecessary API calls.** [py/packages/corpora_client/api/corpus_api.py [1535-1536]](https://github.com/skyl/corpora/pull/26/files#diff-d022f050737523ecaf2769d11bbb5a916e2317faf95579b98c4f334cd687d08aR1535-R1536) ```diff -if tarball is not None: +if tarball: _files["tarball"] = tarball ```
    Suggestion importance[1-10]: 6 Why: Checking if the tarball is not empty before proceeding prevents unnecessary API calls and potential errors. This suggestion improves the efficiency and reliability of the update operation.
    6
    Best practice
    Ensure proper closure of response data to prevent resource leaks ___ **Ensure that the response_data is properly closed or released after reading to
    prevent potential resource leaks.** [py/packages/corpora_client/api/corpus_api.py [875]](https://github.com/skyl/corpora/pull/26/files#diff-d022f050737523ecaf2769d11bbb5a916e2317faf95579b98c4f334cd687d08aR875-R875) ```diff response_data.read() +response_data.close() ```
    Suggestion importance[1-10]: 7 Why: Closing the response data after reading is a good practice to prevent resource leaks, especially in network operations. This suggestion enhances the robustness of the code by ensuring resources are properly released.
    7
    Add exception handling to the run_command function for better error management ___ **Handle exceptions in the run_command function to provide more informative error
    messages and prevent the application from crashing if a subprocess call fails.** [py/packages/corpora_cli/utils/git.py [10-12]](https://github.com/skyl/corpora/pull/26/files#diff-f67a5ba9dbf7cd4759dde33fb7883d6ef0b1cea1c4cc10184f05287fec4b4f49R10-R12) ```diff -result = subprocess.run( - command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=True -) +try: + result = subprocess.run( + command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=True + ) +except subprocess.CalledProcessError as e: + return f"Error executing command: {e}" ```
    Suggestion importance[1-10]: 6 Why: The suggestion improves error handling in the `run_command` function, providing more informative error messages and preventing application crashes due to subprocess call failures. This enhances the function's robustness.
    6
    Add a validation check to ensure only valid YAML data is written in the save_config function ___ **Consider adding a check to ensure that the save_config function only writes valid
    YAML data to prevent corrupting the configuration file.** [py/packages/corpora_cli/config.py [52-53]](https://github.com/skyl/corpora/pull/26/files#diff-d5d4ce34a5030133a9b563f560c601dd7d497a186be5bbf0c12a248ea466c948R52-R53) ```diff -with open(CONFIG_FILE_PATH, "w") as file: - yaml.safe_dump(config, file) +if isinstance(config, dict): + with open(CONFIG_FILE_PATH, "w") as file: + yaml.safe_dump(config, file) ```
    Suggestion importance[1-10]: 4 Why: The suggestion adds a basic validation check to ensure that only valid YAML data is written, which can prevent potential file corruption. However, the check is minimal and may not cover all edge cases of invalid data.
    4

    Previous suggestions

    Suggestions up to commit 0fb6816
    CategorySuggestion                                                                                                                                    Score
    Bug
    Correct the method used to check file existence in the get_local_files function ___ **Modify the get_local_files function to check if the path exists using
    os.path.exists() instead of path.exists(), as path is a string and does not have an
    exists method.** [py/packages/corpora_cli/utils/git.py [31]](https://github.com/skyl/corpora/pull/26/files#diff-f67a5ba9dbf7cd4759dde33fb7883d6ef0b1cea1c4cc10184f05287fec4b4f49R31-R31) ```diff -if path.exists(): +import os +if os.path.exists(path): local_files[path] = get_file_hash(path) ```
    Suggestion importance[1-10]: 9 Why: The suggestion correctly identifies a bug where a string method is incorrectly used. Fixing this bug is essential for the correct functionality of the `get_local_files` function, making this suggestion highly impactful.
    9
    Possible issue
    Add exception handling for API calls in the sync function to improve robustness ___ **Ensure that the sync function handles potential exceptions from the get_file_hashes
    and delete_files API calls to prevent the application from crashing if the server is
    unreachable or returns an error.** [py/packages/corpora_cli/commands/corpus.py [79-108]](https://github.com/skyl/corpora/pull/26/files#diff-c04094e69a69881444f8b97d9fee7e0d29683f5e062cb4d2cf4530bc08daedf7R79-R108) ```diff -remote_files_map = c.corpus_api.get_file_hashes(corpus_id) +try: + remote_files_map = c.corpus_api.get_file_hashes(corpus_id) +except ApiException as e: + # Handle the exception ... -c.corpus_api.delete_files(corpus_id, files_to_delete) +try: + c.corpus_api.delete_files(corpus_id, files_to_delete) +except ApiException as e: + # Handle the exception ```
    Suggestion importance[1-10]: 8 Why: Adding exception handling for API calls in the `sync` function is crucial for preventing application crashes due to server errors or unreachability. This suggestion significantly enhances the robustness of the code.
    8
    Add error handling for non-existent file paths in the delete_files method ___ **Consider adding error handling for the delete_files method to handle cases where the
    file paths provided do not exist in the database, which could prevent potential
    runtime errors.** [py/packages/corpora/models.py [92]](https://github.com/skyl/corpora/pull/26/files#diff-c9de5374aa987d761f770eb99739e686761246600fc6cc0155902d79c9aa3ea5R92-R92) ```diff -self.files.filter(path__in=files).delete() +if not self.files.filter(path__in=files).exists(): + # Handle the error or log a warning +else: + self.files.filter(path__in=files).delete() ```
    Suggestion importance[1-10]: 7 Why: Adding error handling for non-existent file paths in the `delete_files` method can prevent runtime errors and improve the robustness of the code. This suggestion is relevant and accurately addresses a potential issue in the PR.
    7
    Validate the request_body parameter to ensure data integrity ___ **Ensure that the request_body parameter is validated to prevent unexpected data from
    being processed.** [py/packages/corpora_client/api/corpus_api.py [576-579]](https://github.com/skyl/corpora/pull/26/files#diff-d022f050737523ecaf2769d11bbb5a916e2317faf95579b98c4f334cd687d08aR576-R579) ```diff +def delete_files( + self, + corpus_id: StrictStr, + request_body: List[StrictStr], - ```
    Suggestion importance[1-10]: 7 Why: Ensuring the `request_body` parameter is validated helps maintain data integrity and prevents unexpected data from being processed. This suggestion is important for maintaining the correctness and reliability of the function.
    7
    Validate corpus_id and request_body parameters to ensure they are correct before API calls ___ **Verify that the corpus_id and request_body parameters are correctly validated before
    making the API call to prevent invalid requests.** [py/packages/corpora_client/docs/CorpusApi.md [214-215]](https://github.com/skyl/corpora/pull/26/files#diff-b4f8c98fe304fa457fbd86826fa70a6260ef956625a70bdf4760481d4cd47c20R214-R215) ```diff -corpus_id = 'corpus_id_example' # str | -request_body = ['request_body_example'] # List[str] | +corpus_id = 'valid_corpus_id' # Ensure this is a valid corpus ID +request_body = ['valid_request_body'] # Ensure this contains valid file paths ```
    Suggestion importance[1-10]: 6 Why: Validating the `corpus_id` and `request_body` parameters before making API calls is a good practice to prevent invalid requests. This suggestion improves the reliability of the API interactions by ensuring that the parameters are correct.
    6
    Security
    Validate the corpus_id parameter to prevent potential security vulnerabilities ___ **Ensure that the corpus_id parameter is properly validated to prevent potential
    security issues such as injection attacks.** [py/packages/corpora_client/api/corpus_api.py [576-579]](https://github.com/skyl/corpora/pull/26/files#diff-d022f050737523ecaf2769d11bbb5a916e2317faf95579b98c4f334cd687d08aR576-R579) ```diff +def delete_files( + self, + corpus_id: StrictStr, + request_body: List[StrictStr], - ```
    Suggestion importance[1-10]: 8 Why: Validating the `corpus_id` parameter is crucial for preventing security vulnerabilities such as injection attacks. This suggestion addresses a potential security concern, making it highly relevant and impactful.
    8
    Best practice
    Add error handling for API calls to improve robustness ___ **Consider adding error handling for the call_api method to manage potential
    exceptions and ensure robustness.** [py/packages/corpora_client/api/corpus_api.py [635-636]](https://github.com/skyl/corpora/pull/26/files#diff-d022f050737523ecaf2769d11bbb5a916e2317faf95579b98c4f334cd687d08aR635-R636) ```diff -response_data = self.api_client.call_api( - *_param, _request_timeout=_request_timeout -) +try: + response_data = self.api_client.call_api( + *_param, _request_timeout=_request_timeout + ) +except Exception as e: + # handle exception ```
    Suggestion importance[1-10]: 7 Why: Adding error handling for the `call_api` method enhances the robustness of the code by managing potential exceptions. This is a best practice that can prevent runtime errors and improve the reliability of the application.
    7
    Possible bug
    Initialize the tarball variable with a valid bytearray to avoid errors ___ **Ensure that the tarball variable is properly initialized with a valid bytearray
    before being passed to the update_files method to prevent potential errors.** [py/packages/corpora_client/docs/CorpusApi.md [524]](https://github.com/skyl/corpora/pull/26/files#diff-b4f8c98fe304fa457fbd86826fa70a6260ef956625a70bdf4760481d4cd47c20R524-R524) ```diff -tarball = None # bytearray | +tarball = bytearray() # Initialize with an empty bytearray or a valid bytearray ```
    Suggestion importance[1-10]: 7 Why: Initializing the `tarball` variable with a valid bytearray is crucial to prevent runtime errors when calling the `update_files` method. This suggestion addresses a potential bug by ensuring the variable is properly initialized, which enhances the robustness of the code.
    7
    Enhancement
    Improve error handling in get_git_remote_url by logging command failures ___ **Add error handling for the get_git_remote_url function to log or notify when the Git
    command fails, instead of silently returning an empty string, to aid in debugging.** [py/packages/corpora_cli/config.py [49-50]](https://github.com/skyl/corpora/pull/26/files#diff-d5d4ce34a5030133a9b563f560c601dd7d497a186be5bbf0c12a248ea466c948R49-R50) ```diff -except subprocess.CalledProcessError: - return "" # Return an empty string if Git command fails +except subprocess.CalledProcessError as e: + # Log the error or notify the user + return "" ```
    Suggestion importance[1-10]: 6 Why: While the suggestion to log errors in `get_git_remote_url` can aid in debugging, it is not critical to the functionality of the code. However, it does enhance the code's maintainability by providing better error visibility.
    6
    skyl commented 1 week ago

    /describe

    skyl commented 1 week ago

    /review

    github-actions[bot] commented 1 week ago

    Persistent review updated to latest commit https://github.com/skyl/corpora/commit/d5957a0dabec908f0364f1d64cf3f1322dfda07d

    github-actions[bot] commented 1 week ago

    PR Description updated to latest commit (https://github.com/skyl/corpora/commit/d5957a0dabec908f0364f1d64cf3f1322dfda07d)

    github-actions[bot] commented 1 week ago

    Persistent review updated to latest commit https://github.com/skyl/corpora/commit/d5957a0dabec908f0364f1d64cf3f1322dfda07d