skyl / corpora

Corpora is a self-building corpus that can help build other arbitrary corpora
GNU Affero General Public License v3.0

feat(routers): reorg, expand, start split search #16

Closed: skyl closed this 2 weeks ago

skyl commented 2 weeks ago

PR Type

enhancement, tests


Description


Changes walkthrough 📝

Relevant files
Enhancement
14 files
splits_api.py
Introduce SplitsApi for managing split operations               

py/packages/corpora_client/api/splits_api.py
  • Added a new SplitsApi class for managing split-related API calls.
  • Implemented methods for retrieving, listing, and searching splits.
  • Utilized Pydantic for data validation and serialization.
  • +775/-0 
    corpus_api.py
    Refactor CorpusApi and update method names                             

    py/packages/corpora_client/api/corpus_api.py
  • Renamed CorporaApi to CorpusApi.
  • Updated method names to reflect new router structure.
  • Removed file-related methods to separate API files.
  • +32/-528
    files_api.py
    Introduce FilesApi for managing file operations                   

    py/packages/corpora_client/api/files_api.py
  • Added a new FilesApi class for managing file-related API calls.
  • Implemented methods for creating and retrieving files.
  • Utilized Pydantic for data validation and serialization.
  • +535/-0 
    split_vector_search_schema.py
    Add SplitVectorSearchSchema model for vector search           

    py/packages/corpora_client/models/split_vector_search_schema.py
  • Added SplitVectorSearchSchema model for vector search operations.
  • Defined fields for corpus ID, vector, and limit.
  • Included methods for JSON and dictionary serialization.
  • +91/-0   
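The bullets above say the search payload carries a corpus ID, a vector, and a limit, and that it can round-trip through JSON and dictionaries. The generated client uses Pydantic for this; the sketch below swaps in a standard-library dataclass so it runs without extra dependencies. The class name, field types, and default limit are assumptions for illustration, not the generated model.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class SplitVectorSearchSketch:
    """Hypothetical stand-in for the payload: corpus ID, query vector, limit."""

    corpus_id: str
    vector: List[float]
    limit: int = 10  # assumed default, not taken from the PR

    def to_dict(self) -> dict:
        # Plain-dict form, suitable for passing to json.dumps.
        return asdict(self)

    def to_json(self) -> str:
        return json.dumps(self.to_dict())

    @classmethod
    def from_json(cls, raw: str) -> "SplitVectorSearchSketch":
        # Rebuild the object from its JSON form.
        return cls(**json.loads(raw))


payload = SplitVectorSearchSketch(corpus_id="c-1", vector=[0.1, 0.2], limit=5)
round_tripped = SplitVectorSearchSketch.from_json(payload.to_json())
```

A round trip through `to_json` and `from_json` should give back an equal object, which is the property the schema unit tests in this PR are described as checking.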
    split_response_schema.py
    Add SplitResponseSchema model for split responses               

    py/packages/corpora_client/models/split_response_schema.py
  • Added SplitResponseSchema model for split response handling.
  • Defined fields for ID, content, order, and file ID.
  • Included methods for JSON and dictionary serialization.
  • +93/-0   
    file_response_schema.py
    Update FileResponseSchema to include corpus_id                     

    py/packages/corpora_client/models/file_response_schema.py
  • Added corpus_id field to FileResponseSchema.
  • Removed corpus-related fields and methods.
  • +3/-11   
    corpus.py
    Add corpus router with CRUD operations                                     

    py/packages/corpora/routers/corpus.py
  • Added new router for corpus operations.
  • Implemented endpoints for creating, deleting, and listing corpora.
  • Utilized Django async ORM and Ninja framework.
  • +64/-0   
    split.py
    Add split router with retrieval and search operations       

    py/packages/corpora/routers/split.py
  • Added new router for split operations.
  • Implemented endpoints for retrieving and listing splits.
  • Added vector search functionality using cosine similarity.
  • +47/-0   
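The split router is described as ranking splits by cosine similarity. The router's actual query is not shown in this walkthrough, so the following is only an in-memory illustration of the ranking idea in plain Python. The function names and the `(split_id, vector)` tuple layout are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cos(angle between a and b); 0.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def nearest_splits(
    query: Sequence[float],
    splits: List[Tuple[str, Sequence[float]]],
    limit: int,
) -> List[str]:
    """Rank (split_id, vector) pairs by ascending distance and keep `limit` IDs."""
    ranked = sorted(splits, key=lambda item: cosine_distance(query, item[1]))
    return [split_id for split_id, _ in ranked[:limit]]


splits = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.9, 0.1])]
top_two = nearest_splits([1.0, 0.0], splits, limit=2)
```

Sorting by distance in ascending order puts the most similar splits first; sorting by raw similarity would need descending order instead.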
    corpustextfile.py
    Add file router with CRUD operations                                         

    py/packages/corpora/routers/corpustextfile.py
  • Added new router for file operations within a corpus.
  • Implemented endpoints for creating and retrieving files.
  • Utilized Django async ORM and Ninja framework.
  • +41/-0   
    __init__.py
    Update package imports for new API structure                         

    py/packages/corpora_client/__init__.py
  • Updated imports to include new API classes and models.
  • Removed deprecated CorporaApi import.
  • +5/-1     
    schema.py
    Update schema definitions for split and file operations   

    py/packages/corpora/schema.py
  • Added SplitVectorSearchSchema and SplitResponseSchema to schema definitions.
  • Updated FileResponseSchema to include corpus_id.
  • +17/-1   
    __init__.py
    Update model imports for split schemas                                     

    py/packages/corpora_client/models/__init__.py
  • Updated model imports to include split-related schemas.
  • +2/-0
    router.py
    Add main router to aggregate API endpoints                             

    py/packages/corpora/router.py
  • Added main router to aggregate corpus, file, and split routers.
  • Utilized Ninja framework for API routing.
  • +12/-0   
    __init__.py
    Update API imports for new structure                                         

    py/packages/corpora_client/api/__init__.py
  • Updated API imports to include new CorpusApi, FilesApi, and SplitsApi.
  • +3/-1
    Tests
    10 files
    test_corpus.py
    Add test cases for Corpus API operations                                 

    py/packages/corpora/routers/test_corpus.py
  • Added test cases for creating, retrieving, and deleting corpora.
  • Included tests for handling conflicts and not found errors.
  • Utilized Django's test framework and pytest for asynchronous testing.
  • +115/-0 
    test_split.py
    Add test cases for Split API operations                                   

    py/packages/corpora/routers/test_split.py
  • Added test cases for split retrieval and vector search.
  • Utilized Django's test framework and pytest for asynchronous testing.
  • Tested vector similarity search functionality.
  • +64/-0   
    test_corpustextfile.py
    Add test cases for CorpusTextFile API operations                 

    py/packages/corpora/routers/test_corpustextfile.py
  • Added test cases for file creation and retrieval.
  • Included tests for handling duplicate paths and not found errors.
  • Utilized Django's test framework and pytest for asynchronous testing.
  • +75/-0   
    test_file_response_schema.py
    Update FileResponseSchema tests to include corpus_id         

    py/packages/corpora_client/test/test_file_response_schema.py
  • Updated test cases to include corpus_id in FileResponseSchema.
  • Removed corpus-related fields from the test setup.
  • +3/-13   
    test_split_vector_search_schema.py
    Add unit tests for SplitVectorSearchSchema                             

    py/packages/corpora_client/test/test_split_vector_search_schema.py
  • Added unit tests for SplitVectorSearchSchema.
  • Tested JSON and dictionary serialization methods.
  • +61/-0   
    test_lib.py
    Add test library for creating test data                                   

    py/packages/corpora/routers/test_lib.py
  • Added helper functions for creating users, corpora, files, and splits.
  • Utilized Django async ORM and OAuth2 for authentication.
  • +48/-0   
    test_split_response_schema.py
    Add unit tests for SplitResponseSchema                                     

    py/packages/corpora_client/test/test_split_response_schema.py
  • Added unit tests for SplitResponseSchema.
  • Tested JSON and dictionary serialization methods.
  • +60/-0   
    test_corpus_api.py
    Add unit tests for CorpusApi                                                         

    py/packages/corpora_client/test/test_corpus_api.py
  • Added unit tests for CorpusApi.
  • Included test stubs for CRUD operations.
  • +59/-0   
    test_splits_api.py
    Add unit tests for SplitsApi                                                         

    py/packages/corpora_client/test/test_splits_api.py
  • Added unit tests for SplitsApi.
  • Included test stubs for split-related operations.
  • +52/-0   
    test_files_api.py
    Add unit tests for FilesApi                                                           

    py/packages/corpora_client/test/test_files_api.py
  • Added unit tests for FilesApi.
  • Included test stubs for file-related operations.
  • +45/-0   
    Configuration changes
    1 file
    urls.py
    Update URL routing to use new corpora router                         

    py/packages/corpora_proj/urls.py
  • Updated import to use new `corpora_router`.
  • +1/-1
    Documentation
    1 file
    FileResponseSchema.md
    Update FileResponseSchema documentation                                   

    py/packages/corpora_client/docs/FileResponseSchema.md
  • Updated documentation to include corpus_id in FileResponseSchema.
  • Removed corpus field from the schema documentation.
  • +1/-1     
    Additional files (token-limit)
    7 files
    CorpusApi.md
    ...                                                                                                           

    py/packages/corpora_client/docs/CorpusApi.md ...
    +30/-189
    SplitsApi.md
    ...                                                                                                           

    py/packages/corpora_client/docs/SplitsApi.md ...
    +246/-0 
    FilesApi.md
    ...                                                                                                           

    py/packages/corpora_client/docs/FilesApi.md ...
    +168/-0 
    README.md
    ...                                                                                                           

    py/packages/corpora_client/README.md ...
    +15/-10 
    TODO.md
    ...                                                                                                           

    TODO.md ...
    +19/-19 
    SplitVectorSearchSchema.md
    ...                                                                                                           

    py/packages/corpora_client/docs/SplitVectorSearchSchema.md ...
    +31/-0   
    SplitResponseSchema.md
    ...                                                                                                           

    py/packages/corpora_client/docs/SplitResponseSchema.md ...
    +32/-0   


github-actions[bot] commented 2 weeks ago

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Code Smell
The method names in `CorpusApi` have become quite verbose after the refactoring. Consider simplifying the method names for better readability and maintainability.

Data Model Change
The `FileResponseSchema` model has been modified to replace the `corpus` field with `corpus_id`. Ensure that this change is compatible with other parts of the system that rely on this model.

New Feature
The `SplitVectorSearchSchema` model is introduced for vector search functionality. Verify that this new feature is correctly integrated and tested.

github-actions[bot] commented 2 weeks ago

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Score

Category: Possible bug
Suggestion: Add error handling for missing corpus objects to prevent unhandled exceptions

**Add error handling for cases where the corpus object is not found to prevent unhandled exceptions.**

[py/packages/corpora/routers/corpustextfile.py [20-21]](https://github.com/skyl/corpora/pull/16/files#diff-76b598a80f83577973d8fd235d1c9bde05a20e944ac96b5e438fc62f28338146R20-R21)

```diff
-corpus = await Corpus.objects.aget(id=payload.corpus_id)
+try:
+    corpus = await Corpus.objects.aget(id=payload.corpus_id)
+except Corpus.DoesNotExist:
+    raise HttpError(404, "Corpus not found.")
 checksum = compute_checksum(payload.content)
```

Suggestion importance[1-10]: 9
Why: This suggestion addresses a potential bug by adding error handling for cases where the corpus object is not found. It prevents unhandled exceptions and improves the reliability of the code by providing clear error messages.
Score: 9
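
The pattern behind this suggestion is to translate a "row not found" lookup error into an HTTP 404 at the endpoint boundary. The sketch below shows that translation with plain Python stand-ins so it runs without Django or Ninja. `CorpusNotFound`, `HttpErrorSketch`, the `CORPORA` dictionary, and `create_file` are all hypothetical names for illustration, not code from the PR.

```python
from typing import Dict


class CorpusNotFound(LookupError):
    """Hypothetical stand-in for Corpus.DoesNotExist."""


class HttpErrorSketch(Exception):
    """Hypothetical stand-in for the framework's HTTP error type."""

    def __init__(self, status_code: int, message: str) -> None:
        super().__init__(message)
        self.status_code = status_code
        self.message = message


CORPORA: Dict[str, str] = {"c-1": "corpora"}  # fake in-memory "table"


def get_corpus(corpus_id: str) -> str:
    """Look up a corpus name, raising CorpusNotFound for unknown IDs."""
    try:
        return CORPORA[corpus_id]
    except KeyError:
        raise CorpusNotFound(corpus_id) from None


def create_file(corpus_id: str, content: str) -> dict:
    """Resolve the parent corpus first; report a 404 if it is missing."""
    try:
        corpus_name = get_corpus(corpus_id)
    except CorpusNotFound:
        raise HttpErrorSketch(404, "Corpus not found.") from None
    return {"corpus": corpus_name, "size": len(content)}
```

Catching only the specific not-found exception keeps other failures, such as a database outage, from being reported as a misleading 404.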

Category: Possible issue
Suggestion: Add a validation check to ensure the uploaded tarball is not empty before processing

**Consider adding a check to ensure that the tarball file is not empty before processing it to prevent unnecessary operations and potential errors.**

[py/packages/corpora/routers/corpus.py [27-33]](https://github.com/skyl/corpora/pull/16/files#diff-3ac4611171c3ba2b1e471a17f85e949a5df6e835391d39367db05286216d1762R27-R33)

```diff
+if not tarball.size:
+    raise HttpError(400, "Uploaded tarball is empty.")
 tarball_content: bytes = await sync_to_async(tarball.read)()
 try:
     corpus_instance = await Corpus.objects.acreate(
         name=corpus.name,
         url=corpus.url,
         owner=request.user,
     )
```

Suggestion importance[1-10]: 8
Why: This suggestion adds a crucial validation step to ensure that the uploaded tarball is not empty, which prevents unnecessary operations and potential errors. This enhances the robustness of the code by handling a common edge case.
Score: 8
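
The suggested check only looks at the upload's size. A slightly stricter variant, sketched here with the standard-library `tarfile` module, also rejects uploads that are not valid archives or that contain no files. It works on raw bytes and assumes gzip compression; both are assumptions, as are the function names. Raising `ValueError` stands in for whatever HTTP error the endpoint would return.

```python
import io
import tarfile


def validate_tarball(data: bytes) -> int:
    """Return the number of regular files in a gzip tarball, or raise ValueError."""
    if not data:
        raise ValueError("Uploaded tarball is empty.")
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
            file_count = sum(1 for member in archive if member.isfile())
    except (tarfile.TarError, EOFError) as exc:
        raise ValueError("Uploaded file is not a valid gzip tarball.") from exc
    if file_count == 0:
        raise ValueError("Uploaded tarball contains no files.")
    return file_count


def make_tarball(files: dict) -> bytes:
    """Build an in-memory .tar.gz from {name: bytes}, for trying the check out."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()
```

For example, `validate_tarball(make_tarball({"a.txt": b"hello"}))` counts one file, while zero-length input, arbitrary non-archive bytes, and an archive with no members all raise `ValueError`.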

Category: Possible issue
Suggestion: Add exception handling for API call to enhance robustness

**Consider handling potential exceptions that might occur during the call_api method to ensure robustness.**

[py/packages/corpora_client/api/files_api.py [96-98]](https://github.com/skyl/corpora/pull/16/files#diff-2dbb7f40c110c352fe705d797c380a367b9088cf0ba3478ce7fb58ea5728f050R96-R98)

```diff
-response_data = self.api_client.call_api(
-    *_param, _request_timeout=_request_timeout
-)
+try:
+    response_data = self.api_client.call_api(
+        *_param, _request_timeout=_request_timeout
+    )
+except Exception as e:
+    # Handle exception
```

Suggestion importance[1-10]: 7
Why: Adding exception handling around the `call_api` method is a good practice to ensure robustness, especially in network operations where failures can occur.
Score: 7
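
The suggested diff leaves the `except` body unfinished and catches every `Exception`. One way to complete the idea is sketched below: catch only transport-level failures and re-raise them as a single client error with the cause chained. `ClientRequestError`, `call_with_handling`, and the two fake callables are hypothetical; which exceptions the real `call_api` raises is not shown in this PR, so the built-in `ConnectionError` and `TimeoutError` are used as an assumption.

```python
from typing import Any, Callable


class ClientRequestError(RuntimeError):
    """Hypothetical error type the client would expose to callers."""


def call_with_handling(
    call_api: Callable[..., Any], *params: Any, timeout: float = 10.0
) -> Any:
    """Invoke `call_api` and convert transport failures into one error type."""
    try:
        return call_api(*params, _request_timeout=timeout)
    except (ConnectionError, TimeoutError) as exc:
        # Keep the original exception chained so the root cause stays visible.
        raise ClientRequestError(f"request failed: {exc}") from exc


def fake_ok(*params: Any, _request_timeout: float) -> dict:
    """Fake transport that always succeeds and echoes its arguments."""
    return {"params": params, "timeout": _request_timeout}


def fake_down(*params: Any, _request_timeout: float) -> dict:
    """Fake transport that always fails, simulating an unreachable server."""
    raise ConnectionError("connection refused")
```

Narrowing the `except` clause means programming errors such as a `TypeError` still surface unchanged instead of being hidden behind a generic handler.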

Category: Performance
Suggestion: Validate the limit parameter in vector search to prevent performance issues

**Ensure that the limit parameter in the vector search is validated to prevent excessively large values that could impact performance.**

[py/packages/corpora/routers/split.py [28]](https://github.com/skyl/corpora/pull/16/files#diff-4c5d6ba850b922ef33f5226408145c898efb20761388266e1c007942a3d6eec2R28-R28)

```diff
-.order_by("similarity")[: payload.limit]
+.order_by("similarity")[: min(payload.limit, 100)]
```

Suggestion importance[1-10]: 7
Why: Limiting the number of results returned by the vector search is a practical measure to prevent performance degradation due to excessively large queries. This suggestion improves the performance and scalability of the application.
Score: 7
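
The `min(payload.limit, 100)` cap bounds the limit from above only; a zero or negative value would still reach the slice, and Django QuerySets do not support negative indexing. A small helper that clamps on both sides could look like the sketch below. The helper name and the choice to raise values below 1 up to 1 are assumptions; the cap of 100 mirrors the suggestion.

```python
MAX_SEARCH_LIMIT = 100  # same cap as in the suggestion


def clamp_limit(requested: int, maximum: int = MAX_SEARCH_LIMIT) -> int:
    """Return a slice bound between 1 and `maximum` inclusive."""
    return max(1, min(requested, maximum))
```

With this helper, `clamp_limit(5)` stays at 5, `clamp_limit(1000)` becomes 100, and `clamp_limit(-3)` becomes 1. Rejecting out-of-range values with a validation error instead of silently clamping them is an equally reasonable design.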

Category: Maintainability
Suggestion: Remove unnecessary print statements from test cases to keep output clean

**Remove the print(response.content) statement as it is unnecessary and could clutter test output.**

[py/packages/corpora/routers/test_split.py [61]](https://github.com/skyl/corpora/pull/16/files#diff-6585c521dd30b7828fefa859a692f3234395252d229e08a7ea8fec68581d0834R61-R61)

```diff
-print(response.content)
 assert response.status_code == 200
```

Suggestion importance[1-10]: 5
Why: Removing unnecessary print statements from test cases helps maintain clean and readable test outputs, which is beneficial for maintainability and debugging.
Score: 5