skyl / corpora

Corpora is a self-building corpus that can help build other arbitrary corpora
GNU Affero General Public License v3.0

feature(embeddings): vectorize corpora corpus #15

Closed · skyl closed 2 weeks ago

skyl commented 3 weeks ago

PR Type

enhancement, tests, documentation


Description


Changes walkthrough 📝

Relevant files
Enhancement
10 files
admin.py
Refactor admin fieldsets and remove vector fields               

py/packages/corpora/admin.py
  • Removed vector_of_summary field from CorpusTextFileAdmin.
  • Removed vector field from SplitAdmin.
  • Adjusted fieldsets for better organization.
  • +6/-4     
0007_alter_split_vector.py
Migration to update vector field in Split model

py/packages/corpora/migrations/0007_alter_split_vector.py
  • Added migration to alter vector field in Split model.
  • Updated field to use pgvector.django.vector.VectorField with 1536 dimensions.
  • +24/-0
0008_alter_corpustextfile_vector_of_summary.py
Migration to update vector_of_summary field in CorpusTextFile

py/packages/corpora/migrations/0008_alter_corpustextfile_vector_of_summary.py
  • Added migration to alter vector_of_summary field in CorpusTextFile model.
  • Updated field to use pgvector.django.vector.VectorField with 1536 dimensions.
  • +24/-0
models.py
Enhance models with vectorization and content splitting

py/packages/corpora/models.py
  • Updated vector_of_summary and vector fields to 1536 dimensions.
  • Added methods for summarizing and vectorizing content.
  • Introduced content splitting functionality.
  • +79/-2
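
The summarize-then-vectorize flow added to the models can be sketched framework-free. The review comment below names `get_and_save_summary` and `get_and_save_vector_of_summary` on `CorpusTextFile`; everything else here (`StubLLMProvider`, the truncation-based summary, the class name) is an illustrative stand-in, not the actual Django implementation:

```python
class StubLLMProvider:
    """Stand-in for the real LLM provider loaded via load_llm_provider()."""

    def get_summary(self, text: str) -> str:
        # A real provider would call an LLM; truncation is a placeholder.
        return text[:50]

    def get_embedding(self, text: str) -> list[float]:
        # A real provider returns a 1536-dimensional embedding vector.
        return [0.0] * 1536


class CorpusTextFileSketch:
    """Plain-Python stand-in for the Django CorpusTextFile model."""

    def __init__(self, content: str):
        self.content = content
        self.summary = ""
        self.vector_of_summary: list[float] = []

    def get_and_save_summary(self, llm=None) -> None:
        # Accepting an injected llm addresses the testability concern
        # raised in the review below.
        llm = llm or StubLLMProvider()
        self.summary = llm.get_summary(self.content)

    def get_and_save_vector_of_summary(self, llm=None) -> None:
        llm = llm or StubLLMProvider()
        self.vector_of_summary = llm.get_embedding(self.summary)
```

The optional `llm` parameter is one way to inject the provider dependency rather than loading it inside the method.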
tasks.py
Implement tasks for summarization and vectorization

py/packages/corpora/tasks.py
  • Added tasks for generating summaries and vectors.
  • Implemented content splitting task.
  • Enhanced tarball processing to trigger new tasks.
  • +38/-14
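
The tarball-to-task fan-out can be modeled without Celery. The task names mirror those discussed in this PR (`generate_summary_task`, `split_file_task`, `process_tarball`), but the sequential calls below only illustrate the per-file ordering that the real code enforces with Celery's `chain(...).apply_async()`:

```python
def generate_summary_task(file_id: str, log: list) -> str:
    # Stand-in for the Celery task that summarizes one file.
    log.append(f"summarized {file_id}")
    return file_id

def split_file_task(file_id: str, log: list) -> str:
    # Stand-in for the Celery task that splits one file into chunks.
    log.append(f"split {file_id}")
    return file_id

def process_tarball(file_ids: list[str]) -> list[str]:
    """Fan out a summarize -> split pipeline for each file in a tarball."""
    log: list[str] = []
    for file_id in file_ids:
        # Celery would run these asynchronously via chain(...);
        # calling them in sequence models the same per-file ordering.
        split_file_task(generate_summary_task(file_id, log), log)
    return log
```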
count_tokens.py
Add token counting utility function

py/packages/corpora_ai/count_tokens.py
  • Introduced function to count tokens using tiktoken.
  • Supports token counting for specific models.
  • +20/-0
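
A hedged sketch of such a helper (the exact signature and default model in `count_tokens.py` are assumptions): it prefers tiktoken's model-specific encoding and degrades to a rough whitespace count if tiktoken or its encoding data is unavailable:

```python
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens in text, approximating if tiktoken is unusable."""
    try:
        import tiktoken
        try:
            enc = tiktoken.encoding_for_model(model)
        except KeyError:
            # Unknown model name: fall back to a common encoding.
            enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except Exception:
        # Rough approximation when tiktoken cannot be used at all.
        return len(text.split())
```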
llm_interface.py
Update LLM interface with embedding and summary methods

py/packages/corpora_ai/llm_interface.py
  • Renamed generate_embedding to get_embedding.
  • Added method for generating text summaries.
  • +26/-1
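
After the rename, the interface might look like the sketch below. Only the method names `get_embedding` and `get_summary` come from the walkthrough; the base-class name `LLMBaseInterface` and the toy `EchoLLM` implementation are assumptions for illustration:

```python
from abc import ABC, abstractmethod

class LLMBaseInterface(ABC):
    """Hypothetical abstract contract for LLM providers."""

    @abstractmethod
    def get_embedding(self, text: str) -> list[float]:
        """Return an embedding vector for the given text."""

    @abstractmethod
    def get_summary(self, text: str) -> str:
        """Return a short summary of the given text."""


class EchoLLM(LLMBaseInterface):
    """Trivial implementation used only to show the contract."""

    def get_embedding(self, text: str) -> list[float]:
        return [float(len(text))]

    def get_summary(self, text: str) -> str:
        return text[:20]
```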
prompts.py
Introduce summarization prompt message

py/packages/corpora_ai/prompts.py
  • Added system message for summarization prompts.
  • +9/-0
split.py
Add utility for text splitting based on file type

py/packages/corpora_ai/split.py
  • Added utility to determine appropriate text splitter based on file type.
  • Supports Python and Markdown file splitting.
  • +41/-0
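
The dispatch idea can be sketched without langchain-text-splitters. The lambdas below are crude stand-ins for the real Python- and Markdown-aware splitters, and the fixed-size default mimics the `CharacterTextSplitter` fallback that the review flags later:

```python
def get_text_splitter(filename: str):
    """Return a splitter callable chosen by file extension (sketch)."""
    splitters = {
        ".py": lambda text: text.split("\n\n"),  # stand-in for a Python-aware splitter
        ".md": lambda text: text.split("\n# "),  # stand-in for a Markdown splitter
    }
    for ext, splitter in splitters.items():
        if filename.endswith(ext):
            return splitter
    # Default: fixed-size character chunks, like CharacterTextSplitter.
    return lambda text: [text[i:i + 40] for i in range(0, len(text), 40)]
```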
llm_client.py
Update method name for embedding generation

py/packages/corpora_ai_openai/llm_client.py
  • Renamed `generate_embedding` to `get_embedding`.
  • +1/-1
Tests
2 files
test_provider_loader.py
Enhance test for OpenAI provider loading

py/packages/corpora_ai/test_provider_loader.py
  • Updated test to check for missing OpenAI API key.
  • +1/-1
test_llm_client.py
Adjust tests for updated embedding method

py/packages/corpora_ai_openai/test_llm_client.py
  • Updated tests to reflect changes in method names.
  • +7/-7
Documentation
2 files
celery-tasks.md
Document Celery task methods and usage

md/notes/celery-tasks.md
  • Added detailed notes on Celery task methods.
  • Included examples and usage notes for task management.
  • +217/-0
practical-embeddings-tutorial.md
Add tutorial on embeddings and dimensionality strategies

md/notes/practical-embeddings-tutorial.md
  • Added tutorial on embeddings with text-embedding-3-small.
  • Discussed strategies for different corpora and dimensionality trade-offs.
  • +102/-0
Dependencies
1 file
requirements.txt
Update requirements with new dependencies

py/requirements.txt
  • Added `langchain-text-splitters` and `tiktoken` dependencies.
  • +3/-0

💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

github-actions[bot] commented 3 weeks ago

PR Reviewer Guide 🔍

(Review updated until commit https://github.com/skyl/corpora/commit/5b9209deca26f7a9b260b9a7b5dce92288a4acdb)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Code Smell
The methods `get_and_save_summary` and `get_and_save_vector_of_summary` in the `CorpusTextFile` class load the LLM provider directly within the method. Consider injecting the dependency or using a service layer to improve testability and separation of concerns.

Code Smell
The `process_tarball` function in `tasks.py` directly calls `generate_summary_task` and `split_file_task` for each file. Consider handling exceptions or failures in these tasks to ensure the robustness and reliability of the task processing pipeline.

Possible Bug
In `get_text_splitter`, the function defaults to `CharacterTextSplitter` if no specific splitter is found. Ensure that this default behavior is appropriate for all file types that might be processed, as it might lead to unexpected results for unsupported formats.
github-actions[bot] commented 3 weeks ago

PR Code Suggestions ✨

Latest suggestions up to 5b9209d

Explore these optional code suggestions:

Possible issue
Add exception handling for potential errors during LLM provider loading

Consider handling exceptions that might occur during the load_llm_provider call to prevent the application from crashing if the provider fails to load.

[py/packages/corpora/models.py [88-89]](https://github.com/skyl/corpora/pull/15/files#diff-c9de5374aa987d761f770eb99739e686761246600fc6cc0155902d79c9aa3ea5R88-R89)

```diff
-llm = load_llm_provider()
-summary = llm.get_summary(self._get_text_representation())
+try:
+    llm = load_llm_provider()
+    summary = llm.get_summary(self._get_text_representation())
+except Exception as e:
+    # Handle exception, e.g., log error or set a default summary
```

Suggestion importance[1-10]: 8

Why: Adding exception handling for the LLM provider loading is crucial to prevent application crashes and ensure robustness. This suggestion directly addresses a potential runtime issue, making it highly relevant and impactful.
Enhancement
Ensure the input text is a valid non-empty string before generating embeddings

Validate that the text parameter is not only non-empty but also a valid string to prevent unexpected errors during embedding generation.

[py/packages/corpora_ai_openai/llm_client.py [29-30]](https://github.com/skyl/corpora/pull/15/files#diff-de8da8414122015059375c328610dcc9a2d9550504ca03bfaa97ec5eed468407R29-R30)

```diff
-if not text:
-    raise ValueError("Input text must not be empty.")
+if not isinstance(text, str) or not text.strip():
+    raise ValueError("Input text must be a non-empty string.")
```

Suggestion importance[1-10]: 7

Why: Enhancing the validation of the `text` parameter to check for a valid non-empty string improves robustness and prevents potential errors during embedding generation. This suggestion is a valuable enhancement to the code's reliability.
Update package versions to the latest stable releases for improved compatibility and security

Consider specifying a more recent version for langchain-text-splitters and tiktoken to ensure compatibility with the latest features and security patches.

[py/requirements.txt [8-9]](https://github.com/skyl/corpora/pull/15/files#diff-6aeed0b3fd7a115160945fdabc2be76056d10a8afdea93befd40d1fc05448e7cR8-R9)

```diff
-langchain-text-splitters==0.3.2
-tiktoken==0.8.0
+langchain-text-splitters==0.3.3
+tiktoken==0.8.1
```

Suggestion importance[1-10]: 5

Why: Updating package versions can enhance compatibility and security by incorporating the latest features and patches. However, the suggestion assumes newer versions exist without verifying their availability or compatibility with the existing codebase, which slightly reduces its impact.

Previous suggestions

Suggestions up to commit e632d02
Possible bug
Check for None when extracting files from a tarball to prevent errors

Ensure that the tar.extractfile(member) result is checked for None before attempting to read, to prevent potential AttributeError.

[py/packages/corpora/tasks.py [16-17]](https://github.com/skyl/corpora/pull/15/files#diff-6a3e6d24567e31ebab17b57d0b74be2bbbc4464b259e794e21ecff1a793c91c1R16-R17)

```diff
-file_content = (
-    tar.extractfile(member).read().decode("utf-8", errors="replace")
-)
+extracted_file = tar.extractfile(member)
+if extracted_file is not None:
+    file_content = extracted_file.read().decode("utf-8", errors="replace")
```

Suggestion importance[1-10]: 8

Why: This suggestion prevents potential `AttributeError` by ensuring that the extracted file is not `None` before attempting to read it. This is a critical fix to avoid runtime errors when processing tarball files.
Validate the response structure from the OpenAI API to prevent errors

Validate that the response from the OpenAI API contains the expected data structure before accessing elements to prevent potential IndexError.

[py/packages/corpora_ai_openai/llm_client.py [29-31]](https://github.com/skyl/corpora/pull/15/files#diff-de8da8414122015059375c328610dcc9a2d9550504ca03bfaa97ec5eed468407R29-R31)

```diff
-return response.data[0].embedding
+if response.data and len(response.data) > 0:
+    return response.data[0].embedding
+else:
+    raise ValueError("Unexpected response structure from OpenAI API")
```

Suggestion importance[1-10]: 8

Why: Validating the response structure from the OpenAI API before accessing its elements prevents potential `IndexError`, ensuring that the application handles unexpected API responses gracefully. This is an important improvement for error handling.
Possible issue
Add exception handling when loading the LLM provider to improve robustness

Consider handling potential exceptions when calling load_llm_provider() to ensure the application can gracefully handle any issues with loading the LLM provider.

[py/packages/corpora/models.py [84-86]](https://github.com/skyl/corpora/pull/15/files#diff-c9de5374aa987d761f770eb99739e686761246600fc6cc0155902d79c9aa3ea5R84-R86)

```diff
-llm = load_llm_provider()
-summary = llm.get_summary(self._get_text_representation())
+try:
+    llm = load_llm_provider()
+    summary = llm.get_summary(self._get_text_representation())
+except Exception as e:
+    # Handle exception, e.g., log error or set a default summary
```

Suggestion importance[1-10]: 7

Why: Adding exception handling when loading the LLM provider increases the robustness of the application by preventing crashes due to unforeseen errors during the provider loading process. This is a valuable enhancement for maintaining application stability.
Enhancement
Add logging to track the execution of task chains for better monitoring

Consider logging the start and completion of each task in the task chain to aid in debugging and monitoring task execution.

[py/packages/corpora/tasks.py [28-31]](https://github.com/skyl/corpora/pull/15/files#diff-6a3e6d24567e31ebab17b57d0b74be2bbbc4464b259e794e21ecff1a793c91c1R28-R31)

```diff
+logger.info(f"Starting task chain for corpus file {corpus_file.id}")
 chain(
     generate_summary_task.s(corpus_file.id),
     split_file_task.s(corpus_file.id),
 ).apply_async()
+logger.info(f"Task chain for corpus file {corpus_file.id} completed")
```

Suggestion importance[1-10]: 5

Why: Adding logging for task chain execution can aid in debugging and monitoring, providing insights into task progress and completion. However, it is a moderate enhancement as it does not directly affect functionality or correctness.
skyl commented 2 weeks ago

/review

skyl commented 2 weeks ago

/describe

github-actions[bot] commented 2 weeks ago

Persistent review updated to latest commit https://github.com/skyl/corpora/commit/5b9209deca26f7a9b260b9a7b5dce92288a4acdb

github-actions[bot] commented 2 weeks ago

PR Description updated to latest commit (https://github.com/skyl/corpora/commit/5b9209deca26f7a9b260b9a7b5dce92288a4acdb)

skyl commented 2 weeks ago

I skimped on the tests a little bit, but we will harden them later once we finish some of the core features and stabilize... or I'll take it in the next PR.

> The method get_and_save_summary and get_and_save_vector_of_summary in CorpusTextFile class directly load the LLM provider within the method. Consider injecting the dependency or using a service layer to improve testability and separation of concerns.

I agree... I'll figure something out better later.