skyl / corpora

Corpora is a self-building corpus that can help build other arbitrary corpora
GNU Affero General Public License v3.0

feature(embeddings): vectorize corpora corpus #15

Closed · skyl closed 2 weeks ago

skyl commented 3 weeks ago

PR Type

enhancement, tests, documentation


Description


Changes walkthrough 📝

Relevant files
Enhancement
10 files
admin.py
Refactor admin fieldsets and remove vector fields               

py/packages/corpora/admin.py
  • Removed vector_of_summary field from CorpusTextFileAdmin.
  • Removed vector field from SplitAdmin.
  • Adjusted fieldsets for better organization.
  • +6/-4     
0007_alter_split_vector.py
Migration to update vector field in Split model

py/packages/corpora/migrations/0007_alter_split_vector.py
  • Added migration to alter vector field in Split model.
  • Updated field to use pgvector.django.vector.VectorField with 1536 dimensions.
  • +24/-0
0008_alter_corpustextfile_vector_of_summary.py
Migration to update vector_of_summary field in CorpusTextFile

py/packages/corpora/migrations/0008_alter_corpustextfile_vector_of_summary.py
  • Added migration to alter vector_of_summary field in CorpusTextFile model.
  • Updated field to use pgvector.django.vector.VectorField with 1536 dimensions.
  • +24/-0
models.py
Enhance models with vectorization and content splitting

py/packages/corpora/models.py
  • Updated vector_of_summary and vector fields to 1536 dimensions.
  • Added methods for summarizing and vectorizing content.
  • Introduced content splitting functionality.
  • +79/-2
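
The summarize-then-vectorize flow added to the models can be sketched framework-free. The review comment below names `get_and_save_summary` and `get_and_save_vector_of_summary` on `CorpusTextFile`; everything else here (`StubLLMProvider`, the truncation-based summary, the class name) is an illustrative stand-in, not the actual Django implementation:

```python
class StubLLMProvider:
    """Stand-in for the real LLM provider loaded via load_llm_provider()."""

    def get_summary(self, text: str) -> str:
        # A real provider would call an LLM; truncation is a placeholder.
        return text[:50]

    def get_embedding(self, text: str) -> list[float]:
        # A real provider returns a 1536-dimensional embedding vector.
        return [0.0] * 1536


class CorpusTextFileSketch:
    """Plain-Python stand-in for the Django CorpusTextFile model."""

    def __init__(self, content: str):
        self.content = content
        self.summary = ""
        self.vector_of_summary: list[float] = []

    def get_and_save_summary(self, llm=None) -> None:
        # Accepting an injected llm addresses the testability concern
        # raised in the review below.
        llm = llm or StubLLMProvider()
        self.summary = llm.get_summary(self.content)

    def get_and_save_vector_of_summary(self, llm=None) -> None:
        llm = llm or StubLLMProvider()
        self.vector_of_summary = llm.get_embedding(self.summary)
```

The optional `llm` parameter is one way to inject the provider dependency rather than loading it inside the method.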
tasks.py
Implement tasks for summarization and vectorization

py/packages/corpora/tasks.py
  • Added tasks for generating summaries and vectors.
  • Implemented content splitting task.
  • Enhanced tarball processing to trigger new tasks.
  • +38/-14
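
The tarball-to-task fan-out can be modeled without Celery. The task names mirror those discussed in this PR (`generate_summary_task`, `split_file_task`, `process_tarball`), but the sequential calls below only illustrate the per-file ordering that the real code enforces with Celery's `chain(...).apply_async()`:

```python
def generate_summary_task(file_id: str, log: list) -> str:
    # Stand-in for the Celery task that summarizes one file.
    log.append(f"summarized {file_id}")
    return file_id

def split_file_task(file_id: str, log: list) -> str:
    # Stand-in for the Celery task that splits one file into chunks.
    log.append(f"split {file_id}")
    return file_id

def process_tarball(file_ids: list[str]) -> list[str]:
    """Fan out a summarize -> split pipeline for each file in a tarball."""
    log: list[str] = []
    for file_id in file_ids:
        # Celery would run these asynchronously via chain(...);
        # calling them in sequence models the same per-file ordering.
        split_file_task(generate_summary_task(file_id, log), log)
    return log
```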
count_tokens.py
Add token counting utility function

py/packages/corpora_ai/count_tokens.py
  • Introduced function to count tokens using tiktoken.
  • Supports token counting for specific models.
  • +20/-0
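
A hedged sketch of such a helper (the exact signature and default model in `count_tokens.py` are assumptions): it prefers tiktoken's model-specific encoding and degrades to a rough whitespace count if tiktoken or its encoding data is unavailable:

```python
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens in text, approximating if tiktoken is unusable."""
    try:
        import tiktoken
        try:
            enc = tiktoken.encoding_for_model(model)
        except KeyError:
            # Unknown model name: fall back to a common encoding.
            enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except Exception:
        # Rough approximation when tiktoken cannot be used at all.
        return len(text.split())
```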
llm_interface.py
Update LLM interface with embedding and summary methods

py/packages/corpora_ai/llm_interface.py
  • Renamed generate_embedding to get_embedding.
  • Added method for generating text summaries.
  • +26/-1
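
After the rename, the interface might look like the sketch below. Only the method names `get_embedding` and `get_summary` come from the walkthrough; the base-class name `LLMBaseInterface` and the toy `EchoLLM` implementation are assumptions for illustration:

```python
from abc import ABC, abstractmethod

class LLMBaseInterface(ABC):
    """Hypothetical abstract contract for LLM providers."""

    @abstractmethod
    def get_embedding(self, text: str) -> list[float]:
        """Return an embedding vector for the given text."""

    @abstractmethod
    def get_summary(self, text: str) -> str:
        """Return a short summary of the given text."""


class EchoLLM(LLMBaseInterface):
    """Trivial implementation used only to show the contract."""

    def get_embedding(self, text: str) -> list[float]:
        return [float(len(text))]

    def get_summary(self, text: str) -> str:
        return text[:20]
```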
prompts.py
Introduce summarization prompt message

py/packages/corpora_ai/prompts.py
  • Added system message for summarization prompts.
  • +9/-0
split.py
Add utility for text splitting based on file type

py/packages/corpora_ai/split.py
  • Added utility to determine appropriate text splitter based on file type.
  • Supports Python and Markdown file splitting.
  • +41/-0
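
The dispatch idea can be sketched without langchain-text-splitters. The lambdas below are crude stand-ins for the real Python- and Markdown-aware splitters, and the fixed-size default mimics the `CharacterTextSplitter` fallback that the review flags later:

```python
def get_text_splitter(filename: str):
    """Return a splitter callable chosen by file extension (sketch)."""
    splitters = {
        ".py": lambda text: text.split("\n\n"),  # stand-in for a Python-aware splitter
        ".md": lambda text: text.split("\n# "),  # stand-in for a Markdown splitter
    }
    for ext, splitter in splitters.items():
        if filename.endswith(ext):
            return splitter
    # Default: fixed-size character chunks, like CharacterTextSplitter.
    return lambda text: [text[i:i + 40] for i in range(0, len(text), 40)]
```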
llm_client.py
Update method name for embedding generation

py/packages/corpora_ai_openai/llm_client.py
  • Renamed `generate_embedding` to `get_embedding`.
  • +1/-1
Tests
2 files
test_provider_loader.py
Enhance test for OpenAI provider loading

py/packages/corpora_ai/test_provider_loader.py
  • Updated test to check for missing OpenAI API key.
  • +1/-1
test_llm_client.py
Adjust tests for updated embedding method

py/packages/corpora_ai_openai/test_llm_client.py
  • Updated tests to reflect changes in method names.
  • +7/-7
Documentation
2 files
celery-tasks.md
Document Celery task methods and usage

md/notes/celery-tasks.md
  • Added detailed notes on Celery task methods.
  • Included examples and usage notes for task management.
  • +217/-0
practical-embeddings-tutorial.md
Add tutorial on embeddings and dimensionality strategies

md/notes/practical-embeddings-tutorial.md
  • Added tutorial on embeddings with text-embedding-3-small.
  • Discussed strategies for different corpora and dimensionality trade-offs.
  • +102/-0
Dependencies
1 file
requirements.txt
Update requirements with new dependencies

py/requirements.txt
  • Added `langchain-text-splitters` and `tiktoken` dependencies.
  • +3/-0

💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

github-actions[bot] commented 3 weeks ago

PR Reviewer Guide 🔍

(Review updated until commit https://github.com/skyl/corpora/commit/5b9209deca26f7a9b260b9a7b5dce92288a4acdb)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Code Smell
The methods `get_and_save_summary` and `get_and_save_vector_of_summary` in the `CorpusTextFile` class load the LLM provider directly within the method. Consider injecting the dependency or using a service layer to improve testability and separation of concerns.

Code Smell
The `process_tarball` function in `tasks.py` directly calls `generate_summary_task` and `split_file_task` for each file. Consider handling exceptions or failures in these tasks to ensure the robustness and reliability of the task processing pipeline.

Possible Bug
In `get_text_splitter`, the function defaults to `CharacterTextSplitter` if no specific splitter is found. Ensure that this default behavior is appropriate for all file types that might be processed, as it might lead to unexpected results for unsupported formats.
github-actions[bot] commented 3 weeks ago

PR Code Suggestions ✨

Latest suggestions up to 5b9209d

Explore these optional code suggestions:

Possible issue
Add exception handling for potential errors during LLM provider loading

Consider handling exceptions that might occur during the load_llm_provider call to prevent the application from crashing if the provider fails to load.

[py/packages/corpora/models.py [88-89]](https://github.com/skyl/corpora/pull/15/files#diff-c9de5374aa987d761f770eb99739e686761246600fc6cc0155902d79c9aa3ea5R88-R89)

```diff
-llm = load_llm_provider()
-summary = llm.get_summary(self._get_text_representation())
+try:
+    llm = load_llm_provider()
+    summary = llm.get_summary(self._get_text_representation())
+except Exception as e:
+    # Handle exception, e.g., log error or set a default summary
```

Suggestion importance[1-10]: 8

Why: Adding exception handling for the LLM provider loading is crucial to prevent application crashes and ensure robustness. This suggestion directly addresses a potential runtime issue, making it highly relevant and impactful.
Enhancement
Ensure the input text is a valid non-empty string before generating embeddings

Validate that the text parameter is not only non-empty but also a valid string to prevent unexpected errors during embedding generation.

[py/packages/corpora_ai_openai/llm_client.py [29-30]](https://github.com/skyl/corpora/pull/15/files#diff-de8da8414122015059375c328610dcc9a2d9550504ca03bfaa97ec5eed468407R29-R30)

```diff
-if not text:
-    raise ValueError("Input text must not be empty.")
+if not isinstance(text, str) or not text.strip():
+    raise ValueError("Input text must be a non-empty string.")
```

Suggestion importance[1-10]: 7

Why: Enhancing the validation of the `text` parameter to check for a valid non-empty string improves robustness and prevents potential errors during embedding generation. This suggestion is a valuable enhancement to the code's reliability.
Update package versions to the latest stable releases for improved compatibility and security

Consider specifying a more recent version for langchain-text-splitters and tiktoken to ensure compatibility with the latest features and security patches.

[py/requirements.txt [8-9]](https://github.com/skyl/corpora/pull/15/files#diff-6aeed0b3fd7a115160945fdabc2be76056d10a8afdea93befd40d1fc05448e7cR8-R9)

```diff
-langchain-text-splitters==0.3.2
-tiktoken==0.8.0
+langchain-text-splitters==0.3.3
+tiktoken==0.8.1
```

Suggestion importance[1-10]: 5

Why: Updating package versions can enhance compatibility and security by incorporating the latest features and patches. However, the suggestion assumes newer versions exist without verifying their availability or compatibility with the existing codebase, which slightly reduces its impact.

Previous suggestions

Suggestions up to commit e632d02
Possible bug
Check for None when extracting files from a tarball to prevent errors

Ensure that the tar.extractfile(member) result is checked for None before attempting to read, to prevent potential AttributeError.

[py/packages/corpora/tasks.py [16-17]](https://github.com/skyl/corpora/pull/15/files#diff-6a3e6d24567e31ebab17b57d0b74be2bbbc4464b259e794e21ecff1a793c91c1R16-R17)

```diff
-file_content = (
-    tar.extractfile(member).read().decode("utf-8", errors="replace")
-)
+extracted_file = tar.extractfile(member)
+if extracted_file is not None:
+    file_content = extracted_file.read().decode("utf-8", errors="replace")
```

Suggestion importance[1-10]: 8

Why: This suggestion prevents potential `AttributeError` by ensuring that the extracted file is not `None` before attempting to read it. This is a critical fix to avoid runtime errors when processing tarball files.
Validate the response structure from the OpenAI API to prevent errors

Validate that the response from the OpenAI API contains the expected data structure before accessing elements to prevent potential IndexError.

[py/packages/corpora_ai_openai/llm_client.py [29-31]](https://github.com/skyl/corpora/pull/15/files#diff-de8da8414122015059375c328610dcc9a2d9550504ca03bfaa97ec5eed468407R29-R31)

```diff
-return response.data[0].embedding
+if response.data and len(response.data) > 0:
+    return response.data[0].embedding
+else:
+    raise ValueError("Unexpected response structure from OpenAI API")
```

Suggestion importance[1-10]: 8

Why: Validating the response structure from the OpenAI API before accessing its elements prevents potential `IndexError`, ensuring that the application handles unexpected API responses gracefully. This is an important improvement for error handling.
Possible issue
Add exception handling when loading the LLM provider to improve robustness

Consider handling potential exceptions when calling load_llm_provider() to ensure the application can gracefully handle any issues with loading the LLM provider.

[py/packages/corpora/models.py [84-86]](https://github.com/skyl/corpora/pull/15/files#diff-c9de5374aa987d761f770eb99739e686761246600fc6cc0155902d79c9aa3ea5R84-R86)

```diff
-llm = load_llm_provider()
-summary = llm.get_summary(self._get_text_representation())
+try:
+    llm = load_llm_provider()
+    summary = llm.get_summary(self._get_text_representation())
+except Exception as e:
+    # Handle exception, e.g., log error or set a default summary
```

Suggestion importance[1-10]: 7

Why: Adding exception handling when loading the LLM provider increases the robustness of the application by preventing crashes due to unforeseen errors during the provider loading process. This is a valuable enhancement for maintaining application stability.
Enhancement
Add logging to track the execution of task chains for better monitoring

Consider logging the start and completion of each task in the task chain to aid in debugging and monitoring task execution.

[py/packages/corpora/tasks.py [28-31]](https://github.com/skyl/corpora/pull/15/files#diff-6a3e6d24567e31ebab17b57d0b74be2bbbc4464b259e794e21ecff1a793c91c1R28-R31)

```diff
+logger.info(f"Starting task chain for corpus file {corpus_file.id}")
 chain(
     generate_summary_task.s(corpus_file.id),
     split_file_task.s(corpus_file.id),
 ).apply_async()
+logger.info(f"Task chain for corpus file {corpus_file.id} completed")
```

Suggestion importance[1-10]: 5

Why: Adding logging for task chain execution can aid in debugging and monitoring, providing insights into task progress and completion. However, it is a moderate enhancement as it does not directly affect functionality or correctness.
skyl commented 2 weeks ago

/review

skyl commented 2 weeks ago

/describe

github-actions[bot] commented 2 weeks ago

Persistent review updated to latest commit https://github.com/skyl/corpora/commit/5b9209deca26f7a9b260b9a7b5dce92288a4acdb

github-actions[bot] commented 2 weeks ago

PR Description updated to latest commit (https://github.com/skyl/corpora/commit/5b9209deca26f7a9b260b9a7b5dce92288a4acdb)

skyl commented 2 weeks ago

I skimped on the tests a little bit, but we will harden them later once we finish some of the core features and stabilize... or I'll take it in the next PR.

> The method get_and_save_summary and get_and_save_vector_of_summary in CorpusTextFile class directly load the LLM provider within the method. Consider injecting the dependency or using a service layer to improve testability and separation of concerns.

I agree... I'll figure something out better later.