ubiquity-os-marketplace / generate-vector-embeddings


Store issue if not stored already #13

Open · Keyrxng opened this issue 1 week ago

Keyrxng commented 1 week ago
  1. If the issue does not already exist in the DB, store it so that its comments can be stored (see the sketch below).
  2. Correct the error handling: when an error is thrown, an "ok" log shouldn't be shown, because the result is not ok.
  3. Post a comment on the issue stating that the plugin has failed?
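A minimal sketch of point 1, assuming a Supabase client and an issues table keyed by the GitHub node_id; the table/column names are assumptions and the error handling reflects point 2 (fail loudly instead of logging "ok"):

```typescript
// Sketch only: table and column names are assumptions.
import { SupabaseClient } from "@supabase/supabase-js";

async function ensureIssueStored(supabase: SupabaseClient, issueNodeId: string, plaintext: string) {
  // Check whether the issue already exists before inserting it.
  const { data: existing, error: selectError } = await supabase
    .from("issues")
    .select("id")
    .eq("node_id", issueNodeId)
    .maybeSingle();
  if (selectError) throw selectError; // surface the failure rather than logging "ok"

  if (!existing) {
    const { error: insertError } = await supabase
      .from("issues")
      .insert({ node_id: issueNodeId, plaintext });
    if (insertError) throw insertError;
  }
}
```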


0x4007 commented 1 week ago

> Post a comment on the issue stating that the plugin has failed?

No, the rest seems fine though.

Honestly, maybe it does make sense to make a script to reinitialize the database with all the issues, in case of new database starts or embedding-generation failures?

I want to generalize our logic for every context (GitHub comments, Telegram messages, Google Drive documents, etc.), so I am skeptical of the initialization approach.

Keyrxng commented 1 week ago

I suggest we populate issues as comments are posted, if needed. Right now it's unnecessary to populate the DB with all issues upfront, especially when migration files drop tables and historic issues aren't required. Once this plugin is stable, though, pre-populating is a must; a backfill could look something like the sketch below.
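A rough backfill sketch, assuming an Octokit instance; storeIssueEmbedding is a hypothetical helper for writing the row and embedding, and all names are illustrative:

```typescript
// Iterate every issue in a repo and hand each one to a (hypothetical) storage helper.
import { Octokit } from "@octokit/rest";

declare function storeIssueEmbedding(args: { sourceId: string; plaintext: string }): Promise<void>;

export async function backfillIssues(octokit: Octokit, owner: string, repo: string) {
  // Paginate through all issues (open and closed) in the repository.
  const issues = await octokit.paginate(octokit.rest.issues.listForRepo, {
    owner,
    repo,
    state: "all",
    per_page: 100,
  });

  for (const issue of issues) {
    if (issue.pull_request) continue; // the issues endpoint also returns PRs; skip them
    await storeIssueEmbedding({ sourceId: issue.node_id, plaintext: issue.body ?? "" });
  }
}
```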

Use Cases:

  1. Chatbot for DAO and Products: A high-level assistant to answer questions on GitHub and Telegram by chunking docs and using embeddings for semantic search. We could train an assistant specifically for this to avoid hitting the DB for every query.

  2. Dev Onboarding: Aware of repos, setup instructions, readmes, and recent PRs/issues. Codebase handling is more complex due to size, requiring chunking for larger files.

  3. Personal Assistants: Simulate developer responses and actions, and enforce org rules using comment embeddings.

Current Schema Observations:

issues:

issue_comments:


Proposed Solution

Originally I thought splitting tables might be the best approach, but we want to limit the number of tables to make interacting with them easier, so let's adopt a unified content table (name is debatable) to store all text-based data with the following structure:

Table: content
- id            INT PRIMARY KEY
- source_id     VARCHAR            -- Original ID from the source ('node_id' , 'chat_id', etc.)
- type          VARCHAR            -- Content type ('issue', 'comment', 'message', etc.)
- plaintext     TEXT               -- Sanitized content
- embedding     VECTOR             -- Embedding vector for semantic search
- metadata      JSON               -- Additional info (author, association, repo_id, fileChunkCount, fileChunkIndex, filePath etc.)
- created_at    TIMESTAMP
- modified_at   TIMESTAMP
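For reference, here is a sketch of what one row could look like as a TypeScript type; the field names mirror the listing above, and the metadata keys are only examples:

```typescript
// Sketch of a row in the proposed unified `content` table.
interface ContentRecord {
  id: number;
  source_id: string;                 // original ID from the source ('node_id', 'chat_id', etc.)
  type: string;                      // 'issue' | 'comment' | 'message' | 'setup_instruction' | ...
  plaintext: string;                 // sanitized content
  embedding: number[];               // embedding vector for semantic search
  metadata: Record<string, unknown>; // author, association, repo_id, fileChunkCount, filePath, ...
  created_at: string;
  modified_at: string;
}
```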

Advantages:


Use Case Example and Implementation

Use Case: Developer Onboarding Assistant

Implementation Steps:

  1. Data Ingestion:

    • Collect README files, setup guides, and onboarding documents.
    • Store each as a record in the content table with type set to setup_instruction.
  2. Embedding Generation:

    • Generate embeddings for the plaintext of each record.
    • Store embeddings in the embedding field for semantic search.
  3. Query Handling:

    • When a user asks, "Help me set up the kernel," convert the query into an embedding.
    • Search the content table for records whose embeddings are similar to the query and whose type is relevant (here, setup_instruction). We run NLP on the input to obtain a type classification and filter on it, which allows a single 'chatbot' to handle anything from basic onboarding help to in-depth task assistance (see the sketch after this list).
  4. Response Generation:

    • Retrieve the most relevant content.
    • Feed the information to the language model to generate a coherent guide for the user.
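A minimal sketch of steps 2-4, assuming a Supabase/pgvector backend and the OpenAI SDK; the match_content RPC, model names, and result shape are illustrative assumptions, not the plugin's actual API:

```typescript
// Embed the query, run a similarity search filtered by type, then answer from the matches.
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const openai = new OpenAI();
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

async function answerQuery(query: string, type = "setup_instruction") {
  // 1. Convert the user query into an embedding.
  const { data: [{ embedding }] } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });

  // 2. Semantic search over the `content` table, filtered by the classified type
  //    (hypothetical Postgres function exposed as a Supabase RPC).
  const { data: matches } = await supabase.rpc("match_content", {
    query_embedding: embedding,
    content_type: type,
    match_count: 5,
  });

  // 3. Feed the retrieved plaintext to the language model to generate a coherent guide.
  const context = (matches ?? []).map((m: { plaintext: string }) => m.plaintext).join("\n---\n");
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "Answer using only the provided context." },
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${query}` },
    ],
  });
  return completion.choices[0].message.content;
}
```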

Benefits:


Our early bot will be fairly basic, requiring manual context filtering. However, we can automate this over time by AI-ifying the process, for example by classifying the user's query automatically instead of filtering context by hand.

Our classification schema will essentially form the backbone of the chatbot’s knowledge base. Each class will contain relevant text fed into the system based on the user query.

Starting with setup_instructions (e.g., repos, readmes) and dao_info (e.g., products, services, onboarding) makes sense as they are document-based, easy to test, and more coherent for initial chatbot builds.

On the other hand, handling more task-specific queries, like those related to the codebase or tasks, will require a different approach. These queries will need broader context, such as entire task conversations and detailed codebase knowledge, rather than just individual comment embeddings. Comment embeddings alone may be too narrow, as they capture literal strings like "do xyz and abc will happen", potentially losing the bigger picture.


Alternatively, if we continue as is, we'll need to handle documents (i.e., all of the Notion material for onboarding and DAO info), Telegram, etc. separately, with a new table for each new context, which will end up a mess and difficult to work with.

Instead, we centralize embeddings around a class system, which we can use to create "versions" of the bot (onboarding assistant, dao_info assistant, etc.) in the background, while the end user interacts with our one and only bot.

0x4007 commented 1 week ago

Consolidating to a single table seems like a good idea. It might be slightly wasteful with empty columns for the different types of data, but to keep it simpler to work with, I think this is fine.