Keyrxng opened 1 week ago
Post a comment on the issue stating that the plugin has failed?
No the rest seems fine though.
Honestly, maybe it does make sense to make a script that reinitializes the database with all the issues, in case of fresh database starts or embedding-generation failures?
I want to generalize our logic for every context, like GitHub comments, Telegram messages, Google Drive documents, etc., so I am skeptical of the initialization approach.
I suggest we populate issues as comments are posted, if needed. Right now it's unnecessary to populate the DB with all issues upfront, especially when migration files drop tables and historic issues aren't required at the moment. Once this plugin is stable, though, pre-populating is a must.
Chatbot for DAO and Products: A high-level assistant to answer questions on GitHub and Telegram by chunking docs and using embeddings for semantic search. We could train an assistant specifically for this to avoid hitting the DB for every query.
Dev Onboarding: Aware of repos, setup instructions, readmes, and recent PRs/issues. Codebase handling is more complex due to size, requiring chunking for larger files.
Personal Assistants: Simulate developer responses and actions, and enforce org rules using comment embeddings.
issues:
node_id | plaintext | embedding | payload | author_id | markdown | created_at | modified_at

`payload` is unused and redundant. `plaintext` should be preferred over `markdown` for sanitized input.

issue_comments:
ID | created_at | modified_at | Embedding | payload | author_id | plaintext | issue_id | markdown

`payload` is unnecessary; we only need `author_association`.

Originally I thought splitting tables might be the best approach, but we want to limit these to make interacting easier, so I think let's adopt a unified `content` table (name is debatable) to store all text-based data with the following structure:
Table: content
- id INT PRIMARY KEY
- source_id VARCHAR -- Original ID from the source ('node_id' , 'chat_id', etc.)
- type VARCHAR -- Content type ('issue', 'comment', 'message', etc.)
- plaintext TEXT -- Sanitized content
- embedding VECTOR -- Embedding vector for semantic search
- metadata JSON -- Additional info (author, association, repo_id, fileChunkCount, fileChunkIndex, filePath etc.)
- created_at TIMESTAMP
- modified_at TIMESTAMP
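For illustration, the proposed row could look like this as a TypeScript type with a small builder for GitHub comments. This is a sketch of the schema above, not existing plugin code; `ContentRow` and `buildCommentRow` are hypothetical names:

```typescript
// Sketch of one row of the proposed unified `content` table.
type ContentType = "issue" | "comment" | "message" | "setup_instruction";

interface ContentRow {
  id?: number;                        // assigned by the database
  source_id: string;                  // node_id, chat_id, etc.
  type: ContentType;                  // content type used for filtering
  plaintext: string;                  // sanitized content
  embedding: number[];                // vector for semantic search
  metadata: Record<string, unknown>;  // author, association, repo_id, chunk info...
  created_at: string;
  modified_at: string;
}

// Build a row for a GitHub issue comment; the embedding is filled in later.
function buildCommentRow(nodeId: string, plaintext: string, authorAssociation: string): ContentRow {
  const now = new Date().toISOString();
  return {
    source_id: nodeId,
    type: "comment",
    plaintext,
    embedding: [],
    metadata: { author_association: authorAssociation },
    created_at: now,
    modified_at: now,
  };
}
```

Note that `author_association` lives in `metadata` rather than its own column, matching the point above that the full `payload` is unnecessary.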
Advantages:
Use Case: Developer Onboarding Assistant
Implementation Steps:
Data Ingestion: populate the `content` table with `type` set to `setup_instruction`.
Embedding Generation: generate an embedding from the `plaintext` of each record and store it in the `embedding` field for semantic search.
Query Handling: query the `content` table for records with embeddings similar to the query and a `type` relevant to `setup_instruction`. We run NLP on the input to obtain the `type` classification, then index based on that, which allows us to create a single "chatbot" that can handle anything from basic onboarding help to in-depth task assistance.
Response Generation:
Benefits:
The `type` field and `metadata` allow for easy filtering based on context.

Our early bot will be fairly basic, requiring manual context filtering. However, we can automate this over time by AI-ifying the process. For example:
`setup_instructions`, `dao_info`, `reviews`, `tasks`, `dao_members`.

Our classification schema will essentially form the backbone of the chatbot’s knowledge base. Each class will contain relevant text fed into the system based on the user query.
Starting with `setup_instructions` (e.g., repos, readmes) and `dao_info` (e.g., products, services, onboarding) makes sense, as they are document-based, easy to test, and more coherent for initial chatbot builds.
On the other hand, handling more task-specific queries, like those related to the codebase or tasks, will require a different approach. These queries will need broader context, such as entire task conversations and detailed codebase knowledge, rather than just individual comment embeddings. Comment embeddings alone may be too narrow, as they capture literal strings like "do xyz and abc will happen," potentially losing the bigger picture.
Alternatively, if we continue as is, we'll need to handle documents (i.e., all of the Notion stuff for onboarding and DAO info), Telegram, etc., all separately, with each new context getting a new table, which will end up a mess and be difficult to work with.
Instead, we centralize embeddings around a class system, which we can use to create "versions" of the bot, such as an onboarding assistant, a dao_info assistant, etc., in the background, while the end user interacts with our one and only bot.
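The retrieval side of this single-table design could be sketched as a cosine-similarity search over `content` rows, optionally restricted to one class. The row shape is assumed from the schema discussed above; `cosine` and `search` are illustrative names:

```typescript
// Sketch: semantic search over the proposed unified table, with an
// optional `type` filter acting as the class-based "version" of the bot.
interface Row { type: string; plaintext: string; embedding: number[] }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  // Guard against a zero denominator for empty or all-zero vectors.
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function search(rows: Row[], query: number[], type?: string, k = 3): Row[] {
  return rows
    .filter((r) => !type || r.type === type)
    .sort((x, y) => cosine(y.embedding, query) - cosine(x.embedding, query))
    .slice(0, k);
}
```

In production this filtering and ranking would happen inside the database (e.g., a vector index with a WHERE clause on `type`) rather than in application code.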
Consolidating to a single table seems like a good idea. It might be slightly wasteful, with empty columns for some types of data, but I think that's fine in order to keep it simple to work with.
The `ok` log shouldn't be shown if it's not `ok`.