ubiquity-os-marketplace / text-vector-embeddings


onboarding bot #17

Open · Keyrxng opened this issue 2 weeks ago

Keyrxng commented 2 weeks ago

Technically there will be two:

  1. Services and Products: embeddings generated from Notion docs, giving a high-level DAO overview.
  2. Developer onboarding: initially the focus will be on an org-wide understanding of all repo readmes; they'll be used for a basic setup and walkthrough guide for any given repo.

This could be split into two separate tasks or combined as one.

Number two is easy: we run on push events, identify any added or changed .md files, and we're done. Notion doc scanning isn't something we can just listen to a webhook for, I don't think. So maybe we could have a cron job run once every 30-60 days and parse the Notion docs? I'm sure we can grab the pages from the API with a valid API key.
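For illustration, the push-event side could be as simple as the sketch below (the payload shape follows GitHub's standard push webhook; typing it with `@octokit/webhooks-types` is an assumption about our handler wiring):

```ts
// Collect Markdown files that a push event added or modified.
import type { PushEvent } from "@octokit/webhooks-types";

function changedMarkdownFiles(payload: PushEvent): string[] {
  const files = new Set<string>();
  for (const commit of payload.commits) {
    for (const path of [...commit.added, ...commit.modified]) {
      if (path.endsWith(".md")) files.add(path);
    }
  }
  return [...files];
}
```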


Before we automate Notion, we need to decide:

0x4007 commented 2 weeks ago

Our Notion tends to fall out of date. The source of truth is on GitHub. It would be interesting to have a plugin that checks what seems to be inaccurate or out of date on Notion, but the architecture seems a bit unclear. Until that's cleared up, I wouldn't want to prioritize making the Notion embeddings plugin yet.

Same with READMEs. I think the better approach is to make a chatbot that you can Q&A with. Second, it could update the READMEs when it thinks something is not accurate.

Keyrxng commented 2 weeks ago

> Our Notion tends to fall out of date. The source of truth is on GitHub.

The simple solution is to move the Notion docs into a repo for easier syncing.

Creating Notion pages via the API is possible but involves a learning curve, unless GPT-4 can handle it, which it probably can. See the notion-github-sync example.

Alternatively, we could parse text from Notion using the API and convert it into Markdown docs.

Respond to Notion DB page change example: we could set something like this up to listen for Notion DB changes and re-run embeddings on the updated content.
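A minimal sketch of the cron approach using the official @notionhq/client SDK (pagination and error handling omitted; reEmbedPage is a hypothetical stand-in for our embedding pipeline):

```ts
// Scheduled job: find Notion pages edited since the last run and re-embed them.
import { Client } from "@notionhq/client";

const notion = new Client({ auth: process.env.NOTION_API_KEY });

async function reEmbedPage(pageId: string): Promise<void> {
  // hypothetical: fetch the page's blocks, convert to text, upsert its embedding
}

async function reEmbedChangedPages(since: Date): Promise<void> {
  const { results } = await notion.search({
    filter: { property: "object", value: "page" },
    sort: { direction: "descending", timestamp: "last_edited_time" },
  });
  for (const page of results) {
    if ("last_edited_time" in page && new Date(page.last_edited_time) > since) {
      await reEmbedPage(page.id);
    }
  }
}
```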


I understand that GitHub is the source of truth for plugins, active teams, project overviews, comments, etc. However, as far as I know, we don't have any onboarding docs on GitHub.

> I think the better approach is to make a chatbot that you can Q&A with.

What info would you feed the chatbot if not from Notion or READMEs? To handle Q&A on high-level org info like UbiquityOS, DevPool, Cards, and DeFi, you'd need more than task specs or comments.

> Second, it could update the READMEs when it thinks something is not accurate.

I wasn't referring to updating READMEs. They contain project intent, setup instructions, and references to our architecture, which are great for org-wide context (e.g., what plugins we have, how to install them). However, they don’t cover topics like DevPool, onboarding, recruitment, or investors, which is what Notion docs handle.


We need solid text chunks that explain things. Individual comments and task specs aren't enough for an org-aware chatbot. Each embedding references its text source; they aren't merged into a single context. So a user query becomes an embedding and gets compared against issue comments and task specs.

For example, "Help me set up the kernel" would likely return task conversations with little value. Similarly, "What is UbiquityOS?" would return technical details instead of a comprehensive overview.


Each vector has a size. Larger vectors store more info but are more computationally expensive. We're currently using 1,024 dimensions for all embeddings. That’s fine for small comments, but for entire conversations or codebases, you might want 3,072 dimensions to capture more context.
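For illustration, OpenAI's text-embedding-3-large accepts a dimensions parameter, so switching between 1,024 and 3,072 is a one-line change (assuming an OpenAI-style embedding API; the exact provider and model here are assumptions):

```ts
// Request an embedding at a chosen dimensionality.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function embed(text: string, dimensions: 1024 | 3072): Promise<number[]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-large", // natively 3,072 dims; can be truncated down
    input: text,
    dimensions,
  });
  return res.data[0].embedding;
}
```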

To build a good Q&A chatbot, we need embeddings from full documents, which is how traditional AI chatbots are made. Notion docs and READMEs are already written and could easily power a V1 Q&A chatbot.


Eventually, the best approach is to use the embeddings to train or fine-tune our own model so that it has this knowledge built-in, rather than fetching it in real-time. The DAO info would be ideal for this—letting the model start with foundational knowledge and use embeddings for context.

Question: Is the vision to have a single chatbot entry (e.g., /chatbot query...) across platforms like TG, GitHub, or ubq.fi, capable of handling everything? Or are you thinking of separate bots for specific purposes, like one handling DAO queries only?

Keyrxng commented 2 weeks ago

@sshivaditya2019 request for comment regarding chatbot creation, knowledge base, etc. Do you agree/disagree with what I have said? You seem to be more knowledgeable than me about these things, and it's been a while since I built a chatbot, so I may be a little out of touch.

sshivaditya2019 commented 2 weeks ago

> @sshivaditya2019 request for comment regarding chatbot creation, knowledge base, etc. Do you agree/disagree with what I have said? You seem to be more knowledgeable than me about these things, and it's been a while since I built a chatbot, so I may be a little out of touch.

@Keyrxng I think it would be beneficial to have a dedicated text corpus from Notion for creating embeddings and conducting similarity searches. You're correct that for an organization-wide intelligent chatbot we should have multiple text corpora, ideally focused on the DAO or the provider (essential to prevent model hallucination; ordinary issue specs and tasks are simply insufficient).

If Notion poses challenges, we could consider using GitHub Pages to maintain Markdown or HTML files for resources. This would offer versioning and serve as a more reliable source of truth for the chatbot.

0x4007 commented 2 weeks ago

> Individual comments and task specs aren't enough for an org-aware chatbot.

Another reason why I have always been against direct messages: we can now pass historical data into LLMs.

We have 90+% of all recent and relevant ideas/conversations/plans of Ubiquity accessible across GitHub comments and Telegram org chat messages.

> Each embedding references its text source; they aren't merged into a single context.

Merge everything remotely relevant into a single context.

sshivaditya2019 commented 2 weeks ago

> Individual comments and task specs aren't enough for an org-aware chatbot.
>
> Another reason why I have always been against direct messages: we can now pass historical data into LLMs.
>
> We have 90+% of all recent and relevant ideas/conversations/plans of Ubiquity accessible across GitHub comments and Telegram org chat messages.
>
> Each embedding references its text source; they aren't merged into a single context.
>
> Merge everything remotely relevant into a single context.

Just to provide context:

[image: reduced-dimensionality visualization of our concept embeddings]

This is how embeddings would look for our concepts (or knowledge base) under reduced dimensionality. In the case of multiple smaller comments and messages, when we run nearest neighbors we would get a lot of random noise (e.g., filler words, wrongly related concepts), which would cause the model to hallucinate, sometimes producing wrong outputs.

From what I have seen, in longer contexts models sometimes lose information presented earlier; this happens even in large-context models. I think it's better to have a few succinct, curated knowledge bases than comments or chats.

Source:

Keyrxng commented 2 weeks ago

> We have 90+% of all recent and relevant ideas/conversations/plans of Ubiquity accessible across GitHub comments and Telegram org chat messages.

@0x4007 How do you envision this Q&A chatbot being used, and by whom? I've assumed on all platforms and by every demographic we have. Could you maybe show a couple of scenario Q&As?

> models sometimes lose information presented earlier; this happens even in large-context models.

A prime example of that here: stuffing ~30k tokens of input into it, it starts to lose itself. Guardrails at that context depth seem to go out of the window a little unless you step through sections, but it's difficult to get the output you want every single time.

That's an interesting way to visualize things and the blog is a good read too.

> My current set of "whole file" embeddings comes in at about 1.6 MB, which is more than three times the size of my real search index (which, at 419 KB contains all the text content of all my pages). Yep, the JSON file with a single embedding for each post is larger than the actual content of all those posts. And to get search results that compete with, say, Fuse.js' matching engine I'd need even more embeddings; ideally one per paragraph of text.

That sets a tone that is not in line with the long-term goals of the DAO re: database dependency, costs, etc. That said, I've only been charged $0.10 for about 18 days' worth of embeddings, though those were not that many in total nor very long.

> this would cause the model to hallucinate, sometimes producing wrong outputs.

This is an acceptable loss depending on the context in which the chatbot is being used. If it's DAO and services, it should never hallucinate; if it's onboarding, like setup and install, of course we have to allow some freedom, as it doesn't have codebase knowledge. If we had embeddings of entire codebases, then hallucinations would be expected to be at an absolute minimum.


I read this and this recently when considering codebase embeddings. That's the real task, as we need to chunk efficiently (i.e., with a code parser), handle overlaps to "extend" context between embeddings, etc. I used LangChain before, which had a lot of helpful tools for these things, but we avoid it as an org. Have you done anything at that sort of scale before? I haven't, to be honest.
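As a baseline, naive fixed-window chunking with overlap looks like this (a sketch; character offsets stand in for the parser-derived boundaries a real implementation would use):

```ts
// Split text into fixed-size windows that share an overlapping tail,
// so context carries across adjacent embeddings.
function chunkWithOverlap(text: string, size = 1500, overlap = 200): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break;
  }
  return chunks;
}
```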

Comments from a mod and a "leader" on the OpenAI forum:

> For example, I have every function/module I have written in two databases. One for latest code, and one for historical/legacy code. If you embed each of these entries, you could have your own search. Right now I search using regex against the latest database. However, this is a great idea, I should just search by correlating the embeddings instead!

> Getting good quality code from embeddings is a tricky business; similarity does not equal functionality, and code that is semantically similar can give very different results. It does "work", just not very well; maybe others here have had more luck. I wish you well.

> For the semantic part, documentation of the functions or code overall is important. For the functional part, you can embed the AST of a file, class, method, globals, etc. The granularity is up to you. Consider chunking and overlapping for this.

I had previously considered using something like ctags or the GNU Global source code tagging system (what IDE language servers etc. use), as I think it would go a long way toward producing data-rich codebase embeddings since we have no JSDoc-style docs, but I haven't researched it practically.
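If we went that route, Universal Ctags can emit JSON tags that we could attach to chunks as metadata before embedding. An untested sketch, assuming universal-ctags is installed and on PATH:

```ts
// Run ctags over a directory and parse its JSON-lines output into tag records.
import { execFileSync } from "node:child_process";

interface Tag { name: string; path: string; line: number; kind: string }

function extractTags(dir: string): Tag[] {
  const output = execFileSync(
    "ctags",
    ["--output-format=json", "--fields=+n", "-R", dir],
    { encoding: "utf8" }
  );
  return output
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line))
    .filter((tag) => tag._type === "tag");
}
```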

0x4007 commented 1 week ago

> @0x4007 How do you envision this Q&A chatbot being used, and by whom? I've assumed on all platforms and by every demographic we have. Could you maybe show a couple of scenario Q&As?

@UbiquityOS how do I set up and start this project?

Sure! Our projects are based on ts-template, which relies on yarn 1.21. Be sure to:

  1. Install nodejs...

Anyway, with a bit of prompting I'm quite certain this will work well enough. I've already done experiments in the past with more primitive models, and no embeddings, that worked fine.