

The MongoDB Oracle

MongoDB Docs Q&A Bot. Ask it questions about MongoDB. Get answers that are usually correct.

Try it out - https://mongodb-oracle.vercel.app/

How It Works

The MongoDB Oracle uses AI to answer questions about MongoDB. It combines two AI paradigms: vector-based search and large language model (LLM) summarization.

An indexing script creates vector embeddings for all the data in the MongoDB documentation using the OpenAI Embeddings API. These embeddings are stored in MongoDB Atlas alongside the documentation text they represent. The data is then indexed in Atlas using the Atlas Search knnBeta operator.
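The documents stored in Atlas pair each text chunk with its embedding. A minimal TypeScript sketch of that shape, where all names (`EmbeddedChunk`, `buildEmbeddedChunks`, the injected `embed` function) are illustrative assumptions rather than names from this repo:

```typescript
// Sketch of the indexing step. `embed` stands in for a call to the
// OpenAI Embeddings API; it is injected here so the shape of the
// stored documents is the focus, not the network call.
type EmbeddedChunk = {
  url: string;         // page the text came from
  text: string;        // documentation text the embedding represents
  embedding: number[]; // vector stored alongside the text in Atlas
};

function buildEmbeddedChunks(
  pages: { url: string; text: string }[],
  embed: (text: string) => number[],
): EmbeddedChunk[] {
  return pages.map((page) => ({
    url: page.url,
    text: page.text,
    embedding: embed(page.text),
  }));
}
```

The resulting array could then be written to an Atlas collection with an ordinary `insertMany`.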

When an end user submits a question, the question is converted to an embedding using the OpenAI Embeddings API. That embedding is then used to query the indexed data. Atlas Search returns the n most relevant results, along with links to the pages they came from.
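The query side can be expressed as an aggregation pipeline using the `knnBeta` operator. A sketch, assuming the vector field is named `embedding` (the field and projection names here are assumptions, not taken from the repo):

```typescript
// Sketch of the vector query against the Atlas Search index.
// Returning `any[]` keeps the heterogeneous pipeline stages simple to type.
function buildKnnQuery(questionEmbedding: number[], k: number): any[] {
  return [
    {
      $search: {
        knnBeta: {
          vector: questionEmbedding, // embedding of the user's question
          path: "embedding",         // assumed name of the vector field
          k,                         // number of nearest neighbors to return
        },
      },
    },
    // keep only what the summarization step needs
    { $project: { text: 1, url: 1, _id: 0 } },
  ];
}
```

This pipeline would be passed to `collection.aggregate(...)` on the indexed collection.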

These results are then passed to OpenAI's GPT-3.5 large language model via the OpenAI Chat API, which summarizes them as Markdown-formatted text.
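One way the retrieved results could be packaged into a Chat API request: put the documentation excerpts into a system message and the user's question into a user message. The prompt wording below is illustrative; only the `{ role, content }` message shape is part of the actual OpenAI Chat API.

```typescript
// Sketch of turning retrieved chunks into Chat API messages.
type ChatMessage = { role: "system" | "user"; content: string };

function buildChatMessages(
  question: string,
  chunks: { text: string; url: string }[],
): ChatMessage[] {
  // join the retrieved excerpts, keeping their source links
  const context = chunks
    .map((c) => `${c.text}\n(Source: ${c.url})`)
    .join("\n---\n");
  return [
    {
      role: "system",
      content:
        "Answer questions about MongoDB using only the provided documentation excerpts. " +
        "Format your answer as Markdown.\n\n" + context,
    },
    { role: "user", content: question },
  ];
}
```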

The Markdown-formatted results are then returned to the client.

To learn more about this paradigm for AI-powered Q&A bots, see this article - https://dagster.io/blog/chatgpt-langchain. Unlike that article, we didn't use LangChain, but we take the same approach and use the same AI APIs.

Architecture

Data Ingestion

(Diagram: ETL data ingestion)

Q & A

(Diagram: Q&A flow)

Issues & Thoughts on Future Direction

Current Issues

As the project currently stands, it works reasonably well. The biggest remaining issues are:

  1. Vector search picks up short, irrelevant pieces of data and feeds them to the LLM summarization step. This could be remedied by improving the quality of the data ingestion scripts in the app/generate-index directory. Improvements could include ensuring that every embedding covers some minimum number of tokens (say >=500).
  2. The ChatGPT LLM can hallucinate answers when it doesn't know the real one. This is most pronounced for links, which it invents more often than we're comfortable with. Lowering the temperature of the LLM responses would likely ameliorate this, but because vector search sometimes fails to surface the most relevant data, a lower temperature currently produces an unacceptable number of 'do not know'-type answers. Refining the LLM prompt could also help improve answer quality.
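The minimum-token improvement from point 1 could look something like the following: greedily merge adjacent chunks until each merged chunk clears the threshold. `countTokens` here is a crude whitespace-based stand-in for a real tokenizer (e.g. tiktoken), used only to keep the sketch self-contained.

```typescript
// Crude token-count approximation; a real implementation would use
// the tokenizer matching the embedding model.
function countTokens(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

// Merge adjacent chunks until each merged chunk has at least minTokens.
function mergeSmallChunks(chunks: string[], minTokens: number): string[] {
  const merged: string[] = [];
  let current = "";
  for (const chunk of chunks) {
    current = current ? `${current}\n\n${chunk}` : chunk;
    if (countTokens(current) >= minTokens) {
      merged.push(current);
      current = "";
    }
  }
  // fold any undersized remainder into the previous chunk
  if (current) {
    if (merged.length > 0) {
      merged[merged.length - 1] += `\n\n${current}`;
    } else {
      merged.push(current);
    }
  }
  return merged;
}
```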

Both of the above problems seem quite solvable; we simply didn't have enough time during Skunkworks to address them.

Next Steps

In addition to resolving the above issues, some additional features we thought about developing but didn't have time for in Skunkworks include:

  1. Alternate interfaces (Slack bot, web component, etc.)
  2. Add a multi-language selector so users can ask questions and receive responses in a variety of languages.
  3. 'Productize it' so people can create their own 'MongoDB Oracle' (probably with a different name 😅) for their own data.
  4. Make the index generator pluggable, so it's easy to ingest various data sources.
  5. Make the query API pluggable, so people can develop alternative interfaces.
  6. Look into using/integrating Langchain, a new popular library for building LLM-related projects.

Skunkworks March 2023 MVP

The MVP to be completed during Skunkworks March 2023 (Skunkalodeon) should have the following components:

Web Frontend

Notably not doing:

Web Server Backend

Index Search Data

Data Layer - MongoDB Atlas with Atlas Search

Post MVP Features

Once we finish the above MVP, some other nice features to add during Skunkworks could include:

In the end we focused more on other features, which can be seen here - https://github.com/mongodben/mongodb-oracle/milestone/2

Understand this repo

Almost all of the code is in the app directory, which is a Next.js app plus some scripts.