@edublancas quick update: I have a skeleton API running using FastAPI and Celery for the task queue. Working on porting over the functionality from the original chat-with-github example.
Some clarification: the original app defines an `IndexLoader`, which downloads and parses a repo into a `VectorStoreIndex`, one repo at a time. We want the new API to download/parse multiple repos at once in the background, correct? Then it can load the relevant repo and answer the question whenever a `POST /ask/` is made.
My approach is (a rough sketch of these endpoints follows below):

- A `.metadata.json` file and an `/indexes` folder. `.metadata.json` will store the `id`, `status`, and `path` for each repo. `/indexes` will contain an individual `.pickle` file for each repo's content.
- `POST /scrape/` will create a new entry in `.metadata.json`, start the `download` task asynchronously, and return the id. The `download` task gets the repo contents, creates and saves an index file from it, then updates the `status` and `path` in `.metadata.json`.
- `GET /status/{repo_id}` just returns the `status` from `.metadata.json`.
- `POST /ask/` looks up the `path` from `.metadata.json` and passes the `.pickle` file to the LLM to answer the question.

Let me know how this sounds.
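Roughly, the endpoints could look like this (a sketch only, not final code: the `tasks.download` Celery task is assumed to exist and is sketched later in the thread):

```python
import json
import uuid
from pathlib import Path

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# hypothetical module holding the Celery `download` task
from tasks import download

METADATA = Path(".metadata.json")
app = FastAPI()


class ScrapeRequest(BaseModel):
    repo: str  # e.g. "ploomber/jupysql"


class AskRequest(BaseModel):
    repo_id: str
    question: str


def read_metadata() -> dict:
    return json.loads(METADATA.read_text()) if METADATA.exists() else {}


@app.post("/scrape/")
def scrape(req: ScrapeRequest):
    repo_id = str(uuid.uuid4())
    meta = read_metadata()
    meta[repo_id] = {"repo": req.repo, "status": "pending", "path": None}
    METADATA.write_text(json.dumps(meta))
    download.delay(repo_id, req.repo)  # runs in the background via Celery
    return {"repo_id": repo_id}


@app.get("/status/{repo_id}")
def status(repo_id: str):
    meta = read_metadata()
    if repo_id not in meta:
        raise HTTPException(status_code=404, detail="unknown repo_id")
    return {"status": meta[repo_id]["status"]}


@app.post("/ask/")
def ask(req: AskRequest):
    entry = read_metadata().get(req.repo_id)
    if entry is None or entry["status"] != "finished":
        raise HTTPException(status_code=409, detail="index not ready")
    # load the pickled index from entry["path"] and query it with req.question
    ...
```

One caveat with this shape: concurrent writes to the JSON file from the API process and the Celery worker would need care.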
overall your approach sounds good, just one suggestion: instead of the `.metadata.json` file, track each repo's `id`, `status`, and `path` in SQLite
and what do you want to store in the pickle file?
@edublancas okay that makes sense, I'll try out the sqlite method.
The pickle file stores the `VectorStoreIndex` that is built from a repo, so there will be an `index_repo-id.pickle` file for each repo that is parsed. This way we can just load the `VectorStoreIndex` and use it to answer questions, which is faster than creating a new one every time a user asks a question. btw I didn't decide on this, it's just how they did it in the original chat-with-github example.
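For illustration, the persistence pattern is roughly this (a minimal sketch; the file layout follows the `/indexes` plan above):

```python
import pickle
from pathlib import Path

INDEXES = Path("indexes")


def save_index(index, repo_id: str) -> str:
    """Persist a built VectorStoreIndex so it can be reused for later questions."""
    INDEXES.mkdir(exist_ok=True)
    path = INDEXES / f"index_{repo_id}.pickle"
    with path.open("wb") as f:
        pickle.dump(index, f)
    return str(path)


def load_index(path: str):
    """Reload a saved index instead of re-parsing the repo on every question."""
    with open(path, "rb") as f:
        return pickle.load(f)
```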
try swapping the `VectorStoreIndex` persistence to LanceDB, it'll allow you to persist the index using their format instead of pickle: https://docs.llamaindex.ai/en/stable/examples/vector_stores/LanceDBIndexDemo.html
if it doesn't work, pickle is ok
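For reference, the linked demo boils down to something like this (a sketch assuming the `llama_index` import paths from that page; newer releases have since moved these modules around):

```python
from llama_index import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores import LanceDBVectorStore

# "repos/jupysql" is a placeholder path to an already-cloned repo
documents = SimpleDirectoryReader("repos/jupysql").load_data()

# LanceDB persists the index to disk in its own format, so no pickle file is needed
vector_store = LanceDBVectorStore(uri="indexes/lancedb")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

print(index.as_query_engine().query("What does this repo do?"))
```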
@bryannho let's now build a frontend. let's use Solara this time, I think you built the arxiv chat with solara right? you can use the same code for the chat
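For context, the Solara chat loop looks roughly like this (a sketch using `solara.lab`'s chat components; `ask_api` is a hypothetical helper that calls the `POST /ask/` endpoint):

```python
import requests
import solara
import solara.lab

messages = solara.reactive([])  # each item: {"role": ..., "content": ...}


def ask_api(question: str) -> str:
    """Hypothetical helper that forwards the question to the FastAPI backend."""
    resp = requests.post(
        "http://localhost:8000/ask/",
        json={"repo_id": "...", "question": question},  # repo_id comes from the UI
    )
    return resp.json()["answer"]


@solara.component
def Page():
    def send(message):
        messages.value = [
            *messages.value,
            {"role": "user", "content": message},
            {"role": "assistant", "content": ask_api(message)},
        ]

    with solara.lab.ChatBox():
        for item in messages.value:
            with solara.lab.ChatMessage(user=item["role"] == "user", name=item["role"]):
                solara.Markdown(item["content"])
    solara.lab.ChatInput(send_callback=send)
```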
@edublancas question on the frontend design:
In the original Panel app, the user enters the owner, repo, and branch info via a form on the side panel. The app then loads the repo, and the user can use the chat interface on the main panel. For reference: (screenshot of the original Panel UI)

Should we replicate this design, or make it purely a chat interface as we did with Arxiv Chat?

If we use the Panel design, it makes the logic to load repos a little simpler. The original app only allows loading one repo at a time, but the new one will allow multiple. Either way, I'll need to use OpenAI function calling to discern which repo the user is asking about (rough sketch below); if we use a pure chat interface, I'll also need the LLM to decide whether the user is asking to load a new repo.
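For illustration, that routing step could look like this (a sketch using the current `openai` client; `select_repo` is a hypothetical tool, and the loaded repos are passed in the system message):

```python
import json

from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "select_repo",
            "description": "Pick which loaded repo the user's question is about",
            "parameters": {
                "type": "object",
                "properties": {
                    "repo_id": {"type": "string", "description": "id of the target repo"}
                },
                "required": ["repo_id"],
            },
        },
    }
]


def route_question(question: str, loaded_repos: dict) -> str:
    """Ask the model which loaded repo the question refers to."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f"Loaded repos: {json.dumps(loaded_repos)}"},
            {"role": "user", "content": question},
        ],
        tools=tools,
        # force the model to answer via the select_repo tool
        tool_choice={"type": "function", "function": {"name": "select_repo"}},
    )
    call = response.choices[0].message.tool_calls[0]
    return json.loads(call.function.arguments)["repo_id"]
```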
yeah let's keep the Panel design (user selects which repo to build), sounds like that's simpler
we want to build an application similar to this one but using FastAPI (we'll only build the API for now, we'll tackle the frontend later)

there are a few endpoints that this app needs:

- an endpoint to parse the contents of a repo: `POST /scrape/` should scrape the data from a GitHub repo (it should take `repo` in the request body, e.g. `ploomber/jupysql`) and return a `repo_id`
- `GET /status/{repo_id}` should return the status of a `repo_id` (`pending`, `finished`) so clients can check whether scraping has finished
- `POST /ask/` should take a `question` and a `repo_id` in the body, and return the answer to the question using that repository

since scraping will take a minute or so, we need to implement a task queue to run jobs in the background, I think celery is the simplest option (a minimal sketch of the wiring follows below)

important: everything has to be prepared in a single dockerfile
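For reference, a minimal sketch of the Celery wiring (assuming a Redis broker; `clone_repo` and `build_index` are hypothetical helpers for fetching the repo and parsing it into a `VectorStoreIndex`):

```python
import json
import pickle
from pathlib import Path

from celery import Celery

# assumes a Redis broker running alongside the API
celery_app = Celery("tasks", broker="redis://localhost:6379/0")

METADATA = Path(".metadata.json")
INDEXES = Path("indexes")


@celery_app.task
def download(repo_id: str, repo: str) -> None:
    """Fetch the repo, build its index, and mark it finished in the metadata."""
    # hypothetical helpers: clone_repo fetches the GitHub repo locally,
    # build_index parses the files into a VectorStoreIndex
    from indexing import build_index, clone_repo

    docs_path = clone_repo(repo)
    index = build_index(docs_path)

    INDEXES.mkdir(exist_ok=True)
    path = INDEXES / f"index_{repo_id}.pickle"
    with path.open("wb") as f:
        pickle.dump(index, f)

    meta = json.loads(METADATA.read_text())
    meta[repo_id].update(status="finished", path=str(path))
    METADATA.write_text(json.dumps(meta))
```

One way to satisfy the single-Dockerfile constraint is an entrypoint script that starts the Redis broker, the Celery worker, and the uvicorn server together.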