josevalim opened this issue 11 months ago
@josevalim I tested all the suggested solutions: Codeium, Copilot, OpenAI, Claude. None of them is useful with Elixir (by the way, Codeium is the worst). In my opinion, the best option is to take a small open-source model and train it explicitly for Elixir. We could host it on Hugging Face for free. I think (and this assumption must be validated) that the Hugging Face rate limit for open-source projects will be sufficient for existing Livebook users. In any case, I'm here to help with whichever direction you choose.
@vitalis for completeness, which particular categories above did you test? How were your tests done, and how did you assess quality?
And, of course, even if we decide to train our own model, there is no guarantee it will perform better. For code completion there are smaller models available, but how would we proceed to train Llama 2 or similar, and what data and resources would we need for that?
Thanks!
Code completion: none of the solutions is helpful with Elixir. I tested Copilot before you spoke about this idea on the Twitch stream and tested all the other solutions over the past days. In my (subjective) opinion, none of them will provide what you expect from code completion; they will do more harm than good.
Chat-based: ChatGPT (GPT-4) is the best of the suggested solutions, but it costs $20 a month. I'm not sure many people will pay that in order to work with Livebook or learn Elixir.
Semantic search: again, none of the solutions is helpful with Elixir.
From my understanding it will be really difficult to optimize Llama 2 or a similar model: Llama 2 is huge, so it would require a lot of GPU resources and time, and it would be very expensive.
At this moment I don't have a solution, but I have an idea I want to research next week, and I'll update you.
If you did chat-based with GPT, you wouldn't be using the subscription; you'd have users plug in a pay-as-you-go API key, right? Lots and lots of people are already using GPT for way more than Livebook, so I don't think a GPT token requirement would be the limiter on traction.
For the record, the chat-based workflow has been a tremendous accelerant to my Elixir app development, even without a built-in integration with Livebook.
@michaelterryio It's not just "plug in a pay-as-you-go API key"; it requires development. I agree that a chat-based workflow can be a major improvement in app development. My (subjective) opinion is that you don't need a Livebook/GPT integration in order to use GPT in a chat-based workflow. Again, in my (subjective) opinion, if time is invested in integrating with any LLM, it should be something game-changing, not something mainstream. The trend today is adding a GPT generator to every app, and from what I see it's useless; using GPT directly rather than via an integration is much better.
@josevalim Tested all the other available models: CodeWhisperer, StarCoder, WizardCoder, and many more. I'm testing the models with the simple task of writing a bubble sort. The worst model at this moment is CodeWhisperer; Elixir is not supported at all. But here is an example of bubble sort by WizardCoder:
Didn't test Replit: too complex and no free plan.
I'll continue the research next week
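For reference, and purely as a hand-written baseline to compare model outputs against (this is not any model's output), an idiomatic Elixir bubble sort might look like:

```elixir
defmodule BubbleSort do
  @moduledoc "Hand-written reference bubble sort, usable as a baseline when judging model output."

  # Repeatedly sweep the list until a full pass performs no swaps.
  def sort(list) do
    case sweep(list, [], false) do
      {result, false} -> result
      {result, true} -> sort(result)
    end
  end

  # One pass: swap adjacent out-of-order pairs, tracking whether we swapped.
  defp sweep([a, b | rest], acc, _swapped?) when a > b,
    do: sweep([a | rest], [b | acc], true)

  defp sweep([a | rest], acc, swapped?),
    do: sweep(rest, [a | acc], swapped?)

  defp sweep([], acc, swapped?),
    do: {Enum.reverse(acc), swapped?}
end
```

`BubbleSort.sort([5, 3, 8, 1])` returns `[1, 3, 5, 8]`. A model that cannot produce something in this ballpark (pattern matching, recursion, no mutable loops) is arguably not useful for Elixir completion.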
StarCoder, WizardCoder, and the replit one would have to be fine-tuned first. I am not expecting them to give good results out of the box.
@josevalim Is there an existing dataset for Elixir?
I think 1 and 2 are related. You could create embeddings of all the cells in a notebook and store them in some vector table (SQLite?). When querying the chat, create embeddings of the query and the previous context and use those. This would let you at least ask questions about your notebook, which is useful; I'm just not sure how impactful it would be. And not sure how useful it is if the models don't understand Elixir well enough. :/
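A minimal sketch of the idea above: rank notebook cells against a query by cosine similarity of their embeddings. Note that the `{source, vector}` cell shape and the embedding step are assumptions; in practice the vectors would come from Bumblebee or an embeddings API, and storage could be SQLite as suggested.

```elixir
defmodule NotebookRAG do
  # Rank notebook cells by cosine similarity to a query embedding and
  # keep the top k to splice into the chat prompt as context.
  def top_cells(cells, query_vec, k \\ 3) do
    cells
    |> Enum.map(fn {source, vec} -> {source, cosine(vec, query_vec)} end)
    |> Enum.sort_by(fn {_source, score} -> score end, :desc)
    |> Enum.take(k)
  end

  defp cosine(a, b) do
    dot = Enum.zip_with(a, b, &(&1 * &2)) |> Enum.sum()
    dot / (norm(a) * norm(b))
  end

  defp norm(v), do: :math.sqrt(Enum.sum(Enum.map(v, &(&1 * &1))))
end
```

With pre-computed vectors, `NotebookRAG.top_cells([{"defmodule Foo ...", [1.0, 0.0]}, {"# Intro", [0.0, 1.0]}], [1.0, 0.0], 1)` returns the code cell first.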
Hey folks, just joining the discussion from #2222
A few thoughts from my side
In addition to the types of LLM functionality you mention @josevalim, I'd add two others
ChatGPT "Advanced Data Analysis" and Open Interpreter go beyond a simple chat panel and implement a full "development loop". You'd be able to say, e.g., "I want to generate all strings with 5 characters that aren't English-language words but are easy to pronounce" (aka pseudowords), and it works out how to do that. Both Open Interpreter and Advanced Data Analysis did that for me almost entirely unassisted.
Another small quality-of-life feature is automatically fixing errors and warnings. In Cursor I can just click "Fix with AI" on any error or warning, and it works about 50% of the time.
The only model I've tried that gives me really good results for Elixir is GPT-4. I use it dozens of times every day in the Cursor editor.
With GPT-4, retrieval-augmented generation (storing embeddings of documentation and code in a vector DB and pulling them into the prompt context) has been a mixed bag so far.
The main problem with retrieval-augmented generation and GPT-4 is the relatively small context window. You need to be very sure that a piece of context is relevant enough to include in the prompt, and that means you end up omitting a lot of relevant context.
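The trade-off described above can be sketched as a greedy packing problem: given similarity-scored chunks, keep only the highest-scoring ones until the token budget runs out, dropping everything else. The 4-bytes-per-token estimate below is a rough assumption, not a real tokenizer.

```elixir
defmodule ContextBudget do
  # Greedily pack the highest-scoring chunks into a fixed token budget.
  # Everything that does not fit is dropped, which is exactly how relevant
  # context gets lost when the window is small.
  def pack(scored_chunks, max_tokens) do
    scored_chunks
    |> Enum.sort_by(fn {_chunk, score} -> score end, :desc)
    |> Enum.reduce({[], 0}, fn {chunk, _score}, {kept, used} ->
      # Rough token estimate: ~4 bytes per token, plus one for separators.
      cost = div(byte_size(chunk), 4) + 1

      if used + cost <= max_tokens do
        {[chunk | kept], used + cost}
      else
        {kept, used}
      end
    end)
    |> then(fn {kept, _used} -> Enum.reverse(kept) end)
  end
end
```

With a budget of 2 "tokens", `ContextBudget.pack([{"aaaa", 0.9}, {"bbbbbbbbbbbb", 0.8}], 2)` keeps only `["aaaa"]`, illustrating how a lower-scoring but still relevant chunk silently falls out of the prompt.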
Anthropic's Claude is much better at considering large contexts (and it's much faster!), but it can't reason about code very well. I hear that the next version of Claude closes that gap, and I'm pretty stoked for that. In general, once a model's coding ability is "good enough", speed becomes the most important feature: the faster the model, the lower the threshold for actually using it.
My sense is that a model that can already program and reason about code really well (like GPT-4, the upcoming version of Claude, and possibly future larger versions of Code Llama or similar) would benefit a lot from Elixir-specific fine-tuning (compared to just retrieval-augmented generation). The big advantage the Elixir (and Phoenix) ecosystem has in this regard is that everything is incredibly well documented, and there is a good canonical set of information to fine-tune on (e.g. the docs of Elixir, Phoenix, and the most commonly used packages; OSS projects like Livebook; Phoenix sample apps; Elixir Forum/Slack/Discord posts; etc.)
Another useful source of data for fine-tuning would be human feedback coming from feedback buttons on generated content within Livebook, or on an "AI bot" that lives in the Elixir Forum, Slack, Discord, and GitHub. Again, this is an area where the small, close-knit Elixir community with a benevolent dictator may be able to do things that other communities can't.
If the end result is a model that can do most data analysis tasks end-to-end in a Livebook, or implement entire product features in a Phoenix app, I don't think fine-tuning would be cost-prohibitive, even if it costs several hundred thousand dollars.
As it happens, I'm in SF this week chatting to folks at both OpenAI and Anthropic about a number of things, and I'm already planning to ask them how we could fine-tune an elixir-only model.
Let me know if there's anything specifically you'd like me to ask those folks
Apologies for the long ramblings, I'm awake in the middle of the night and severely jetlagged 🤣
Hi @jonastemplestein, that was a great report. :)
> [RAG] doesn't work super well for "codebase level" questions in large codebases.
That has also been my experience, even before we reach the LLM layer. Indexing all of the Elixir API docs and then performing an embedding search did not provide good results. Asking what a charlist is did not bring up the relevant section in a moduledoc; more often it would return a specific function, such as `String.to_charlist`, and feeding that into the LLM made it worse.
> You can then "@-mention" the documentation in your LLM prompt and Cursor tries to find relevant bits of documentation and adds them to the context.
This is an amazingly cool idea, being able to "@-mention" the documentation. And given how Elixir stores docs, it should be trivial to include the documentation of something. The only problem I can think of here is guides: we can programmatically fetch API docs, but not a package's guides. Although I would assume that, for something as large as Ecto or Phoenix, a lot of knowledge already exists in the neural network, because you can't pass all of its docs or guides in the context anyway.
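For what it's worth, fetching a module's docs programmatically really is a one-liner thanks to the standard docs chunk (EEP 48) that `Code.fetch_docs/1` reads. A minimal sketch of resolving a hypothetical "@-mention" into prompt context (the `DocMention` module and its shape are illustrative, not an existing Livebook API):

```elixir
defmodule DocMention do
  # Resolve an "@-mentioned" module into text that could be appended to an
  # LLM prompt, using the EEP 48 docs chunk via Code.fetch_docs/1.
  def doc_for(module) do
    case Code.fetch_docs(module) do
      # Standard Elixir modules store the moduledoc as %{"en" => markdown}.
      {:docs_v1, _anno, :elixir, _format, %{"en" => moduledoc}, _meta, _docs} ->
        {:ok, moduledoc}

      # Module compiled without a moduledoc (or with @moduledoc false).
      {:docs_v1, _anno, :elixir, _format, _moduledoc, _meta, _docs} ->
        {:error, :no_moduledoc}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

For example, `DocMention.doc_for(Enum)` yields `{:ok, markdown}` with `Enum`'s moduledoc, ready to splice into the context, assuming the runtime has its docs chunks available.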
Perhaps initially we should focus on supporting some sort of @-mention and passing the notebook itself as context.
That said, I will strike out the RAG idea from the initial issue description, I believe we should focus on two separate problems: chat-based interfaces and code completion. For the chat-based interface (and everything that falls out of it, such as explaining code, explaining exceptions, refactoring code, automatically fixing errors, etc), we will likely need to use an external LLM, due to the required size. But I still hope we can run our own model for code completion, fine-tuned to Elixir.
> As it happens, I'm in SF this week chatting to folks at both OpenAI and Anthropic about a number of things, and I'm already planning to ask them how we could fine-tune an elixir-only model.
If we can fine-tune any of them for Elixir, it would be amazing, and it would definitely help us gear towards a decision. Let me know if there is anything I can do to help. We can also help with collecting the data: both companies probably already have crawlers they could point at hex.pm, but we can also try to build a large archive with all the API references from Hex.pm. The only challenge is that it would need to be refreshed/updated somewhat periodically.
> Asking what a charlist is did not bring up the relevant section in a moduledoc; more often it would return a specific function, such as `String.to_charlist`, and feeding that into the LLM made it worse.
Can I bother you for the exact details of how you implemented the RAG here?
What was the exact search method?
The reason I ask is that capturing meaning (at least in this case) matters. If your query is "what is a charlist", how well the search captures that meaning determines what gets retrieved.
Edit: I just want to point out that I actually don't disagree; I'm just wondering what went wrong.
Hi folks, we have a rough plan here.
We are still exploring options for code completion. I would really love it if this were something we can run in Elixir itself, so people could run it on their machines or their own servers trivially (assuming at most 3B params). Our preferences (in order) are:
With OpenAI's latest announcements, they have an offering (Assistants) that addresses problems 2, 3, and 4 for us, which makes it very compelling. Therefore, it is likely we will move forward by integrating with anything that exposes the OpenAI Assistants API. This means:
There are several tools that expose the OpenAI API on top of open-source models, and it is just a matter of time until they are updated to offer the new Assistants API on top of Llama, Mistral, etc. This would allow anyone to bring their own LLM too.
Companies and users can bring their own OpenAI assistant into Livebook (which, for example, could include their own private docs for retrieval)
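A sketch of why "anything that exposes the OpenAI API" is appealing: only the base URL changes between OpenAI itself and a local OpenAI-compatible server. The example below is a plain chat-completions call (the Assistants API layers more endpoints on top) and assumes the `Req` HTTP client and a user-supplied key; it is illustrative, not Livebook's actual integration code.

```elixir
defmodule LLMClient do
  # Minimal "bring your own LLM" call against any OpenAI-compatible endpoint.
  # base_url: "https://api.openai.com" for OpenAI, or e.g. a local server's URL.
  def chat(base_url, api_key, prompt) do
    Req.post!("#{base_url}/v1/chat/completions",
      auth: {:bearer, api_key},
      json: %{
        model: "gpt-4",
        messages: [%{role: "user", content: prompt}]
      }
    )
    |> Map.fetch!(:body)
    |> get_in(["choices", Access.at(0), "message", "content"])
  end
end
```

Swapping in a self-hosted model would then mean changing `base_url` (and the `model` name), with no other code changes on the Livebook side.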
We want to integrate LLMs as part of Livebook itself. There are at least four distinct levels this can happen:
1. Code completion (may or may not need an LLM) (options: Codeium, Copilot, fine-tuned Repl.it model)
2. Chat-based (which also includes selecting code and asking to document it, as well as explaining exceptions) (options: Codeium, Copilot, OpenAI, Claude)
3. ~Semantic search over all installed packages in the runtime (may or may not need an LLM) (options: Codeium, Copilot, OpenAI, Claude)~
4. ~Function completion (we can allow kino/smart-cells to register functions which we hook into prompts, similar to HF Agents) (options: OpenAI)~
This is a meta-issue and we are currently doing proofs of concept in different areas. There is no clear decision yet. We will most likely allow users to "bring their own LLM" for most of these categories, especially 2, 3, and 4 (either commercial or self-hosted).