livebook-dev / livebook

Automate code & data workflows with interactive Elixir notebooks
https://livebook.dev

LLM integration #2073

Open josevalim opened 11 months ago

josevalim commented 11 months ago

We want to integrate LLMs into Livebook itself. There are at least four distinct levels at which this can happen:

  1. Code completion (may or may not need an LLM) (options: Codeium, Copilot, fine-tuned Repl.it model)

  2. Chat-based (which also includes selecting code and asking the model to document it, as well as explaining exceptions) (options: Codeium, Copilot, OpenAI, Claude)

  3. ~~Semantic search over all installed packages in the runtime (may or may not need an LLM) (options: Codeium, Copilot, OpenAI, Claude)~~

  4. ~~Function completion (we can allow kino/smart-cells to register functions which we hook into prompts, similar to HF Agents) (options: OpenAI)~~

This is a meta-issue and we are currently doing proofs of concept in different areas. There is no clear decision yet. We will most likely allow users to "bring their own LLM" for most of these categories, especially 2, 3, and 4 (either commercial or self-hosted).

vitalis commented 11 months ago

@josevalim I tested all the suggested solutions: Codeium, Copilot, OpenAI, Claude. None of them will be useful with Elixir (btw, Codeium is the worst)... In my opinion it would be best to take a small open-source model and train it explicitly for Elixir. We could host it on Hugging Face for free. I think (and this assumption must be validated) that the Hugging Face rate limit for open-source projects will be sufficient for existing Livebook users. Anyway, I'm here to help with any direction you choose...

josevalim commented 11 months ago

@vitalis for completeness, which particular categories above did you test? How were your tests done, and how did you assess quality?

And, of course, even if we decide to train our own model, there is no guarantee it will perform better. For code completion there are smaller models available, but how would we proceed to fine-tune Llama 2 or similar, and which data and resources would we need for that?

Thanks!

vitalis commented 11 months ago

Code completion: none of the solutions is helpful with Elixir. I tested Copilot before you spoke about this idea on the Twitch stream and tested all the other solutions in the past days... In my (subjective) opinion none of them will provide what YOU expect from code completion; they will do more harm than good.

Chat-based: ChatGPT (GPT-4) is the best of the suggested solutions, but it's $20 a month... I'm not sure many people will pay that in order to work with Livebook / learn Elixir.

Semantic search: again, none of the solutions is helpful with Elixir.

vitalis commented 11 months ago

From my understanding it will be really difficult to fine-tune Llama 2 or a similar model: Llama 2 is huge, and training it would require a lot of GPU resources and time, making it very expensive...

vitalis commented 11 months ago

At this moment I don't have a solution... but I have an idea I want to research next week, and I'll update you.

michaelterryio commented 11 months ago

If you did chat-based with GPT you wouldn't be using the subscription; you'd have users plug in a pay-as-you-go API key, right? Lots and lots of people are already using GPT for way more than Livebook, so I don't think requiring an API key would be the limiter on traction.

For the record, the chat-based workflow has been a tremendous accelerant to my Elixir app development, even without a built-in integration with Livebook.
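To make the "plug in a key" idea concrete, here is a minimal sketch of such a call from a notebook, assuming the Req HTTP client and a user-supplied pay-as-you-go key (the model name and prompts are illustrative, not an existing Livebook API):

```elixir
# Minimal sketch: a plain chat-completion call with a user-supplied key.
Mix.install([{:req, "~> 0.4"}])

api_key = System.fetch_env!("OPENAI_API_KEY")

response =
  Req.post!("https://api.openai.com/v1/chat/completions",
    auth: {:bearer, api_key},
    json: %{
      model: "gpt-4",
      messages: [
        %{role: "system", content: "You are an assistant inside an Elixir notebook."},
        %{role: "user", content: "Explain what a charlist is in Elixir."}
      ]
    }
  )

response.body["choices"] |> hd() |> get_in(["message", "content"]) |> IO.puts()
```

A built-in integration would mostly be chat UI and plumbing around a call like this.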

vitalis commented 11 months ago

@michaelterryio It's not just "plug in a pay-as-you-go API key"... it requires development. I agree that a chat-based workflow can be a major improvement in app development. My (subjective) opinion is that you don't need a Livebook/GPT integration in order to use GPT as a chat-based workflow. Again, in my (subjective) opinion, if time is going to be invested in an integration with any LLM, it should be something game-changing, not something mainstream. The trend today is adding a GPT generator to every app, and from what I see it's useless... (using GPT directly rather than via an integration is much better).

vitalis commented 11 months ago

@josevalim Tested all the other available models: CodeWhisperer, StarCoder, WizardCoder... and many more. I'm testing the models with the simple task of writing "bubble sort". The worst model at this moment is CodeWhisperer: Elixir is not supported at all. But here is an example of bubble sort by WizardCoder:

[Screenshot: WizardCoder's attempt at a bubble sort in Elixir]
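For reference, and decidedly not the model's output: a hand-written bubble sort in idiomatic Elixir uses recursion, since lists are immutable, and looks roughly like this:

```elixir
# Reference implementation (hand-written, not model output).
# Repeatedly passes over the list, swapping adjacent out-of-order
# pairs, until a pass makes no swaps.
defmodule Bubble do
  def sort(list) do
    case pass(list) do
      {sorted, false} -> sorted
      {swapped, true} -> sort(swapped)
    end
  end

  # One pass; the boolean tracks whether any swap happened.
  defp pass([a, b | rest]) when a > b do
    {tail, _} = pass([a | rest])
    {[b | tail], true}
  end

  defp pass([a | rest]) do
    {tail, swapped} = pass(rest)
    {[a | tail], swapped}
  end

  defp pass([]), do: {[], false}
end

Bubble.sort([5, 1, 4, 2, 8])
#=> [1, 2, 4, 5, 8]
```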

Didn't test the Repl.it one... too complex to set up and no free plan.

I'll continue the research next week

josevalim commented 11 months ago

StarCoder, WizardCoder, and the replit one would have to be fine-tuned first. I am not expecting them to give good results out of the box.

vitalis commented 11 months ago

@josevalim Is there an existing dataset for Elixir?

seivan commented 9 months ago

I think 1 and 2 are related. You could create embeds of all the cells in a notebook and store them in some vector table (sqlite?). When querying the chat, create embeds of the query and previous context and use that. This would let you at least ask questions about your notebook. Which is useful, not sure how impactful it would be. Though not sure how useful if the models don't understand Elixir well enough :/
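A minimal sketch of that idea, assuming Bumblebee's text-embedding serving (the model name is just an example, and the in-memory list stands in for a real vector table):

```elixir
# Sketch: embed notebook cells, then rank them against a chat query.
Mix.install([{:bumblebee, "~> 0.5"}, {:exla, "~> 0.7"}])

repo = {:hf, "sentence-transformers/all-MiniLM-L6-v2"}
{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)

serving =
  Bumblebee.Text.text_embedding(model_info, tokenizer,
    # L2-normalized embeddings make dot product == cosine similarity
    embedding_processor: :l2_norm,
    defn_options: [compiler: EXLA]
  )

embed = fn text -> Nx.Serving.run(serving, text).embedding end

cells = [
  "defmodule Stats do ... end",
  "A charlist is a list of integer code points...",
  "Kino.DataTable.new(results)"
]

index = Enum.map(cells, fn cell -> {cell, embed.(cell)} end)
query = embed.("what is a charlist?")

index
|> Enum.max_by(fn {_cell, emb} -> query |> Nx.dot(emb) |> Nx.to_number() end)
|> elem(0)
#=> "A charlist is a list of integer code points..."
```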

jonastemplestein commented 9 months ago

Hey folks, just joining the discussion from #2222

A few thoughts from my side

As it happens, I'm in SF this week chatting to folks at both OpenAI and Anthropic about a number of things, and I'm already planning to ask them how we could fine-tune an elixir-only model.

Let me know if there's anything specifically you'd like me to ask those folks

Apologies for the long ramblings, I'm awake in the middle of the night and severely jetlagged 🤣

josevalim commented 9 months ago

Hi @jonastemplestein, that was a great report. :)

[RAG] doesn't work super well for "codebase level" questions in large codebases.

That has also been my experience, even before we reach the LLM layer. Indexing all of the Elixir API docs and then performing an embedding search did not provide good results. Asking "what is a charlist" did not bring up the relevant section of a moduledoc; more often it would bring up a specific function, such as String.to_charlist, and feeding that into the LLM made it worse.

You can then "@-mention" the documentation in your LLM prompt and Cursor tries to find relevant bits of documentation and adds them to the context.

This is an amazingly cool idea, being able to "@-mention" the documentation. And given how Elixir stores docs, it should be trivial to include the documentation of anything, as sketched below. The only problem I can think of here is guides: we can programmatically fetch a package's API docs but not its guides. Although I would assume that, for something as large as Ecto or Phoenix, a lot of the knowledge already exists in the neural network, since you can't pass all of its docs or guides in the context anyway.
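For instance, Code.fetch_docs/1 reads the docs chunk that Elixir embeds in compiled modules, so assembling the context for an @-mention could start from something like:

```elixir
# Fetch the docs Elixir ships inside the .beam file for a module.
{:docs_v1, _anno, :elixir, "text/markdown", %{"en" => moduledoc}, _meta, entries} =
  Code.fetch_docs(Enum)

IO.puts(moduledoc)

# Each entry is {{kind, name, arity}, anno, signature, doc, metadata};
# the pattern below skips entries whose doc is :none or :hidden.
for {{:function, name, arity}, _anno, _sig, %{"en" => doc}, _meta} <- entries do
  IO.puts("## Enum.#{name}/#{arity}\n#{doc}")
end
```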

Perhaps initially we should focus on supporting some sort of @-mention and passing the notebook itself as context.


That said, I will strike out the RAG idea from the initial issue description. I believe we should focus on two separate problems: chat-based interfaces and code completion. For the chat-based interface (and everything that falls out of it, such as explaining code, explaining exceptions, refactoring code, automatically fixing errors, etc.), we will likely need to use an external LLM, due to the required model size. But I still hope we can run our own model for code completion, fine-tuned for Elixir.
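As a rough sketch of what "running our own model" could look like, Bumblebee's text-generation serving already covers the mechanics; GPT-2 below is only a stand-in checkpoint for a future small model fine-tuned on Elixir:

```elixir
# Sketch only: GPT-2 stands in for an Elixir-tuned small model.
Mix.install([{:bumblebee, "~> 0.5"}, {:exla, "~> 0.7"}])

repo = {:hf, "gpt2"}
{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    defn_options: [compiler: EXLA]
  )

# A code-completion request is just "continue this source text".
Nx.Serving.run(serving, "defmodule Math do\n  def fib(0), do: 0\n")
#=> %{results: [%{text: "..."}]}
```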

As it happens, I'm in SF this week chatting to folks at both OpenAI and Anthropic about a number of things, and I'm already planning to ask them how we could fine-tune an elixir-only model.

If we can fine-tune any of them for Elixir, that would be amazing, and it would definitely help us gear towards a decision. Let me know if there is anything I can do to help. We can also help with collecting the data. Both companies probably already have crawlers which they could point at hex.pm, but we can also try to build a large archive with all the API references from Hex.pm. The only challenge is that this would need to be refreshed/updated somewhat periodically.

seivan commented 9 months ago

Asking "what is a charlist" did not bring up the relevant section of a moduledoc; more often it would bring up a specific function, such as String.to_charlist, and feeding that into the LLM made it worse.

Can I bother you for the exact details of how you implemented the RAG here? What was the exact search method? The reason I ask is that capturing meaning (at least in this case) matters. If your query is "what is a charlist" and the documentation says "A charlist is...", then that paragraph should probably have been included in your context, or am I way, way off?


Edit: I just want to point out that I actually don't disagree; I'm just wondering what went wrong.

josevalim commented 7 months ago

Hi folks, we have a rough plan here.

Code completion (#1)

We are still exploring options for code completion. I would really love it if this were something we could run in Elixir itself, so that people could trivially run it on their machines or their own servers (assuming at most 3B params). Our preferences (in order) are:

Chat, RAG and functions (#2, #3, #4)

With OpenAI's latest announcements, they have an offering (Assistants) that addresses problems 2, 3, and 4 for us, which makes it very compelling. Therefore, it is likely we will move forward by integrating with anything that exposes the OpenAI Assistants API. This means: