josevalim opened this issue 11 months ago
@josevalim I tested all the suggested solutions: Codeium, Copilot, OpenAI, Claude. None of them is useful with Elixir (by the way, Codeium is the worst). In my opinion, the best option is to take a small open-source model and train it explicitly for Elixir. We could host it on Hugging Face for free. I think (and this assumption must be validated) that the Hugging Face rate limit for open-source projects will be sufficient for existing Livebook users. In any case, I'm here to help with whichever direction you choose.
@vitalis for completeness, which particular categories above did you test? How were your tests done, and how did you assess quality?
And, of course, even if we decide to train our own model, there is no guarantee it will perform better. For code completion there are smaller models available, but how would we proceed to train Llama 2 or similar, and what data and resources would we need for that?
Thanks!
Code completion: none of the solutions is helpful with Elixir. I tested Copilot before you spoke about this idea on the Twitch stream and tested all the other solutions over the past days. In my (subjective) opinion, none of them will provide what you expect from code completion; they will do more harm than good.
Chat-based: ChatGPT (GPT-4) is the best of the suggested solutions, but it costs $20 a month. I'm not sure many people will pay that in order to work with Livebook or learn Elixir.
Semantic search: again, none of the solutions is helpful with Elixir.
From my understanding it will be really difficult to optimize Llama 2 or a similar model: Llama 2 is huge, so it would require a lot of GPU resources and time, and it would be very expensive.
At this moment I don't have a solution, but I have an idea I want to research next week, and I'll update you.
If you did chat-based with GPT, you wouldn't be using the subscription; you'd have users plug in a pay-as-you-go API key, right? Lots and lots of people are already using GPT for way more than Livebook, so I don't think a GPT token requirement would be the limiter on traction.
For the record, the chat-based workflow has been a tremendous accelerant to my Elixir app development, even without a built-in integration with Livebook.
@michaelterryio It's not just "plug in a pay-as-you-go API key"; it requires development. I agree that a chat-based workflow can be a major improvement in app development. My (subjective) opinion is that you don't need a Livebook/GPT integration in order to use GPT in a chat-based workflow. Again, in my (subjective) opinion, if time is invested in integrating with any LLM, it should be something game-changing, not something mainstream. The trend today is adding a GPT generator to every app, and from what I see it's useless; using GPT directly rather than via an integration is much better.
@josevalim Tested all the other available models: CodeWhisperer, StarCoder, WizardCoder, and many more. I'm testing the models with the simple task of writing a bubble sort. The worst model at this moment is CodeWhisperer; Elixir is not supported at all. But here is an example of bubble sort by WizardCoder:
Didn't test Replit: too complex and no free plan.
I'll continue the research next week
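For reference, and purely as a hand-written baseline to compare model outputs against (this is not any model's output), an idiomatic Elixir bubble sort might look like:

```elixir
defmodule BubbleSort do
  @moduledoc "Hand-written reference bubble sort, usable as a baseline when judging model output."

  # Repeatedly sweep the list until a full pass performs no swaps.
  def sort(list) do
    case sweep(list, [], false) do
      {result, false} -> result
      {result, true} -> sort(result)
    end
  end

  # One pass: swap adjacent out-of-order pairs, tracking whether we swapped.
  defp sweep([a, b | rest], acc, _swapped?) when a > b,
    do: sweep([a | rest], [b | acc], true)

  defp sweep([a | rest], acc, swapped?),
    do: sweep(rest, [a | acc], swapped?)

  defp sweep([], acc, swapped?),
    do: {Enum.reverse(acc), swapped?}
end
```

`BubbleSort.sort([5, 3, 8, 1])` returns `[1, 3, 5, 8]`. A model that cannot produce something in this ballpark (pattern matching, recursion, no mutable loops) is arguably not useful for Elixir completion.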
StarCoder, WizardCoder, and the replit one would have to be fine-tuned first. I am not expecting them to give good results out of the box.
@josevalim Is there an existing dataset for Elixir?
I think 1 and 2 are related. You could create embeddings of all the cells in a notebook and store them in some vector table (SQLite?). When querying the chat, create embeddings of the query and the previous context and use those. This would let you at least ask questions about your notebook, which is useful; I'm just not sure how impactful it would be. And not sure how useful it is if the models don't understand Elixir well enough. :/
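A minimal sketch of the idea above: rank notebook cells against a query by cosine similarity of their embeddings. Note that the `{source, vector}` cell shape and the embedding step are assumptions; in practice the vectors would come from Bumblebee or an embeddings API, and storage could be SQLite as suggested.

```elixir
defmodule NotebookRAG do
  # Rank notebook cells by cosine similarity to a query embedding and
  # keep the top k to splice into the chat prompt as context.
  def top_cells(cells, query_vec, k \\ 3) do
    cells
    |> Enum.map(fn {source, vec} -> {source, cosine(vec, query_vec)} end)
    |> Enum.sort_by(fn {_source, score} -> score end, :desc)
    |> Enum.take(k)
  end

  defp cosine(a, b) do
    dot = Enum.zip_with(a, b, &(&1 * &2)) |> Enum.sum()
    dot / (norm(a) * norm(b))
  end

  defp norm(v), do: :math.sqrt(Enum.sum(Enum.map(v, &(&1 * &1))))
end
```

With pre-computed vectors, `NotebookRAG.top_cells([{"defmodule Foo ...", [1.0, 0.0]}, {"# Intro", [0.0, 1.0]}], [1.0, 0.0], 1)` returns the code cell first.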
Hey folks, just joining the discussion from #2222
A few thoughts from my side
In addition to the types of LLM functionality you mention @josevalim, I'd add two others
ChatGPT "Advanced Data Analysis" and Open Interpreter go beyond a simple chat panel and implement a full "development loop". You'd be able to say, e.g., "I want to generate all strings with 5 characters that aren't English-language words but are easy to pronounce" (aka pseudowords), and it works out how to do that. Both Open Interpreter and Advanced Data Analysis did that for me almost entirely unassisted.
Another small quality-of-life feature is automatically fixing errors and warnings. In Cursor I can just click "Fix with AI" on any error or warning, and it works about 50% of the time.
The only model I've tried that gives me really good results for Elixir is GPT-4. I use it dozens of times every day in the Cursor editor.
With GPT-4, retrieval-augmented generation (storing embeddings of documentation and code in a vector DB and pulling them into the prompt context) has been a mixed bag so far.
The main problem with retrieval-augmented generation and GPT-4 is the relatively small context window. You need to be very sure that a piece of context is relevant enough to include in the prompt, and that means you end up omitting a lot of relevant context.
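The trade-off described above can be sketched as a greedy packing problem: given similarity-scored chunks, keep only the highest-scoring ones until the token budget runs out, dropping everything else. The 4-bytes-per-token estimate below is a rough assumption, not a real tokenizer.

```elixir
defmodule ContextBudget do
  # Greedily pack the highest-scoring chunks into a fixed token budget.
  # Everything that does not fit is dropped, which is exactly how relevant
  # context gets lost when the window is small.
  def pack(scored_chunks, max_tokens) do
    scored_chunks
    |> Enum.sort_by(fn {_chunk, score} -> score end, :desc)
    |> Enum.reduce({[], 0}, fn {chunk, _score}, {kept, used} ->
      # Rough token estimate: ~4 bytes per token, plus one for separators.
      cost = div(byte_size(chunk), 4) + 1

      if used + cost <= max_tokens do
        {[chunk | kept], used + cost}
      else
        {kept, used}
      end
    end)
    |> then(fn {kept, _used} -> Enum.reverse(kept) end)
  end
end
```

With a budget of 2 "tokens", `ContextBudget.pack([{"aaaa", 0.9}, {"bbbbbbbbbbbb", 0.8}], 2)` keeps only `["aaaa"]`, illustrating how a lower-scoring but still relevant chunk silently falls out of the prompt.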
Anthropic's Claude is much better at considering large contexts (and it's much faster!), but it can't reason about code very well. I hear that the next version of Claude closes that gap, and I'm pretty stoked for that. In general, once a model's coding ability is "good enough", speed becomes the most important feature: the faster the model, the lower the threshold for actually using it.
My sense is that a model that can already program and reason about code really well (like GPT-4, the upcoming version of Claude, and possibly future larger versions of Code Llama or similar) would benefit a lot from Elixir-specific fine-tuning (compared to just retrieval-augmented generation). The big advantage the Elixir (and Phoenix) ecosystem has in this regard is that everything is incredibly well documented, and there is a good canonical set of information to fine-tune on (e.g. the docs of Elixir, Phoenix, and the most commonly used packages; OSS projects like Livebook; Phoenix sample apps; Elixir Forum/Slack/Discord posts; etc.)
Another useful source of data for fine-tuning would be human feedback coming from feedback buttons on generated content within Livebook, or on an "AI bot" that lives in the Elixir Forum, Slack, Discord, and GitHub. Again, this is an area where the small, close-knit Elixir community with a benevolent dictator may be able to do things that other communities can't.
If the end result is a model that can do most data analysis tasks end-to-end in a Livebook, or implement entire product features in a Phoenix app, I don't think fine-tuning would be cost-prohibitive, even if it costs several hundred thousand dollars.
As it happens, I'm in SF this week chatting to folks at both OpenAI and Anthropic about a number of things, and I'm already planning to ask them how we could fine-tune an elixir-only model.
Let me know if there's anything specifically you'd like me to ask those folks
Apologies for the long ramblings, I'm awake in the middle of the night and severely jetlagged 🤣
Hi @jonastemplestein, that was a great report. :)
> [RAG] doesn't work super well for "codebase level" questions in large codebases.
That has also been my experience, even before we reach the LLM layer. Indexing all of the Elixir API docs and then performing an embedding search did not provide good results. Asking what a charlist is did not bring up the relevant section in a moduledoc; more often it would return a specific function, such as `String.to_charlist`, and feeding that into the LLM made it worse.
> You can then "@-mention" the documentation in your LLM prompt and Cursor tries to find relevant bits of documentation and adds them to the context.
This is an amazingly cool idea, being able to "@-mention" the documentation. And given how Elixir stores docs, it should be trivial to include the documentation of something. The only problem I can think of here is guides: we can programmatically fetch API docs, but not a package's guides. Although I would assume that, for something as large as Ecto or Phoenix, a lot of knowledge already exists in the neural network, because you can't pass all of its docs or guides in the context anyway.
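For what it's worth, fetching a module's docs programmatically really is a one-liner thanks to the standard docs chunk (EEP 48) that `Code.fetch_docs/1` reads. A minimal sketch of resolving a hypothetical "@-mention" into prompt context (the `DocMention` module and its shape are illustrative, not an existing Livebook API):

```elixir
defmodule DocMention do
  # Resolve an "@-mentioned" module into text that could be appended to an
  # LLM prompt, using the EEP 48 docs chunk via Code.fetch_docs/1.
  def doc_for(module) do
    case Code.fetch_docs(module) do
      # Standard Elixir modules store the moduledoc as %{"en" => markdown}.
      {:docs_v1, _anno, :elixir, _format, %{"en" => moduledoc}, _meta, _docs} ->
        {:ok, moduledoc}

      # Module compiled without a moduledoc (or with @moduledoc false).
      {:docs_v1, _anno, :elixir, _format, _moduledoc, _meta, _docs} ->
        {:error, :no_moduledoc}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

For example, `DocMention.doc_for(Enum)` yields `{:ok, markdown}` with `Enum`'s moduledoc, ready to splice into the context, assuming the runtime has its docs chunks available.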
Perhaps initially we should focus on supporting some sort of @-mention and passing the notebook itself as context.
That said, I will strike out the RAG idea from the initial issue description, I believe we should focus on two separate problems: chat-based interfaces and code completion. For the chat-based interface (and everything that falls out of it, such as explaining code, explaining exceptions, refactoring code, automatically fixing errors, etc), we will likely need to use an external LLM, due to the required size. But I still hope we can run our own model for code completion, fine-tuned to Elixir.
> As it happens, I'm in SF this week chatting to folks at both OpenAI and Anthropic about a number of things, and I'm already planning to ask them how we could fine-tune an elixir-only model.
If we can fine-tune any of them for Elixir, it would be amazing, and it would definitely help us gear towards a decision. Let me know if there is anything I can do to help. We can also help with collecting the data: both companies probably already have crawlers they could point at hex.pm, but we can also try to build a large archive with all the API references from Hex.pm. The only challenge is that it would need to be refreshed/updated somewhat periodically.
> Asking what a charlist is did not bring up the relevant section in a moduledoc; more often it would return a specific function, such as `String.to_charlist`, and feeding that into the LLM made it worse.
Can I bother you for the exact details of how you implemented the RAG here?
What was the exact search method?
The reason I ask is that capturing meaning (at least in this case) matters. If your query is "what is a charlist", how well the search captures that meaning determines what gets retrieved.
Edit: I just want to point out that I actually don't disagree; I'm just wondering what went wrong.
Hi folks, we have a rough plan here.
We are still exploring options for code completion. I would really love it if this were something we can run in Elixir itself, so people could run it on their machines or their own servers trivially (assuming at most 3B params). Our preferences (in order) are:
With OpenAI's latest announcements, they have an offering (Assistants) that addresses problems 2, 3, and 4 for us, which makes it very compelling. Therefore, it is likely we will move forward by integrating with anything that exposes the OpenAI Assistants API. This means:
There are several tools that expose the OpenAI API on top of open-source models, and it is just a matter of time until they are updated to offer the new Assistants API on top of Llama, Mistral, etc. This would allow anyone to bring their own LLM too.
Companies and users can bring their own OpenAI assistant into Livebook (which, for example, could include their own private docs for retrieval)
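A sketch of why "anything that exposes the OpenAI API" is appealing: only the base URL changes between OpenAI itself and a local OpenAI-compatible server. The example below is a plain chat-completions call (the Assistants API layers more endpoints on top) and assumes the `Req` HTTP client and a user-supplied key; it is illustrative, not Livebook's actual integration code.

```elixir
defmodule LLMClient do
  # Minimal "bring your own LLM" call against any OpenAI-compatible endpoint.
  # base_url: "https://api.openai.com" for OpenAI, or e.g. a local server's URL.
  def chat(base_url, api_key, prompt) do
    Req.post!("#{base_url}/v1/chat/completions",
      auth: {:bearer, api_key},
      json: %{
        model: "gpt-4",
        messages: [%{role: "user", content: prompt}]
      }
    )
    |> Map.fetch!(:body)
    |> get_in(["choices", Access.at(0), "message", "content"])
  end
end
```

Swapping in a self-hosted model would then mean changing `base_url` (and the `model` name), with no other code changes on the Livebook side.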
We want to integrate LLMs as part of Livebook itself. There are at least four distinct levels this can happen:
1. Code completion (may or may not need an LLM) (options: Codeium, Copilot, fine-tuned Repl.it model)
2. Chat-based (which also includes selecting code and asking to document it, as well as explaining exceptions) (options: Codeium, Copilot, OpenAI, Claude)
3. ~Semantic search over all installed packages in the runtime (may or may not need an LLM) (options: Codeium, Copilot, OpenAI, Claude)~
4. ~Function completion (we can allow kino/smart-cells to register functions which we hook into prompts, similar to HF Agents) (options: OpenAI)~
This is a meta-issue and we are currently doing proofs of concept in different areas. There is no clear decision yet. We will most likely allow users to "bring their own LLM" for most of these categories, especially 2, 3, and 4 (either commercial or self-hosted).