raymyers / swe-bench-util

Scripts for working with SWE-Bench, the AI coding agent benchmark
Apache License 2.0

Generate embeddings from SWE-bench repos #1

Open raymyers opened 4 months ago

raymyers commented 4 months ago

Background

SWE-bench has "assisted" and "unassisted" scores: assisted means the model is told which files to modify. Devin is presumed to hold the SOTA unassisted score of 14%. The Claude 3 Opus model with no agent achieves a strong 11% assisted score, as reported by zraytam. This means a standalone "Oracle-substitute" that only guessed the relevant files to modify could get us well on the way.

Direction

A solution proposed by @AtlantisPleb involves building the Oracle-substitute on a Retrieval-Augmented Generation (RAG) process.

This approach could potentially match Devin's scores within the next 1-2 days by iterating on the process with more sophisticated LLM prompts or smarter codebase traversal methods.
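As a minimal sketch of that direction, the retrieval step could look roughly like the following, assuming OpenAI's text-embedding-3-small and a plain cosine-similarity ranking over whole files. The function names and model choice here are illustrative, not what swe-bench-util actually does.

```python
from pathlib import Path

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])


def guess_files(repo_dir, problem_statement, top_k=5):
    """Return the top_k Python files most similar to the issue description."""
    paths = [p for p in Path(repo_dir).rglob("*.py") if p.stat().st_size < 100_000]
    docs = [p.read_text(errors="ignore")[:8000] for p in paths]  # crude truncation
    file_vecs = embed(docs)
    query_vec = embed([problem_statement])[0]
    # cosine similarity between the issue text and each file
    sims = file_vecs @ query_vec / (
        np.linalg.norm(file_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    top = np.argsort(-sims)[:top_k]
    return [str(paths[i]) for i in top]
```

From there the evaluation is just whether the gold files appear in the returned list.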

Considerations

Generating embeddings for full repositories is resource-intensive, but this can be optimized by doing it once per repo in SWE-bench and reusing the embeddings. We are considering hosting a shared copy for experimentation using a service such as Pinecone.

The repos are not all in one state: every exercise starts at a different point in time (git hash). Perhaps we can avoid redundant processing with tagging; this needs to be fleshed out.
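One way to start fleshing that out is to measure how much overlap actually exists, e.g. by counting distinct (repo, base_commit) pairs. This sketch assumes the Hugging Face princeton-nlp/SWE-bench dataset and its repo / base_commit columns:

```python
# Count distinct repo snapshots in SWE-bench to see how much embedding work
# can actually be shared across instances.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench", split="test")
snapshots = Counter((row["repo"], row["base_commit"]) for row in ds)

print(f"{len(ds)} instances across {len(snapshots)} unique repo snapshots")
for (repo, commit), n in snapshots.most_common(5):
    print(f"{repo} @ {commit[:8]}: {n} instances")
```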

Related Resources

AtlantisPleb commented 4 months ago

Cool. Thankfully we should only need to vector-embed one snapshot of each codebase: at the commit hash specified in the SWE-bench dataset. Embeddings can use standard Pinecone (or whatever) metadata/tags to record the associated commit hash in case we need that in the future.
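Concretely, the metadata could look something like this. This is a sketch against the Pinecone Python client; the index name, id scheme, and chunking are placeholders, not an agreed design.

```python
# Tag each embedded chunk with its repo and commit hash so one snapshot per
# codebase can be stored and reused later.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("swe-bench-embeddings")


def upsert_chunks(repo, commit, chunks):
    """chunks: list of dicts with 'id', 'values' (the embedding), and 'path'."""
    index.upsert(
        vectors=[
            {
                "id": f"{repo}@{commit}:{c['id']}",
                "values": c["values"],
                "metadata": {"repo": repo, "commit": commit, "path": c["path"]},
            }
            for c in chunks
        ]
    )


# Later, restrict a query to the snapshot a given SWE-bench instance uses:
# index.query(vector=query_vec, top_k=10, include_metadata=True,
#             filter={"repo": repo, "commit": commit})
```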

phact commented 4 months ago

Took a stab at this https://github.com/raymyers/swe-bench-util/pull/2 lmk your thoughts.

raymyers commented 4 months ago

Added a `get oracle` command.
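(In SWE-bench terms, the "oracle" is the set of files touched by the instance's gold patch. A rough sketch of extracting that list follows; it assumes the Hugging Face dataset schema and is illustrative, not the actual command's implementation.)

```python
# Pull the file paths modified by an instance's gold patch -- the "assisted"
# hint that the oracle provides.
import re

from datasets import load_dataset


def oracle_files(patch):
    """Files modified by a unified diff."""
    return re.findall(r"^diff --git a/(\S+) b/", patch, flags=re.MULTILINE)


ds = load_dataset("princeton-nlp/SWE-bench", split="dev")
row = ds[0]
print(row["instance_id"], oracle_files(row["patch"]))
```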

phact commented 4 months ago

I saw in your other repo @AtlantisPleb that you're making LLM descriptions for each file. Intuitively it seems like these may be as useful as the chunked embeddings if the goal is just to select the right files to edit. Thoughts? Do you have any observations yet on how recall performs with the descriptions instead of the chunked code?

raymyers commented 4 months ago

@phact

> making LLM descriptions for each file. Intuitively it seems like these may be as useful as the chunked embeddings if the goal is just to select the right files to edit

This seems like a promising direction. My experience was that summaries worked very well in experiments last year, when I was trying to get GPT-4 to identify the relevant function in DukeNukem3D code.

What I was comparing at the time was open-source embeddings (krlvi/sentence-msmarco-bert-base-dot-v5-nlpl-code_search_net) vs. one-line GPT-3.5-Turbo summaries per file chunk, feeding many summaries directly to the LLM to choose which area to "zoom in" on. So it was not an apples-to-apples comparison, and it didn't use the new OpenAI embeddings, which might perform better on code; not sure.
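Roughly, the summaries approach looked like this. The prompts and model names below are illustrative with current OpenAI client calls, not the exact setup from that experiment.

```python
# One-line LLM summary per file, then a single prompt asking the model which
# files to "zoom in" on.
from openai import OpenAI

client = OpenAI()


def summarize(path, code):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Summarize this file in one line.\n\n# {path}\n{code[:6000]}",
        }],
    )
    return resp.choices[0].message.content.strip()


def pick_files(issue, summaries, top_k=5):
    """summaries: {path: one-line summary}; returns the model's file picks."""
    listing = "\n".join(f"{p}: {s}" for p, s in summaries.items())
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Issue:\n{issue}\n\nFile summaries:\n{listing}\n\n"
                f"List the {top_k} files most likely to need changes, one per line."
            ),
        }],
    )
    return resp.choices[0].message.content
```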

I made a video about it back in April 2023, so it might not be that useful anymore. Point being, summaries showed a lot of promise.