raymyers / swe-bench-util

Scripts for working with SWE-Bench, the AI coding agent benchmark
Apache License 2.0

Generate embeddings from SWE-bench repos #1

Open raymyers opened 4 months ago

raymyers commented 4 months ago

Background

SWE-bench has "assisted" and "unassisted" scores: assisted means the model is told which files to modify. Devin is presumed to hold the SOTA unassisted score of 14%. The Claude 3 Opus model with no agent achieves a strong 11% assisted score, as reported by zraytam. This means a standalone "Oracle-substitute" that only guessed the relevant files to modify could get us well on the way.

Direction

A solution proposed by @AtlantisPleb involves building the Oracle-substitute on a Retrieval-Augmented Generation (RAG) process.

This approach could potentially match Devin's scores within the next 1-2 days by iterating on the process with more sophisticated LLM prompts or smarter codebase traversal methods.
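As a minimal sketch of that direction, the retrieval step could look roughly like the following, assuming OpenAI's text-embedding-3-small and a plain cosine-similarity ranking over whole files. The function names and model choice here are illustrative, not what swe-bench-util actually does.

```python
from pathlib import Path

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])


def guess_files(repo_dir, problem_statement, top_k=5):
    """Return the top_k Python files most similar to the issue description."""
    paths = [p for p in Path(repo_dir).rglob("*.py") if p.stat().st_size < 100_000]
    docs = [p.read_text(errors="ignore")[:8000] for p in paths]  # crude truncation
    file_vecs = embed(docs)
    query_vec = embed([problem_statement])[0]
    # cosine similarity between the issue text and each file
    sims = file_vecs @ query_vec / (
        np.linalg.norm(file_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    top = np.argsort(-sims)[:top_k]
    return [str(paths[i]) for i in top]
```

From there the evaluation is just whether the gold files appear in the returned list.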

Considerations

Generating embeddings for full repositories is resource-intensive, but this can be optimized by doing it once per repo in SWE-bench and reusing the embeddings. We are considering hosting a shared copy for experimentation using a service such as Pinecone.

The repos are not all in one state: every exercise starts at a different point in time (git hash). Perhaps we can avoid redundant processing with tagging; this needs to be fleshed out.
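One way to start fleshing that out is to measure how much overlap actually exists, e.g. by counting distinct (repo, base_commit) pairs. This sketch assumes the Hugging Face princeton-nlp/SWE-bench dataset and its repo / base_commit columns:

```python
# Count distinct repo snapshots in SWE-bench to see how much embedding work
# can actually be shared across instances.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench", split="test")
snapshots = Counter((row["repo"], row["base_commit"]) for row in ds)

print(f"{len(ds)} instances across {len(snapshots)} unique repo snapshots")
for (repo, commit), n in snapshots.most_common(5):
    print(f"{repo} @ {commit[:8]}: {n} instances")
```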

Related Resources

AtlantisPleb commented 4 months ago

Cool. Thankfully we should only need to vector-embed one snapshot of each codebase: at the commit hash specified in the SWE-bench dataset. Embeddings can use standard Pinecone (or whatever) metadata/tags to record the associated commit hash in case we need that in the future.
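Concretely, the metadata could look something like this. This is a sketch against the Pinecone Python client; the index name, id scheme, and chunking are placeholders, not an agreed design.

```python
# Tag each embedded chunk with its repo and commit hash so one snapshot per
# codebase can be stored and reused later.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("swe-bench-embeddings")


def upsert_chunks(repo, commit, chunks):
    """chunks: list of dicts with 'id', 'values' (the embedding), and 'path'."""
    index.upsert(
        vectors=[
            {
                "id": f"{repo}@{commit}:{c['id']}",
                "values": c["values"],
                "metadata": {"repo": repo, "commit": commit, "path": c["path"]},
            }
            for c in chunks
        ]
    )


# Later, restrict a query to the snapshot a given SWE-bench instance uses:
# index.query(vector=query_vec, top_k=10, include_metadata=True,
#             filter={"repo": repo, "commit": commit})
```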

phact commented 4 months ago

Took a stab at this https://github.com/raymyers/swe-bench-util/pull/2 lmk your thoughts.

raymyers commented 4 months ago

Added a `get oracle` command.
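(In SWE-bench terms, the "oracle" is the set of files touched by the instance's gold patch. A rough sketch of extracting that list follows; it assumes the Hugging Face dataset schema and is illustrative, not the actual command's implementation.)

```python
# Pull the file paths modified by an instance's gold patch -- the "assisted"
# hint that the oracle provides.
import re

from datasets import load_dataset


def oracle_files(patch):
    """Files modified by a unified diff."""
    return re.findall(r"^diff --git a/(\S+) b/", patch, flags=re.MULTILINE)


ds = load_dataset("princeton-nlp/SWE-bench", split="dev")
row = ds[0]
print(row["instance_id"], oracle_files(row["patch"]))
```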

phact commented 4 months ago

I saw in your other repo @AtlantisPleb that you're making LLM descriptions for each file. Intuitively it seems like these may be as useful as the chunked embeddings if the goal is just to select the right files to edit. Thoughts? Do you have any observations yet on how recall performs with the descriptions instead of the chunked code?

raymyers commented 4 months ago

@phact

> making LLM descriptions for each file. Intuitively it seems like these may be as useful as the chunked embeddings if the goal is just to select the right files to edit

This seems like a promising direction. My experience was that summaries worked very well in experiments last year, when I was trying to get GPT-4 to identify the relevant function in DukeNukem3D code.

What I was comparing at the time was open-source embeddings (krlvi/sentence-msmarco-bert-base-dot-v5-nlpl-code_search_net) vs. one-line GPT-3.5-Turbo summaries per file chunk, feeding many summaries directly to the LLM to choose which area to "zoom in" on. So it was not an apples-to-apples comparison, and it didn't use the new OpenAI embeddings, which might perform better on code; not sure.
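Roughly, the summaries approach looked like this. The prompts and model names below are illustrative with current OpenAI client calls, not the exact setup from that experiment.

```python
# One-line LLM summary per file, then a single prompt asking the model which
# files to "zoom in" on.
from openai import OpenAI

client = OpenAI()


def summarize(path, code):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Summarize this file in one line.\n\n# {path}\n{code[:6000]}",
        }],
    )
    return resp.choices[0].message.content.strip()


def pick_files(issue, summaries, top_k=5):
    """summaries: {path: one-line summary}; returns the model's file picks."""
    listing = "\n".join(f"{p}: {s}" for p, s in summaries.items())
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Issue:\n{issue}\n\nFile summaries:\n{listing}\n\n"
                f"List the {top_k} files most likely to need changes, one per line."
            ),
        }],
    )
    return resp.choices[0].message.content
```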

I made a video about it back in April 2023, so it might not be that useful anymore. Point being, summaries showed a lot of promise.