raymyers / swe-bench-util

Scripts for working with SWE-Bench, the AI coding agent benchmark
Apache License 2.0

assistants API file upload #2

Closed phact closed 5 months ago

phact commented 5 months ago

Added a new CLI command, process-repo-files, which clones the repo, checks out the right commit hash, and iterates (sequentially for now) over all the files, uploading them to assistants-api. This gets us embeddings and RAG for free:

python -m swe_bench_util process-repo-files

I clone to /tmp for now, not sure if that's what we want.

In the end it dumps a list of file_ids that you can give your assistant for searching.
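
For readers following along, here is a rough sketch of the flow described above: clone, check out the commit, walk the files, upload each one, and collect the ids. This is not the code from the PR. The function names (checkout_repo, iter_repo_files, upload_files) and the injectable upload callable are hypothetical, chosen so the walk can be exercised without network access.

```python
import os
import subprocess
from pathlib import Path
from typing import Callable, Iterator, List


def checkout_repo(repo_url: str, commit: str, dest: str) -> None:
    """Clone repo_url into dest, then switch to the given commit."""
    subprocess.run(["git", "clone", repo_url, dest], check=True)
    subprocess.run(["git", "checkout", commit], cwd=dest, check=True)


def iter_repo_files(repo_dir: str) -> Iterator[Path]:
    """Yield every file under repo_dir in a stable order, skipping .git."""
    for root, dirs, files in os.walk(repo_dir):
        # Prune .git in place so os.walk does not descend into it.
        dirs[:] = sorted(d for d in dirs if d != ".git")
        for name in sorted(files):
            yield Path(root) / name


def upload_files(repo_dir: str, upload: Callable[[Path], str]) -> List[str]:
    """Upload files one at a time (sequentially) and return the file ids.

    `upload` is a hypothetical callable that takes a path and returns an id.
    A failed upload is reported and skipped rather than aborting the run.
    """
    file_ids: List[str] = []
    for path in iter_repo_files(repo_dir):
        try:
            file_ids.append(upload(path))
        except Exception as err:  # e.g. an unsupported file type
            print(f"{path}: {err}")
    return file_ids
```

With the official openai Python client (v1), the upload callable could plausibly wrap client.files.create(file=open(path, "rb"), purpose="assistants") and return the resulting object's id; treat that wiring as an assumption, since the PR's actual implementation is not shown here.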

If this is interesting I'm happy to help optimize how assistants chunks and embeds for our purposes.

raymyers commented 5 months ago

Thanks @phact! Looks like a minor rebase is needed. At first glance I think this is a pretty good start, except for a couple of notes:

raymyers commented 5 months ago

I'm getting this for most files; we may need to do something to make OpenAI understand these are text files:

Dockerfile: Error code: 501 - {'message': 'Unsupported file type'}

phact commented 5 months ago

I was thinking about what file types we should add. Maybe the play is to use a block list like sweep does:

https://github.com/sweepai/sweep/blob/bdcd1195bc8c2a90aa277a9169dcb13eed702868/docs/pages/blogs/generating-50k-embeddings-with-gte.mdx#L22-L26
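
A block-list filter along these lines might look like the sketch below. The directory and suffix sets are illustrative guesses, not sweep's actual list, and the names is_blocked and filter_paths are hypothetical. The idea is to skip anything known to be binary or generated and attempt everything else, so extensionless text files such as Dockerfile still get through.

```python
from pathlib import Path
from typing import Iterable, List

# Illustrative entries only; the real list would need tuning per repo.
BLOCKED_DIRS = {".git", "node_modules", "venv", "__pycache__"}
BLOCKED_SUFFIXES = (".png", ".jpg", ".gif", ".zip", ".pyc", ".lock", ".min.js")


def is_blocked(path: str) -> bool:
    """Return True if the path sits in a blocked directory or has a blocked suffix."""
    p = Path(path)
    if any(part in BLOCKED_DIRS for part in p.parts):
        return True
    # Match on the full file name so compound suffixes like .min.js work.
    return p.name.lower().endswith(BLOCKED_SUFFIXES)


def filter_paths(paths: Iterable[str]) -> List[str]:
    """Keep only the paths that are not on the block list."""
    return [p for p in paths if not is_blocked(p)]
```

Compared with an allow list of known extensions, this errs on the side of uploading too much, which seems the safer default when many source files have unusual or missing extensions.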

raymyers commented 5 months ago

Merging. I'm going to structure it a bit and turn my notes into a new issue.