simonw / symbex

Find the Python code for specified symbols
Apache License 2.0

Fuzzy search with code embeddings? #42

Closed walking-octopus closed 8 months ago

walking-octopus commented 10 months ago

Code embedding models such as CodeBERT are available. If efficient CPU inference is a goal and you don't mind some quirks, bert.cpp could be a good GGML-based inference framework.

I think this could enable queries based on intent or approximate names. A symbol like get_status could be fuzzy-matched to fetchSystemState, and even a natural-language query like "how is the CSV parsing implemented?" might yield good-enough results. That could in turn allow a Bash script built around a ReAct-style pattern, where the LLM uses a code-search tool to fetch context.
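
To make the fuzzy-matching idea concrete, here is a minimal sketch in Python. It is not part of Symbex. The `toy_embed` function is a hypothetical stand-in for a real embedding model such as CodeBERT: it only counts character trigrams of the words inside an identifier, so it captures surface overlap (status vs. State) rather than true semantic similarity (get vs. fetch). All function names here are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def split_identifier(name):
    """Split snake_case and camelCase names into lowercase words."""
    words = []
    for part in re.split(r"[^A-Za-z0-9]+", name):
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part)
    return [w.lower() for w in words]


def toy_embed(name):
    """Hypothetical stand-in for a model: character trigram counts per word."""
    grams = Counter()
    for word in split_identifier(name):
        padded = f"^{word}$"  # mark word boundaries
        for i in range(len(padded) - 2):
            grams[padded[i:i + 3]] += 1
    return grams


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def fuzzy_match(query, symbols, top_k=3):
    """Return the top_k (score, symbol) pairs, best match first."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(s)), s) for s in symbols]
    return sorted(scored, reverse=True)[:top_k]


if __name__ == "__main__":
    for score, symbol in fuzzy_match(
        "get_status", ["fetchSystemState", "parse_csv", "getStatus"]
    ):
        print(f"{score:.2f}  {symbol}")
```

Swapping `toy_embed` for a function that returns vectors from a real model would leave the ranking logic unchanged, which is where matches driven by intent rather than spelling would come from.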

simonw commented 8 months ago

I like this as an ability, but I think it's something that should live in an additional tool rather than being baked into Symbex itself.

I actually use Symbex along with my LLM embedding tool for semantic search - building a search index that's populated using Symbex. I wrote more about how I do that in the release notes here: https://github.com/simonw/symbex/releases/tag/1.4
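
As a rough illustration of the "search index populated from extracted symbols" idea, the sketch below builds a tiny in-memory index in Python. It is not the workflow from the linked release notes, which uses the symbex and llm command-line tools. It assumes the symbols have already been exported as newline-delimited JSON, with hypothetical `id` and `code` fields per record, and it uses a toy word-count embedding where a real setup would call an embedding model.

```python
import json
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(ndjson_lines):
    """Embed each record; assumes one JSON object per line with "id" and "code"."""
    index = []
    for line in ndjson_lines:
        if line.strip():
            record = json.loads(line)
            index.append((record["id"], toy_embed(record["code"])))
    return index


def search(index, query, top_k=3):
    """Return the top_k (score, id) pairs, best match first."""
    q = toy_embed(query)
    scored = [(cosine(q, vec), symbol_id) for symbol_id, vec in index]
    return sorted(scored, reverse=True)[:top_k]


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        '{"id": "reader.py:10", "code": "def parse_csv(path): return list(csv.reader(open(path)))"}',
        '{"id": "status.py:3", "code": "def get_status(): return fetch_system_state()"}',
    ]
    print(search(build_index(sample), "csv parsing", top_k=1))
```

Keeping the index as plain `(id, vector)` pairs means the symbol extractor and the search layer stay separate tools, in line with the suggestion above that this belongs outside Symbex itself.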