kantord / SeaGOAT

local-first semantic code search engine
https://kantord.github.io/SeaGOAT/
MIT License

Better understanding of function names/class names #709

Open last-partizan opened 1 week ago

last-partizan commented 1 week ago

Thanks for this project, it looks really promising.

I just started using it, and here's what I found. The example is this repo itself:

```
> gt 'data_file_path' --context 0 --max-results 3
────────────────────────────────────────────────
File: seagoat/utils/server.py
────────────────────────────────────────────────
def _get_server_data_file_path() -> Path:
    path = _get_server_data_file_path()
    write_to_json_file(_get_server_data_file_path(), servers_info)
```

But when I split the name into words, it cannot find this function.

```
> gt 'get server data file path' --context 0 --max-results 3
────────────────────────────────────────────────
File: seagoat/server.py
────────────────────────────────────────────────
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
────────────────────────────────────────────────
File: docs/server.md
────────────────────────────────────────────────
called `cacheLocation` which contains the path to the cache directory for
each different type of cache associated with that project.
```

That's probably my main use case: finding something without knowing the exact name. I'm happy to help with fixing this, doing some research, or writing patches.

Maybe you have an idea of how to improve this? I see there's issue #354 about trying different models, and probably some other "code-search-oriented model" could improve this.

My absolutely non-AI-related guess: when encountering snake_case or SomeOtherCase names, convert them to normal words and index those too. But code-search-related models are probably already doing this...
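For concreteness, that conversion step could look something like this. This is only a sketch of the idea, not SeaGOAT code; the function name `split_identifier` is my own:

```python
import re

def split_identifier(name: str) -> str:
    """Split snake_case and CamelCase identifiers into plain lowercase words,
    so that 'get server data file path' can match '_get_server_data_file_path'."""
    parts = []
    # Break on underscores first, then on acronym/word/digit boundaries.
    for chunk in name.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return " ".join(p.lower() for p in parts if p)

print(split_identifier("_get_server_data_file_path"))  # get server data file path
print(split_identifier("SomeOtherCase"))               # some other case
```

The split text could be indexed alongside the original identifier, so both query styles find the same chunk.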

last-partizan commented 1 week ago

Oh, well.

It can use other embedding functions; I tried some ollama models with varying levels of success, and then WordLlama.

https://github.com/chroma-core/chroma/pull/2925

It looks promising, and it's lightning-fast even on a laptop with an AMD Ryzen 5 4500U.

kantord commented 5 days ago

I think what is going on here is that the sorting mechanism is not perfect: it is not based on an actual understanding of the query and the results.

Like you say, using a better embedding model could help with this, as semantic distance is one of the main sorting criteria. So we should definitely experiment with that.
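To make "semantic distance as a sorting criterion" concrete, here's a minimal sketch of distance-based ranking. This is not SeaGOAT's actual ranking code (which weighs other signals too); the names are mine:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; smaller means semantically closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def rank_results(query_vec: list[float], chunks: list[tuple[str, list[float]]]):
    """Sort (text, embedding) chunks by distance to the query, closest first."""
    return sorted(chunks, key=lambda chunk: cosine_distance(query_vec, chunk[1]))
```

With a weak embedding model, a query like "get server data file path" may simply not land close enough to the chunk containing `_get_server_data_file_path`, which is why swapping the model changes result quality directly.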

Another thing to add: I believe there are different potential ways of using the tool. For instance, in your use case the result would probably have shown up somewhere towards the top of the list, but not at the very top. This could be improved by also using an LLM to understand the query: for instance, with a RAG workflow we could collect a list of results that fits into the context limit of a local ollama model, and use that model to formulate the final answer. The upside would be that you don't have to "manually" peruse several lines of results to find what you are looking for. The downside is that the model could hallucinate or format the answer incorrectly, which could be addressed by some validation.
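The "fits into the context limit" step of that hypothetical RAG workflow could be sketched like this. All names here are mine, and the character budget is a stand-in for real token counting:

```python
def build_rag_prompt(query: str, results: list[str], char_budget: int = 4000) -> str:
    """Pack as many pre-sorted search results as fit into a rough context
    budget, then ask the model to pick the best match (hypothetical workflow)."""
    context_parts: list[str] = []
    used = 0
    for result in results:  # assumed already sorted, best candidates first
        if used + len(result) > char_budget:
            break
        context_parts.append(result)
        used += len(result)
    context = "\n---\n".join(context_parts)
    return (
        f"Given these code search results:\n{context}\n\n"
        f"Answer the query: {query}\n"
        "Reply only with the file path and line that best matches."
    )
```

The resulting prompt would then be sent to a local model, and the reply validated (e.g. checking that the returned file path actually exists in the repository) before showing it to the user.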

Yet another thing is to improve the chunking: currently we use the actual code lines (based on some heuristics to ignore irrelevant lines) as well as the file names to create the embeddings. Instead of this, we could use a generative model to actually understand the function of each code line and add additional context to the embedding. This should be fairly simple to do, but it would greatly slow down the chunking process. But if we actually have a faster model now, it would be a good time to experiment with it.
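A rough sketch of that enrichment idea, with the generative model stubbed out (all names here are hypothetical, not SeaGOAT's internals):

```python
def describe_line(code_line: str) -> str:
    """Stand-in for a generative model call. A real implementation would ask
    a local LLM "what does this line do?", which is the slow part."""
    return f"(description of: {code_line.strip()})"

def build_chunk_text(file_name: str, code_line: str, enrich: bool = False) -> str:
    """Text that gets embedded: file name + code line, optionally with
    model-generated context prepended."""
    base = f"{file_name}: {code_line.strip()}"
    if enrich:
        return f"{describe_line(code_line)}\n{base}"
    return base
```

The extra description gives the embedding model natural-language words ("returns the path to the server data file") to match against natural-language queries, at the cost of one model call per chunk at indexing time.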

last-partizan commented 4 days ago

> Instead of this, we could use a generative model to actually understand the function of the code line and add additional context to the embedding.

I was thinking the embedding function is supposed to do this.

Maybe it could be achieved by using larger chunks, probably functions/classes or some other top-level structures, with a model like this?

https://huggingface.co/jinaai/jina-embeddings-v2-base-code
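Extracting those larger, top-level chunks is straightforward with the standard library's `ast` module; a minimal sketch (not SeaGOAT code, and it only covers Python sources):

```python
import ast

def top_level_chunks(source: str) -> list[tuple[str, str]]:
    """Extract (name, code) pairs for top-level functions and classes,
    which could then be embedded as whole units instead of single lines."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks
```

A code-oriented embedding model with a long context window (like the Jina model linked above) could then embed each whole function or class, so that the surrounding body provides the semantic signal that a lone identifier line lacks.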