kantord / SeaGOAT

local-first semantic code search engine
https://kantord.github.io/SeaGOAT/
MIT License
969 stars 62 forks

Monorepo/big repo support #2

Open · kantord opened this issue 1 year ago

kantord commented 1 year ago

This is a list of ideas to support huge repositories and monorepos

cori commented 11 months ago

I'm curious what your definition of "huge" is here.

Context: I've been playing with SeaGOAT against two private repos - one with ~4.5k files and one with ~185k. I've not found gt useful yet against the 185k repo - it returns way too much info to be helpful and takes ~3m to return from a query. I'm going to reindex excluding all of our tests to see if that helps, but am wondering if this repo is just not a good use-case for SeaGOAT.

I am connecting to a seagoat-server running on a barely-remote host (a reasonably-beefy "gaming NUC" on my local network), but that doesn't seem likely to be a huge issue - I'd anticipate any network latency to be (perhaps more-than) offset by the additional resources it has vs running on my somewhat-loaded MBP.

kantord commented 11 months ago

> I'm curious what your definition of "huge" is here.
>
> Context: I've been playing with SeaGOAT against two private repos - one with ~4.5k files and one with ~185k. I've not found gt useful yet against the 185k repo - it returns way too much info to be helpful

I would say that in theory it should be possible to make SeaGOAT useful even for 500K-file repos; it should just be a question of optimization. Of course, analyzing that many files would take a very long time regardless of optimization, but with good heuristics it should be possible to ensure that most of the files the user is actually looking for get analyzed quickly.
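To make the "heuristics" part concrete, here is a minimal sketch (not SeaGOAT's actual file selection - the scoring criteria here are invented for illustration):

```python
# Sketch of an "analyze the likely-relevant files first" heuristic.
# Assumption for illustration: recently modified, non-test files are
# what users search for most often.
from pathlib import Path

def prioritized_files(repo_root: str) -> list[Path]:
    """Return repo files ordered so the most likely search targets come first."""
    files = [p for p in Path(repo_root).rglob("*") if p.is_file()]

    def score(path: Path) -> tuple[bool, float]:
        looks_like_test = "test" in path.name.lower()    # push tests to the back
        return (looks_like_test, -path.stat().st_mtime)  # then newest first

    return sorted(files, key=score)

# The analyzer would work through this list top to bottom, so a huge repo
# becomes useful long before every file has been processed.
```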

Another thing is that it should be possible to optionally offload the effort to a remote server, or even to distribute it in a P2P fashion. I have not done much in this area yet, because first I want to learn more about the use cases and optimize the results and the analysis process. Each time the chunk analysis process is changed, the entire project has to be reanalyzed, so I want to do most of those changes early on in the project and minimize them later, once there is hopefully a large user base.

edit: and I think that in the end an almost arbitrary repo size should be supportable, at least if a distributed service is built - something along the lines of splitting the repo (or multiple repos) across multiple services and compiling the results at the end.
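Very roughly, something like this (just a sketch - the per-shard search() interface here is invented for illustration, not an API SeaGOAT has today):

```python
# Sketch of sharding a huge repo across several index servers and compiling
# one ranked result list from their answers.
from concurrent.futures import ThreadPoolExecutor

def shard(files: list[str], n_shards: int) -> list[list[str]]:
    """Assign every file to exactly one shard, round-robin."""
    return [files[i::n_shards] for i in range(n_shards)]

def federated_query(servers, query: str, limit: int = 30):
    """Query every shard server and merge their hits by score.

    Assumes each server exposes search(query, limit) -> [(score, path), ...],
    an interface made up for this sketch."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        per_server = list(pool.map(lambda s: s.search(query, limit), servers))
    merged = [hit for hits in per_server for hit in hits]
    merged.sort(key=lambda hit: hit[0], reverse=True)  # best score first
    return merged[:limit]
```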

kantord commented 11 months ago

> takes ~3m to return from a query.

Regarding this: does it also happen if you limit the number of results, for example with -l30?

I am considering limiting the number of results by default and requiring users to explicitly make it unlimited. Perhaps the server config could also have a hard limit, to avoid queries that make the server hang, or a per-query timeout.
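Roughly what I have in mind (just a sketch - the names, numbers and the search_fn interface below are placeholders, not real SeaGOAT options):

```python
# Sketch of the proposed safeguards: a default result limit, a server-side
# hard cap, and a per-query timeout. All values are placeholders.
from concurrent.futures import ThreadPoolExecutor, TimeoutError

DEFAULT_LIMIT = 30        # used when the client does not pass a limit at all
SERVER_HARD_LIMIT = 500   # server config: no query may ask for more than this
QUERY_TIMEOUT_SECONDS = 10

def effective_limit(requested: int | None) -> int:
    """Clamp the client's requested result count to what the server allows."""
    if requested is None:
        return DEFAULT_LIMIT
    return min(requested, SERVER_HARD_LIMIT)

def run_query(search_fn, query: str, requested_limit: int | None):
    """Run search_fn, but answer the client within the timeout instead of hanging."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(search_fn, query, effective_limit(requested_limit))
    try:
        return future.result(timeout=QUERY_TIMEOUT_SECONDS)
    except TimeoutError:
        return []  # or an explicit "query timed out" error for the client
    finally:
        pool.shutdown(wait=False)  # don't block the response on the slow query
```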

kantord commented 11 months ago

Also, for remote use there should be better multi-core support, I think. That won't make a difference in low-load situations, but it should at least help under higher load, which I'd guess matters if you are actually experimenting with something in the hundreds of thousands of files.
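Something along these lines, for example (a sketch only - analyze_chunk is a stand-in for the real per-chunk analysis, not the actual implementation):

```python
# Sketch of fanning chunk analysis out over all CPU cores on the server.
import os
from concurrent.futures import ProcessPoolExecutor

def analyze_chunk(chunk: str) -> dict:
    """Placeholder for the real per-chunk work (embedding, indexing, ...)."""
    return {"chunk": chunk, "length": len(chunk)}

def analyze_all(chunks: list[str]) -> list[dict]:
    """Analyze chunks on every core; mainly pays off under higher load."""
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        return list(pool.map(analyze_chunk, chunks, chunksize=64))

if __name__ == "__main__":  # ProcessPoolExecutor needs an importable entry point
    print(analyze_all(["def foo(): ...", "class Bar: ..."]))
```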

kantord commented 11 months ago

Regarding the slow queries in big repositories: there should be dramatic improvements now - I have seen queries speed up from 5-7 seconds to under 1s.