sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com

Search results over historical SHAs sometimes get capped at "1k+" (and also break code insights) #21998

Closed Joelkw closed 3 years ago

Joelkw commented 3 years ago

A customer is hitting a problem (that began with code insights) where their historical searches are sometimes capped at 1k results instead of returning the full results set (which renders their insights useless).

Full customer details / cases / examples here – what we know so far:

Is this intentional behavior, and if so, how can we work around it for code insights to work? If not, what's the fix we/they need?

Unable to reproduce this on dotcom or k8s myself.

stefanhengl commented 3 years ago

I think I can reproduce it on dotcom and locally: with sha (13.2k matches) vs. head (20.9k matches). This is the same issue, isn't it?

We are hitting different search paths (indexed for HEAD v. unindexed for any other sha). However, the result counts shouldn't be that different. I am looking into it.

stefanhengl commented 3 years ago

Unindexed search has a hard coded limit of 1000 filematches, which explains the behavior. @rvantonder It seems as if the limit was introduced for structural search a long time ago. Can/Should we increase it?

The limit protects us from OOMing because searcher is still batch-based. It seems that the only permanent fix would be to enable searcher for streaming.
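To make the behavior concrete, here is a minimal sketch of how a batch-based searcher with a hard file-match cap truncates results. The type and function names are hypothetical, not searcher's actual API; only the cap of 1000 file matches comes from the thread.

```go
package main

import "fmt"

// fileMatch stands in for a per-file search result; the name is
// hypothetical, not searcher's real type.
type fileMatch struct {
	path       string
	matchCount int
}

// maxFileMatches mirrors the hard-coded cap discussed above.
const maxFileMatches = 1000

// batchSearch collects results into one in-memory slice and stops at
// the cap, so every file past the 1000th is silently dropped. This is
// the truncation users see as "1k+".
func batchSearch(files []fileMatch) (results []fileMatch, limitHit bool) {
	for _, f := range files {
		if len(results) >= maxFileMatches {
			return results, true // truncated
		}
		results = append(results, f)
	}
	return results, false
}

func main() {
	// 2500 matching files in the repo, but only 1000 come back.
	files := make([]fileMatch, 2500)
	for i := range files {
		files[i] = fileMatch{path: fmt.Sprintf("f%d.go", i), matchCount: 1}
	}
	got, truncated := batchSearch(files)
	fmt.Println(len(got), truncated) // 1000 true
}
```

Because the whole result set is buffered before being returned, the cap is what keeps memory bounded; that is the OOM protection referred to above.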

rvantonder commented 3 years ago

@rvantonder It seems as if the limit was introduced for structural search a long time ago.

It wasn't introduced by that change; the change/blame just moves the hardcoded limits (it was there before in search_regex.go).

Can/Should we increase it?

Unsure, I don't have an informed opinion. Maybe @keegancsmith ?

where their historical searches are sometimes capped at 1k results instead of returning the full results set (which renders their insights useless).

So the question is: is this a true regression, or is it just the hardcoded limit on unindexed search that has always been in place, now being triggered by certain search patterns?

stefanhengl commented 3 years ago

I don't think it is a regression; it has been there for years. I wouldn't even call it a bug necessarily, but a limitation of batch search. We can raise the limit, but that means we will just hit it at some later time. This will always show up whenever we compare searches of common patterns in indexed vs. unindexed search.

@Joelkw How many historical shas does the customer want to search? They could use multi-branch indexing to put all those queries on the same code path. However, this is pretty manual. There is no way to index the last N commits automatically.

I will talk with the team to see if we can put Searcher on the roadmap for streaming search.

Joelkw commented 3 years ago

@stefanhengl thanks for the response.

How many historical shas...?

I can check with the customer and see if I can determine why the search is unindexed. You can check out the search here; I'm a bit surprised this is the case – it's on what seems like a pretty standard repo, one that I expect is indexed.

In general, I think I need a better understanding of when/why a customer might hit this. On Insights running over Sourcegraph/Sourcegraph, I see many results >1k (attached a quick simple example). Is this just a matter of making sure a customer has all their repos indexed? We know that exhaustive and exact historical search is key to Code Insights.

[image attached]

stefanhengl commented 3 years ago

@Joelkw The limit is per searcher and applies to files, not matches directly. In the simplest case, in which you have 1 match per file and just 1 searcher deployed, we have a hard limit of 1000 matches. The more searchers you deploy and the more matches there are within a file, the more results you will get. You can see this in the queries I linked in my first response. We return 13.2k matches, because many files have more than 1 match and we have 4 searchers deployed.
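As a back-of-the-envelope illustration of the explanation above (the formula is inferred from this comment, not taken from searcher's code): the observable match ceiling scales with the number of searcher replicas and the average matches per capped file.

```go
package main

import "fmt"

// perSearcherFileCap is the hard-coded per-replica file limit
// discussed in this thread.
const perSearcherFileCap = 1000

// effectiveCeiling estimates the most matches a query can return:
// each searcher replica reports up to the file cap, and each capped
// file may contain several matches.
func effectiveCeiling(searchers int, avgMatchesPerFile float64) float64 {
	return float64(searchers) * perSearcherFileCap * avgMatchesPerFile
}

func main() {
	// 4 replicas and ~3.3 matches per file roughly reproduces the
	// 13.2k figure mentioned in the thread.
	fmt.Printf("%.0f\n", effectiveCeiling(4, 3.3)) // 13200
}
```

This is why "deploy more searchers" only raises the ceiling rather than removing it.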

I know this is not a satisfactory answer, but the only permanent fix is ~~not enable~~ enabling searcher for streaming, which is something we have to put on the roadmap. A quick fix would be to deploy more instances of searcher.

The problem is masked whenever you compare indexed and unindexed search for searches with fewer than 1k hits.

Joelkw commented 3 years ago

@stefanhengl thank you for the additional context. Appreciate the explanation.

Does this mean the only fix for the customer is to wait until we don't enable searcher for streaming? Or is there a workaround fix in the meantime (like getting the repo indexed)? Or is deploying more instances of searcher a realistic solution? (It seems like it's not, because it just bumps up the limit, but it's not immediately clear how high the limit would need to bump/how many searchers you need, if I understand you correctly.)

We do need a permanent solution here, because exhaustive search counting over huge result sets (think running over 8k+ repos or more) is a core requirement. How can I help get that prioritized/should I speak to anyone else or write up a doc on our needs here?

stefanhengl commented 3 years ago

I had a typo in my previous comment that completely reversed the meaning (not enable -> enabling). The only permanent fix is not to have a limit, and the only way not to have a limit is to stream back results immediately instead of buffering them in searcher. I will bring it up in the next search sync; we don't have it on our roadmap right now.

stefanhengl commented 3 years ago

One way of moving forward quickly is if code-insights modified searcher. Search-core could assist with design and reviews? WDYT?

Joelkw commented 3 years ago

@stefanhengl Ah, thanks for the update there.

The code insights team would definitely have to move forward with this sooner rather than later since it seems that it blocks the main use case of code insights to "get exact readings of a search result in your code over time."

Unfortunately, the code insights team at the moment is not even a "minimum viable team" (we are actively hiring both 1 frontend and 1 backend engineer), and we have very limited capacity to add anything beyond our lean roadmap, given the size of the team. Let me check with the team and I'll direct them to follow up on this issue (or will follow up here myself) today.

Do you (@stefanhengl) have a sense of how much work/engineering time it would be to modify searcher? (Even a T-shirt size is helpful.)

Joelkw commented 3 years ago

As for how to unblock the customer right now, other than modifying searcher, are there any other actions the customer could take around indexing?

stefanhengl commented 3 years ago

@camdencheek worked on enabling streaming for searcher. @Joelkw Can we close this issue?

keegancsmith commented 3 years ago

@camdencheek if I am not mistaken you have been tackling this / have fixed this. I'm gonna assign over to give ya the pleasure of hitting close. Please assign back if I am mistaken :)

camdencheek commented 3 years ago

@keegancsmith Confirming it should be fixed! Closing with pleasure πŸ˜„