Closed tsenart closed 8 months ago
Heads up @jjeffwarner - the "team/search-core" label was applied to this issue.
For some vague tips on storing large data in Postgres (@sourcegraph/code-intel's lsif_references table is an order of magnitude larger at its current scale without problems): SELECT COUNT(*) FROM lsif_references. We have tricks related to keeping materialized counts when aggregation is needed.
FYI @rvantonder @camdencheek
How will we deal with "dynamic" refs? For example, HEAD~1 or refs/heads/cc/*. Both of these require git-specific logic to handle. We currently support both of these because we pass the ref straight to git. It's also not clear to me whether we can detect whether a ref is "dynamic" ahead of time, but that might not be a big deal if we just fall back to asking gitserver if the ref doesn't exist in the db.
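The fallback idea above can be sketched roughly as follows. This is only an illustration: `resolve_revision`, the in-memory `index` dictionary (standing in for the proposed Postgres table), and the `ask_gitserver` callback are hypothetical names, not existing Sourcegraph APIs.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical stand-in for the proposed refs table:
# (repo name, ref name) -> commit SHA.
RefIndex = Dict[Tuple[str, str], str]


def resolve_revision(
    repo: str,
    rev: str,
    index: RefIndex,
    ask_gitserver: Callable[[str, str], Optional[str]],
) -> Optional[str]:
    """Resolve rev to a commit SHA, preferring the index over gitserver."""
    sha = index.get((repo, rev))
    if sha is not None:
        # Fast path: the ref is stored, so no gitserver request is needed.
        return sha
    # Slow path: anything not stored (HEAD~1, raw SHAs, unknown names)
    # is handed to gitserver, without classifying it as "dynamic" first.
    return ask_gitserver(repo, rev)
```

With this shape, there is no need to detect dynamic refs ahead of time; they simply miss the index and take the slow path.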
How will we deal with "dynamic" refs? For example, HEAD~1 or refs/heads/cc/*
We would have to fall back to asking gitserver to resolve dynamic refs like HEAD~1. Globs, however, we can support with a database lookup: we just need to convert a glob to a regex and have an appropriate index.
For specific commits we would also fall back to asking gitserver, and could return an error if the repo filters match more than one repo.
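A minimal sketch of the glob-to-regex conversion mentioned above. The helper name `glob_to_regex` is hypothetical, and the sketch assumes only `*` and `?` are special and that `*` may match across `/`; git's own ref matching rules are more detailed than this.

```python
import re


def glob_to_regex(glob: str) -> str:
    """Convert a simple ref glob into an anchored regular expression."""
    parts = []
    for ch in glob:
        if ch == "*":
            parts.append(".*")  # assumption: * also matches "/"
        elif ch == "?":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return "^" + "".join(parts) + "$"


pattern = re.compile(glob_to_regex("refs/heads/cc/*"))
print(bool(pattern.match("refs/heads/cc/feature")))
print(bool(pattern.match("refs/heads/main")))
```

The resulting pattern could then be passed to a regex match against the stored ref column, which is where the "appropriate index" would matter.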
Nice. I like it. Looking forward to a faster ResolveRevision 😄
Any thoughts on how to best handle partial ref names? My understanding is, when you do something like git checkout my-branch, git automatically prepends that with refs/heads/ to make it refs/heads/my-branch. Would we store the full refs/heads/my-branch in the ref column, or just my-branch? I guess if we store refs/heads/my-branch, then if users search for the regex pattern my-branch, it would still be returned, and they can search ^refs/heads/my-branch$ for the fully-qualified form?
Just thinking out loud here because a question came up about searching branches by regex and I think this might solve that issue tidily.
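To illustrate the point about storing fully qualified names, here is a small sketch with made-up ref names and a hypothetical `search_refs` helper. An unanchored pattern still finds the branch by its short name, while the anchored pattern selects exactly one ref.

```python
import re
from typing import List

# Made-up example data: what the ref column might contain for one repo.
stored_refs = [
    "refs/heads/my-branch",
    "refs/heads/my-branch-2",
    "refs/tags/my-branch",
]


def search_refs(pattern: str, refs: List[str]) -> List[str]:
    """Return the refs matched anywhere by the regex pattern."""
    rx = re.compile(pattern)
    return [ref for ref in refs if rx.search(ref)]


# Unanchored: matches every ref that contains "my-branch".
print(search_refs("my-branch", stored_refs))
# Anchored: matches only the fully qualified branch.
print(search_refs("^refs/heads/my-branch$", stored_refs))
```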
Yep, we could store the full path and add appropriate indexes.
By the way, I don’t have an ETA to work on this right now since it doesn’t really fit with Q4’s monorepo focus for search core.
This issue has been inactive for a long time. To reopen the ticket, please let us know how to reproduce the issue on latest main. For feature requests, please let us know what is still missing.
Context
Whenever searching non-default revs (i.e. anything other than rev:HEAD) over many repositories, we need to ask gitserver if the given revision is valid for that repo. When searching over a large repo subset, this incurs a big latency cost, especially before #28475 is implemented. Even after it is implemented, when searching over non-indexed revisions, we'd still incur this cost, because we wouldn't be able to do a global search (i.e. zoekt only).
Additionally, the repohascommitafter: filter is currently also broken at scale in the same way: 1 request per rev per repo to gitserver.
To speed all of this up, we can index each repository's refs in Postgres and pass along the rev and repohascommitafter args to the database query that resolves repositories.
A rough schema such as this would be a good start:
With the current set of repos on sourcegraph.com, this would become our largest table in the frontend db, with roughly 80M rows. Validating performance and experimenting with different schemas in a clone of the production DB would be the first step in this project.
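As one way to start experimenting, the sketch below shows what a refs table and a repo-resolving query could look like. Everything here is an assumption for illustration: the table name `repo_refs`, all columns other than `ref`, and the helper `repos_with_ref` are hypothetical, and SQLite from the Python standard library stands in for Postgres. The `committed_after` filter only approximates repohascommitafter by comparing against the date of each ref's tip commit.

```python
import sqlite3
from typing import List, Optional

# Hypothetical schema: one row per (repo, ref).
SCHEMA = """
CREATE TABLE repo_refs (
    repo_id      INTEGER NOT NULL,
    ref          TEXT    NOT NULL,  -- fully qualified, e.g. refs/heads/main
    commit_sha   TEXT    NOT NULL,
    committed_at TEXT    NOT NULL,  -- ISO 8601 time of the ref's tip commit
    PRIMARY KEY (repo_id, ref)
);
CREATE INDEX repo_refs_ref ON repo_refs (ref);
"""


def repos_with_ref(
    conn: sqlite3.Connection, ref: str, committed_after: Optional[str] = None
) -> List[int]:
    """Return IDs of repos that have ref, optionally with a tip newer than a date."""
    sql = "SELECT repo_id FROM repo_refs WHERE ref = ?"
    args = [ref]
    if committed_after is not None:
        # Same-format ISO 8601 strings compare correctly as text.
        sql += " AND committed_at > ?"
        args.append(committed_after)
    sql += " ORDER BY repo_id"
    return [row[0] for row in conn.execute(sql, args)]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO repo_refs VALUES (?, ?, ?, ?)",
    [
        (1, "refs/heads/main", "a1", "2021-11-01T00:00:00Z"),
        (2, "refs/heads/main", "b2", "2021-06-01T00:00:00Z"),
        (2, "refs/heads/dev", "c3", "2021-12-01T00:00:00Z"),
    ],
)
print(repos_with_ref(conn, "refs/heads/main", "2021-10-01T00:00:00Z"))
```

A single query like this replaces one gitserver request per rev per repo, which is the cost the proposal aims to remove; whether it performs well at roughly 80M rows is exactly what the validation step would need to establish.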