search: Index git refs to speed-up rev validation and filtering

tsenart commented 2 years ago

Context

Whenever searching non-default revs (i.e. anything other than rev:HEAD) over many repositories, we need to ask gitserver if the given revision is valid for that repo. When searching over a large repo subset, this incurs a big latency cost, especially before #28475 is implemented. Even after it is implemented, when searching over non-indexed revisions, we'd still incur this cost, because we wouldn't be able to do a global search (i.e. zoekt only).

Additionally, the repohascommitafter: filter is currently also broken at scale in the same way — 1 request per rev per repo to gitserver.

To speed all of this up, we can index each repository's refs in Postgres and pass along the rev and repohascommitafter args to the database query that resolves repositories.

A rough schema such as this would be a good start:

CREATE TABLE gitserver_refs (
    repo_id integer REFERENCES repo(id) NOT NULL,
    ref text NOT NULL,                 -- ref name
    commit_id text NOT NULL,           -- commit the ref points to
    commit_date timestamptz NOT NULL   -- date of that commit
);

With the current set of repos on sourcegraph.com, this would become our largest table in the frontend db, with roughly 80M rows. Validating performance and experimenting with different schemas in a clone of the production DB would be the first step in this project.
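A minimal sketch of how such a lookup could behave, using an in-memory SQLite database in place of Postgres so that it runs standalone. The helper name `repos_with_rev`, the sample rows, and the use of ISO 8601 strings instead of `timestamptz` are illustrative assumptions, not part of the proposal.

```python
import sqlite3

# SQLite stands in for Postgres here; the foreign key to repo(id) is omitted.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE gitserver_refs (
        repo_id     integer NOT NULL,
        ref         text    NOT NULL,
        commit_id   text    NOT NULL,
        commit_date text    NOT NULL  -- ISO 8601 UTC string, stand-in for timestamptz
    )
    """
)
conn.executemany(
    "INSERT INTO gitserver_refs VALUES (?, ?, ?, ?)",
    [
        (1, "refs/heads/main", "a1", "2021-11-30T00:00:00Z"),
        (1, "refs/heads/dev", "b2", "2021-12-02T00:00:00Z"),
        (2, "refs/heads/main", "c3", "2021-10-01T00:00:00Z"),
        (3, "refs/heads/dev", "d4", "2021-12-03T00:00:00Z"),
    ],
)


def repos_with_rev(conn, ref, committed_after=None):
    """Return ids of repos that have `ref`, optionally only if its commit is newer than a date."""
    sql = "SELECT DISTINCT repo_id FROM gitserver_refs WHERE ref = ?"
    args = [ref]
    if committed_after is not None:
        # Same-format ISO 8601 strings compare correctly as plain text.
        sql += " AND commit_date > ?"
        args.append(committed_after)
    sql += " ORDER BY repo_id"
    return [row[0] for row in conn.execute(sql, args)]
```

With the sample rows, `repos_with_rev(conn, "refs/heads/dev")` selects repos 1 and 3 in a single query, where the current approach would need one gitserver request per repo.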

github-actions[bot] commented 2 years ago

Heads up @jjeffwarner - the "team/search-core" label was applied to this issue.

efritz commented 2 years ago

For some vague tips on storing large data in Postgres (@sourcegraph/code-intel's lsif_references table is an order of magnitude larger at its current scale without problems):

Keep the number of indexes as small as possible (but including/covering fields is fine).
Do mass updates when inserting or updating. Do not do a bulk delete+insert, but instead do an insert into a temporary table and update/insert/delete the exact set of tuples that need to change. This won't put unnecessary pressure on the autovacuum daemon for that table.
Don't try to do unfiltered counts or aggregates. It takes 45 minutes to do a SELECT COUNT(*) FROM lsif_references. We have tricks related to keeping materialized counts when aggregation is needed.
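To illustrate the second tip, here is a rough sketch in plain Python of the diff that the temporary-table approach would compute for one repo's refs. In Postgres the same comparison would be done in SQL against the temporary table; the function name `diff_refs` and the row shape are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical row payload for one ref: (commit_id, commit_date).
RefRow = Tuple[str, str]


def diff_refs(
    current: Dict[str, RefRow], desired: Dict[str, RefRow]
) -> Tuple[Dict[str, RefRow], Dict[str, RefRow], List[str]]:
    """Compute the minimal set of writes that turns `current` into `desired`.

    Returns (to_insert, to_update, to_delete). Refs that are identical in both
    mappings appear in none of the three, so they are never rewritten.
    """
    to_insert = {ref: row for ref, row in desired.items() if ref not in current}
    to_update = {
        ref: row
        for ref, row in desired.items()
        if ref in current and current[ref] != row
    }
    to_delete = sorted(ref for ref in current if ref not in desired)
    return to_insert, to_update, to_delete
```

Only the tuples in the three returned sets are touched, so unchanged refs produce no dead rows for autovacuum to clean up.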

tsenart commented 2 years ago

FYI @rvantonder @camdencheek

camdencheek commented 2 years ago

How will we deal with "dynamic" refs? For example, HEAD~1 or refs/heads/cc/*. Both of these require git-specific logic to handle. We currently support both of these because we pass the ref straight to git. It's also not clear to me whether we can detect whether a ref is "dynamic" ahead of time, but that might not be a big deal if we just fall back to asking gitserver if the ref doesn't exist in the db.

tsenart commented 2 years ago

How will we deal with "dynamic" refs? For example, HEAD~1 or refs/heads/cc/*

We would have to fall back to asking gitserver to resolve dynamic refs like HEAD~1. Globs, however, we can support with a database lookup: we just need to convert a glob to a regex and have an appropriate index.

For specific commits we would also fall back to asking gitserver, and could return an error if the repo filters match more than one repo.
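A minimal sketch of the glob-to-regex conversion, assuming only `*` and `?` are treated as wildcards and that `*` may match across `/`. Git's own glob rules are richer than this, and the helper name `glob_to_regex` is hypothetical.

```python
import re


def glob_to_regex(glob: str) -> str:
    """Translate a simple ref glob into an anchored regular expression."""
    parts = []
    for ch in glob:
        if ch == "*":
            parts.append(".*")  # any run of characters, including "/"
        elif ch == "?":
            parts.append(".")  # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    return "^" + "".join(parts) + "$"
```

The resulting pattern could then be passed to a regex match on the ref column (for example with the Postgres `~` operator), so `refs/heads/cc/*` would select `refs/heads/cc/feature` but not `refs/heads/main`.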

camdencheek commented 2 years ago

Nice. I like it. Looking forward to a faster ResolveRevision 😄

camdencheek commented 2 years ago

Any thoughts on how to best handle partial ref names? My understanding is, when you do something like git checkout my-branch, git automatically prepends that with refs/heads/ to make it refs/heads/my-branch. Would we store the full refs/heads/my-branch in the ref column, or just my-branch? I guess if we store refs/heads/my-branch, then if users search for the regex pattern my-branch, it would still be returned, and they can search ^refs/heads/my-branch$ for the fully-qualified form?

Just thinking out loud here because a question came up about searching branches by regex and I think this might solve that issue tidily.
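As a small illustration of that idea, assuming fully-qualified names are stored and user patterns are applied as unanchored regexes. The helper `search_refs` and the sample ref names are made up for this sketch.

```python
import re
from typing import List

# Hypothetical contents of the ref column for a single repo.
STORED_REFS = [
    "refs/heads/my-branch",
    "refs/tags/my-branch-v1",
    "refs/heads/main",
]


def search_refs(pattern: str, refs: List[str]) -> List[str]:
    """Return the stored refs that the (unanchored) regex pattern matches."""
    rx = re.compile(pattern)
    return [ref for ref in refs if rx.search(ref)]
```

Here `search_refs("my-branch", STORED_REFS)` returns both the branch and the similarly named tag, while `search_refs("^refs/heads/my-branch$", STORED_REFS)` returns only the branch.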

tsenart commented 2 years ago

Yep, we could store the full path and add appropriate indexes.

By the way, I don’t have an ETA to work on this right now since it doesn’t really fit with Q4's monorepo focus for search core.


stefanhengl commented 8 months ago

This issue has been inactive for a long time. To reopen the ticket, please let us know how to reproduce the issue on latest main. For feature requests, please let us know what is still missing.

sourcegraph / sourcegraph-public-snapshot

search: Index git refs to speed-up rev validation and filtering #28476
