insights: retroactively update repo names in TimescaleDB

slimsag commented 3 years ago

Since we expect insights to be filtered down by repo name (or by a regex over the repo name, e.g. to find insights for a specific org), the DB stores three fields (dbschema):

The repo_id, which is the ID of the repo (regardless of any renames) as known by the main app DB.
The repo_name string, which is the name the repo had at the time the data point was recorded. We will use this to regexp search for data points with repo_name matching some regexp. The idea is that there would be a background worker which goes through the DB and asks the main app DB (via RepoStore.GetByID()) what the current name of the repo_id is and updates this field retroactively, thus it is possible to query based on the current name of the repo (generally speaking).
The original_repo_name string, which is exactly the same as repo_name except that it will not be retroactively updated. This is useful because you might wish to see that e.g. an insight's data changed substantially as part of a major renaming effort. In this case, some data points would show the old repo name and some data points would show the new repo name (because original_repo_name is the name of the repo at the time the data point was recorded).

Everything described above is implemented, and all the fields described above are being recorded - but the background worker which updates repo_name to match the latest-known name for the repo is not:

The idea is that there would be a background worker which goes through the DB and asks the main app DB (via RepoStore.GetByID()) what the current name of the repo_id is and updates this field retroactively, thus it is possible to query based on the current name of the repo (generally speaking).

We should implement that, or, if we don't care about repo renames, ditch it and just have a single repo_name field.
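
To make the missing piece concrete, here is a minimal in-memory sketch of what such a worker's core loop might look like. It is not the actual implementation: DataPoint, NameLookup, and SyncRepoNames are hypothetical names, the lookup function merely stands in for a call like RepoStore.GetByID(), and a real worker would issue batched UPDATE statements against the insights DB instead of mutating a slice.

```go
package main

// DataPoint mirrors the three fields described above. The struct is
// hypothetical and exists only for illustration.
type DataPoint struct {
	RepoID           int
	RepoName         string // retroactively updated to the current name
	OriginalRepoName string // name at recording time, never updated
	Value            float64
}

// NameLookup stands in for a call such as RepoStore.GetByID() (assumption).
// ok is false when the repo is unknown to the main app DB, e.g. deleted.
type NameLookup func(repoID int) (name string, ok bool)

// SyncRepoNames sets RepoName on every data point to the current name of its
// repo and returns how many points were changed. OriginalRepoName is left
// untouched. Each repo_id is looked up at most once per run.
func SyncRepoNames(points []DataPoint, lookup NameLookup) int {
	type cached struct {
		name string
		ok   bool
	}
	cache := make(map[int]cached)
	updated := 0
	for i := range points {
		p := &points[i]
		c, seen := cache[p.RepoID]
		if !seen {
			name, ok := lookup(p.RepoID)
			c = cached{name, ok}
			cache[p.RepoID] = c
		}
		// Skip repos that can no longer be resolved and points that
		// already carry the current name.
		if c.ok && p.RepoName != c.name {
			p.RepoName = c.name
			updated++
		}
	}
	return updated
}
```

Caching the lookup per repo_id keeps the number of calls to the main app DB proportional to the number of repos, not the number of data points, which matters if the worker scans the whole table on each run.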

github-actions[bot] commented 3 years ago

Heads up @joelkw @felixfbecker - the "team/code-insights" label was applied to this issue.

coury-clark commented 3 years ago

A few notes:

repo_id is designed to stay consistent in the primary Postgres DB, even through repository renames.
Insight data is stored and associated with the repo_id fetched from the primary DB.

Given this, we can imagine a time series T that started with the repo name "first" and was eventually renamed to "second":

| -----(first)------ | =====(second)===== |

In this case, all of the underlying data would still be associated with the same repo_id, but the time series would map to multiple repo_name entries. This is acceptable because when we query by repo_name regex, we match any insight series that contained the name, which would return the data that matched the original repo_name.

With this in mind, we may be able to deprecate the original_repo_name field entirely, without the need to perform updates on those entries.
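
One possible reading of the query behaviour described above, sketched in memory: find every repo_id that ever carried a matching name, then return all data points for those repo_ids, so a rename does not split the series. SeriesPoint and QuerySeriesByRepoName are hypothetical names, and Go's regexp package (RE2 syntax) stands in for whatever regex matching the real DB query would use.

```go
package main

import "regexp"

// SeriesPoint is a hypothetical, trimmed-down data point: the repo_id plus
// the repo name as recorded at the time of the data point.
type SeriesPoint struct {
	RepoID   int
	RepoName string
	Value    float64
}

// QuerySeriesByRepoName returns every data point belonging to a repo_id that
// has at least one data point whose recorded name matches pattern. Points
// are returned in their original order.
func QuerySeriesByRepoName(points []SeriesPoint, pattern string) ([]SeriesPoint, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	// Pass 1: collect repo_ids whose series ever contained a matching name.
	ids := make(map[int]bool)
	for _, p := range points {
		if re.MatchString(p.RepoName) {
			ids[p.RepoID] = true
		}
	}
	// Pass 2: return the full series for those repo_ids, whatever the name
	// was when each point was recorded.
	var out []SeriesPoint
	for _, p := range points {
		if ids[p.RepoID] {
			out = append(out, p)
		}
	}
	return out, nil
}
```

Under this reading, a pattern matching only the old name and a pattern matching only the new name both return the whole series, because the grouping key is repo_id and not the name string.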

sourcegraph / sourcegraph-public-snapshot

insights: retroactively update repo names in TimescaleDB #19196