sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com
Other
10.1k stars 1.28k forks source link

Searching for code hosts that require SSH is slow #19419

Closed eseliger closed 2 months ago

eseliger commented 3 years ago

Currently, this store method in batch changes requires to be able to read the external_service_repos.clone_url to determine whether a repo is cloned using SSH. Since that column may be encrypted in the future, a text search over it is not future-proof. Also, it is slow on Sourcegraph.com. Adding a separate column to the repo table that stores a boolean whether it uses SSH should help.

tsenart commented 3 years ago

@eseliger: I wouldn't store an extra column in the repo table, but rather normalize certain parts of the external_service.config field into separate columns (i.e. url, in this case).

eseliger commented 3 years ago

Is there a delay between an update to the external_services.config field and the URL being used in gitserver? If so, how long is it usually? Or would gitserver also read the URL from the DB? And if so, it probably currently reads the external_service_repos.clone_url field? Is that updated atomically when external_services.config is updated?

Also, what happens currently when a repo is contained in 3 external services, one specifies https, one ssh, and one no value? 😬

tsenart commented 3 years ago

Is there a delay between an update to the external_services.config field and the URL being used in gitserver?

Currently, yes. The syncer needs to run and update the external_service_repos.clone_url column.

If so, how long is it usually?

There's a default in site config, but it's configurable by site admins. repoListUpdateInterval

Or would gitserver also read the URL from the DB? And if so, it probably currently reads the external_service_repos.clone_url field?

gitserver already reads the clone URL from the DB, but it currently comes from the external_service_repos.clone_url column. #19420 would change that to read it from the external_services table and construct it at read time with the token.

Is that updated atomically when external_services.config is updated?

Yes, see above, by the syncer (it takes a while).

Also, what happens currently when a repo is contained in 3 external services, one specifies https, one ssh, and one no value? 😬

We currently pick a random one in gitserver, but we could easily change this logic to incorporate some heuristics. What do you think those should be?

https://github.com/sourcegraph/sourcegraph/blob/3102604f8152882c7361f93cefe1839e5f2567e0/cmd/gitserver/main.go#L87

eseliger commented 3 years ago

Update: this has been worked around and now the deprecation won't break this query anymore. We likely still want to improve the performance of this query, but I think for reasonably large customer instances it's not too bad. (Was 13s on dotcom if I remember correctly, where we have a few million repos).

tsenart commented 3 years ago

May be worth adding an index to the clone_url column then.

bahrmichael commented 2 months ago

@eseliger Is this still relevant?

eseliger commented 2 months ago

probably but it hasn't fallen on our feet in forever, so let's close it 🙂