sourcegraph / sourcegraph-public-snapshot


gitserver rebalancing/sharding logic should be smarter #11414

Open slimsag opened 4 years ago

slimsag commented 4 years ago

If you introduce or remove a gitserver replica, the consistent hash on repo name means almost all repositories will be reassigned to another gitserver (example), which has negative consequences.

Example:

You have 10,000 repositories across 3 gitserver instances:

  • gitserver-1 contains repos 0 to 3,333
  • gitserver-2 contains repos 3,333 to 6,666
  • gitserver-3 contains repos 6,666 to 10,000

You introduce a new gitserver-4, something like the following will happen:

  • gitserver-1 now begins cloning repos previously assigned to gitserver-2
  • gitserver-2 now begins cloning repos previously assigned to gitserver-3
  • gitserver-3 now begins cloning repos previously assigned to gitserver-1
  • gitserver-4 now begins cloning 1/4th of the repositories

The load will be even in the end, with each replica holding 1/4th of the repositories, but gitservers 1, 2, and 3 will have had their repositories unavailable for a period of time, because everything got shuffled around and they had to reclone everything. What would be better (and what indexed-search does) is to merely shift 1/4th of the load to the new 4th replica, without the original replicas (effectively) starting from scratch (i.e., they take into account the data they already have).
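To put a rough number on the reshuffling, here is a minimal sketch that assumes the shard is chosen as a hash of the repo name reduced modulo the replica count. That scheme is an assumption for illustration (it matches the behaviour described above, but it is not taken from the gitserver source), and `shard_mod`, `fraction_moved`, and the repo names are hypothetical.

```python
import hashlib


def shard_mod(repo: str, n_shards: int) -> int:
    """Hypothetical hash-then-modulo sharding on the repo name."""
    digest = hashlib.md5(repo.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo N.
    return int.from_bytes(digest[:8], "big") % n_shards


def fraction_moved(repos, old_n: int, new_n: int) -> float:
    """Fraction of repos whose shard index changes when N goes old_n -> new_n."""
    moved = sum(1 for r in repos if shard_mod(r, old_n) != shard_mod(r, new_n))
    return moved / len(repos)


# Made-up repo names, 10,000 of them as in the example above.
repos = [f"github.com/example/repo-{i}" for i in range(10_000)]
moved_fraction = fraction_moved(repos, 3, 4)
print(f"{moved_fraction:.1%} of repos change shard going from 3 to 4 replicas")
```

Under this assumption a repo keeps its shard only when `h % 3 == h % 4`, which holds for 3 of every 12 hash values, so roughly three quarters of the repositories move, whereas an ideal scheme would move only about one quarter (the share handed to the new replica).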

Additionally, if a gitserver replica goes down for an extended period of time, it becomes an outage of that entire subset of repositories, instead of the load rebalancing across the remaining shards.

indexed-search does not have these same issues, because it shards based on the hostname. We should do the same for gitserver, but care must be taken to ensure we respect the existing sharding scheme or migrate it appropriately, so there is no service degradation for instances upgrading to this new scheme.
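One way hostname-based sharding can get this property is rendezvous (highest-random-weight) hashing: score every (hostname, repo) pair and assign the repo to the highest-scoring host. The sketch below is an illustration under that assumption only. It is not claimed to be what indexed-search actually does, and `score`, `shard_for`, and the repo names are hypothetical.

```python
import hashlib


def score(host: str, repo: str) -> int:
    """Deterministic pseudo-random weight for a (host, repo) pair."""
    digest = hashlib.sha256(f"{host}|{repo}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for(repo: str, hosts) -> str:
    """Assign the repo to the host with the highest score."""
    return max(hosts, key=lambda h: score(h, repo))


old_hosts = ["gitserver-1", "gitserver-2", "gitserver-3"]
new_hosts = old_hosts + ["gitserver-4"]
repos = [f"github.com/example/repo-{i}" for i in range(10_000)]

before = {r: shard_for(r, old_hosts) for r in repos}
after = {r: shard_for(r, new_hosts) for r in repos}

# Adding a replica: only repos that the new host wins are reassigned.
moved_on_add = [r for r in repos if before[r] != after[r]]
print(f"added gitserver-4: {len(moved_on_add) / len(repos):.1%} of repos moved")

# Removing a replica: only the repos it owned are spread over the survivors.
survivors = ["gitserver-1", "gitserver-3"]
after_loss = {r: shard_for(r, survivors) for r in repos}
moved_on_loss = [r for r in repos if before[r] != after_loss[r]]
```

With this scheme a repo changes owner only if the new host outscores its current owner, so about 1/4th of the repositories move and all of them move to gitserver-4. When a replica is removed, only the repositories it owned are redistributed. Migrating an existing instance would still require handling repositories placed under the old scheme, which this sketch does not address.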

github-actions[bot] commented 3 years ago

Heads up @tsenart - the "team/cloud" label was applied to this issue.