If you introduce or remove a gitserver replica, the consistent hash on repo name means almost all repositories will be reassigned to another gitserver (see the example below), which has negative consequences like:
Most repositories will be recloned from the code host.
Most searches will remain fast (no re-indexing will be needed), but search results may load a bit slowly while repositories are recloning.
Unindexed searches (non-master branches, commit/diff search, etc.) may be slower while repositories reclone.
Users visiting repositories directly on Sourcegraph may be prompted to wait a few seconds while the repository reclones.
Example:
You have 10,000 repositories across 3 gitserver instances:
gitserver-1 contains repos 0 to 3,333
gitserver-2 contains repos 3,333 to 6,666
gitserver-3 contains repos 6,666 to 10,000
When you introduce a new gitserver-4, something like the following will happen:
gitserver-1 now begins cloning repos previously assigned to gitserver-2
gitserver-2 now begins cloning repos previously assigned to gitserver-3
gitserver-3 now begins cloning repos previously assigned to gitserver-1
gitserver-4 now begins cloning 1/4 of the repositories
The load will be even in the end, with each replica holding 1/4 of the repositories, but gitservers 1, 2, and 3 had their repositories unavailable for a period of time because everything was shuffled around and they had to reclone everything. It would be better (and this is what indexed-search does) to shift only 1/4 of the load to the new 4th replica, so that the original replicas take into account the data they already have instead of (effectively) starting from scratch.
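The reshuffling in the example above can be reproduced with a small simulation. This is a minimal sketch that assumes the assignment is computed as a hash of the repo name modulo the number of replicas; the function names, the MD5 choice, and the repo names are hypothetical and are not taken from gitserver's code.

```python
import hashlib


def shard_mod(repo: str, replicas: int) -> int:
    """Assumed scheme: hash the repo name, then take it modulo the replica count."""
    digest = hashlib.md5(repo.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % replicas


def fraction_moved(repos, before: int, after: int) -> float:
    """Fraction of repos whose assigned replica index changes when the count changes."""
    moved = sum(1 for r in repos if shard_mod(r, before) != shard_mod(r, after))
    return moved / len(repos)


# Hypothetical repo names, matching the 10,000-repository example.
repos = [f"github.com/example/repo-{i}" for i in range(10_000)]
moved = fraction_moved(repos, 3, 4)
print(f"{moved:.0%} of repositories change gitserver when going from 3 to 4 replicas")
```

Under this assumption a repo keeps its replica only when its hash gives the same remainder modulo 3 and modulo 4, which holds for about a quarter of all hashes, so roughly three quarters of the repositories are reassigned even though the new replica only needs a quarter of them.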
Additionally, if a gitserver replica goes down for an extended period of time, it becomes an outage for that entire subset of repositories, instead of the load rebalancing across the remaining shards.
indexed-search does not have these same issues, because it shards based on the hostname. We should do the same for gitserver, but care must be taken to respect the existing sharding scheme or migrate it appropriately, so that there is no service degradation for instances upgrading to the new scheme.
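One way to shard based on the hostname is rendezvous (highest random weight) hashing: score every (hostname, repo) pair and assign the repo to the host with the highest score. The sketch below illustrates that technique only; it is not taken from indexed-search, and the function names, SHA-256 choice, and repo names are assumptions.

```python
import hashlib


def score(host: str, repo: str) -> int:
    """Pseudo-random weight for a (hostname, repo) pair."""
    digest = hashlib.sha256(f"{host}\x00{repo}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_host(repo: str, hosts: list[str]) -> str:
    """Assign the repo to the host with the highest weight for it."""
    return max(hosts, key=lambda h: score(h, repo))


# Hypothetical repo names and replica hostnames.
repos = [f"github.com/example/repo-{i}" for i in range(10_000)]
three = ["gitserver-1", "gitserver-2", "gitserver-3"]
four = three + ["gitserver-4"]

before = {r: pick_host(r, three) for r in repos}
after = {r: pick_host(r, four) for r in repos}

# Adding a host cannot change the ordering among the existing hosts,
# so a repo either stays where it was or moves to the new host.
moved = [r for r in repos if before[r] != after[r]]
print(f"{len(moved) / len(repos):.0%} of repositories moved, all to gitserver-4")
```

With this scheme, adding gitserver-4 moves only about a quarter of the repositories, and removing a replica reassigns only the repositories that replica held. A migration would still be needed for existing instances, because the new assignment differs from the current one.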