sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com

How do we know when to scale gitserver on sourcegraph.com? #9357

Closed · slimsag closed this issue 3 years ago

slimsag commented 4 years ago

Currently, gitserver on sourcegraph.com sits at only ~10% free disk space on all instances, and thus older repositories are being deleted. Slack thread

It is not clear exactly when we should scale gitserver on sourcegraph.com to accommodate more repositories. Geoffrey suggests:

Some things that come to mind:

  1. How does this manifest to the user?
  2. What percentage of searches/etc. does this affect?
  3. Has this percentage gone up / down recently?
  4. Has our sourcegraph.com traffic noticeably increased?
  5. How often are these older repos (that are being deleted) accessed?

Maybe we should define some metric / distribution that governs when we should scale our storage (e.g. we want to guarantee that x percentage of repos accessed with such and such frequency are always stored on sourcegraph.com)

These are good starting points, but:

  1. Can only be answered by a human
  2. Can only feasibly be answered by a human (unless we want a blanket, flaky "fire an alert if both searches are going slow and gitserver disk space < 15%"; see the sketch after this list)
  3. Cannot reasonably be acted on except by a human
  4. Cannot reasonably be acted on except by a human
  5. Such a metric does not exist today, and it is unclear to me whether it alone would be sufficient.
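
To make that concrete, here is a minimal sketch of what such a blanket composite check could look like as plain code. The thresholds and signal names are placeholders for illustration, not an existing alert; the real inputs would come from whatever we already export for search latency and gitserver disk space.

```go
// Package alertsketch sketches the "blanket" composite check mentioned above.
package alertsketch

// shouldPageForGitserverScaling returns true only when searches are already
// slow AND gitserver is low on disk, so that neither signal alone pages us.
// Both thresholds are placeholders, not tuned values.
func shouldPageForGitserverScaling(searchP90Seconds, minGitserverFreeDiskPct float64) bool {
	const (
		slowSearchP90Seconds = 5.0  // placeholder for "searches are going slow"
		lowFreeDiskPct       = 15.0 // the 15% free-disk figure from the item above
	)
	return searchP90Seconds > slowSearchP90Seconds && minGitserverFreeDiskPct < lowFreeDiskPct
}
```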

In other words, how do we alert ourselves to when action is needed here? Should we alert on this for sourcegraph.com at all?

cc @keegancsmith @sourcegraph/distribution for thoughts

ggilmore commented 4 years ago

To be clear, I didn't mean for those questions in and of themselves to become new alerts; those are just the questions I'd want to answer before deciding what to do next.

slimsag commented 4 years ago

Definitely; I understood that, and I'm in full agreement with you on those points as well. I was just trying to communicate the complexity involved in figuring out how we could automate alerting for this. Sorry for being confusing in my issue description :)

keegancsmith commented 4 years ago

My intuition tells me to just make our disk cleaner trigger at 15% rather than adjusting the alert or the number of replicas. The long tail of repos we have cloned is likely not interacted with every day.

keegancsmith commented 4 years ago

I've bumped the desired percent free to 15% in https://github.com/sourcegraph/deploy-sourcegraph-dot-com/commit/e0dedf2562c853789bdf9511839e4d79e8a7a1b9
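
For context on what that change does mechanically: the cleanup job evicts repos until free space reaches the desired percentage, so raising the target from 10% to 15% just makes eviction start earlier and go further. A rough sketch of the arithmetic (an illustration of the idea, not Sourcegraph's actual janitor code):

```go
package cleanupsketch

// bytesToFree returns how many bytes a cleanup job would need to evict for
// free disk space to reach desiredPercentFree. Raising desiredPercentFree
// from 10 to 15 makes this number larger, so old repos get purged sooner.
func bytesToFree(diskSizeBytes, actualFreeBytes, desiredPercentFree uint64) uint64 {
	desiredFreeBytes := diskSizeBytes / 100 * desiredPercentFree
	if actualFreeBytes >= desiredFreeBytes {
		return 0 // already above the target; nothing to evict
	}
	return desiredFreeBytes - actualFreeBytes
}
```

For example, on a 1 TiB disk with 120 GiB free, a 10% target asks for nothing, while a 15% target asks the janitor to free roughly 34 GiB.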

Is there any follow-up for tuning other environments? I.e., do customers also have this alert?

slimsag commented 4 years ago

@keegancsmith I don't understand how this actually helps. Maybe I have missed something here?

From what I see, you've just made it so that the alert will never fire on sourcegraph.com, which is the same as if I had removed the alert from sourcegraph.com altogether (but a cool approach!), and we end up in the same end state: there is no alert for when we actually do need to scale gitserver on sourcegraph.com.

creachadair commented 4 years ago

If I understand correctly, the case you want to detect here is when the principal component of a sufficiently large aggregate search latency increase (i.e., not per-query) is in gitserver, is that right?

Could we get at this by deriving a time series for the fraction of search query latency that is due to gitserver? That should, I think, let us see the case where an abnormally large number of search requests forces gitserver to fault the target repositories into the cache: when aggregate search query latency increases beyond a threshold and the aggregate gitserver fraction has increased sufficiently, we are resource-constrained by fetches.

The rate of aggregation is probably important to avoid abusive query patterns, but I think that could at least give us a baseline metric to observe.
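
One cheap way to prototype that before committing to a recording rule or dashboard is to just query Prometheus for the ratio. The metric names below are hypothetical stand-ins for whatever we actually export for search and gitserver durations, and the Prometheus address is made up; treat this as a sketch of the shape of the query, not a working alert.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Hypothetical in-cluster Prometheus address.
	client, err := api.NewClient(api.Config{Address: "http://prometheus:30090"})
	if err != nil {
		panic(err)
	}
	papi := promv1.NewAPI(client)

	// Fraction of aggregate search time that is spent waiting on gitserver,
	// over the last 10 minutes. Both metric names are assumptions.
	const gitserverFraction = `
  sum(rate(src_gitserver_client_duration_seconds_sum[10m]))
/
  sum(rate(src_search_request_duration_seconds_sum[10m]))`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := papi.Query(ctx, gitserverFraction, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// When this fraction rises together with overall search latency, the
	// hypothesis above says we are resource-constrained by fetches.
	fmt.Println("gitserver fraction of search latency:", result)
}
```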

keegancsmith commented 4 years ago

Yes, sorry, I totally missed the mark on this. This is hard to determine, since I assume any changes here would be gradual. I think what @creachadair said makes a lot of sense, but it may be quite hard to measure properly.

My intuition here is that there is a very long tail of repos which we clone and which are only accessed for a few minutes, then never accessed again. Additionally, repos that are often accessed are also often updated (so they don't get purged). On top of that, the repos that are expensive to clone are large => often updated => don't get purged. These are just my intuitions; I have not measured this (it would probably be good for us to validate it).

I think some log analysis of our most (re)cloned repos would tell us if we are doing something wrong. I have actually lost track of the state of our logging on sourcegraph.com, but if we have, say, 3 months of gitserver logs we can grep, that would be great. That would tell us whether we are in a good place today, but it won't help us proactively discover when to increase gitserver's disk capacity. That is much harder, and will likely need an approach like the one suggested above.
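
To sketch what that log analysis could look like, assuming we can pipe gitserver logs in on stdin and that clone events contain something like `cloning repo=<name>` (the exact log format is an assumption and would need adjusting):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"sort"
)

func main() {
	// Assumed log line shape: anything containing `cloning repo=<name>`.
	// The real gitserver log format may differ; adjust the regexp to match.
	cloneLine := regexp.MustCompile(`cloning\s+repo=(\S+)`)

	counts := map[string]int{}
	scanner := bufio.NewScanner(os.Stdin) // e.g. pipe `kubectl logs` output in
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024)
	for scanner.Scan() {
		if m := cloneLine.FindStringSubmatch(scanner.Text()); m != nil {
			counts[m[1]]++
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}

	// Repos cloned more than once in the window are candidates for "we purged
	// something we should have kept".
	type repoCount struct {
		repo string
		n    int
	}
	var recloned []repoCount
	for repo, n := range counts {
		if n > 1 {
			recloned = append(recloned, repoCount{repo, n})
		}
	}
	sort.Slice(recloned, func(i, j int) bool { return recloned[i].n > recloned[j].n })
	for _, rc := range recloned {
		fmt.Printf("%6d  %s\n", rc.n, rc.repo)
	}
}
```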

keegancsmith commented 4 years ago

Rather than let this discussion just die off and have this issue open forever, what is next? It isn't clear to me what next steps we would take on this. I'm inclined to close this issue unless we have that.

creachadair commented 4 years ago

> Rather than let this discussion just die off and have this issue open forever, what is next? It isn't clear to me what next steps we would take on this. I'm inclined to close this issue unless we have that.

At this point I think the only path forward is to do some experimentation: if we have enough trace instrumentation to tease apart how much of a query's time is spent fetching, and if we have (or can create) a log of when we've had to bump up resources on gitserver nodes, we could graph the data and see if anything pops out. If not, then adding those things seems like the logical next step.

slimsag commented 4 years ago

> how much of a query's time is spent fetching

I don't think this is possible: search queries do not actually block on repositories cloning; they simply omit search results for repositories that are still cloning and report that to the user via "N repositories missing".

We do track the clone rate of repositories, but it's not clear to me how we can use that to inform any obvious action. For example:

[graph: repository clone rate over time]

This shows we are regularly cloning 1 repository every 5-10s. But does that mean we should or shouldn't increase gitserver replicas?

Perhaps a more interesting angle here is going to be using gitserver's own signals of high load to determine whether or not it should be scaled, i.e. concurrent command executions being high, echo taking unexpectedly long, and search performance being poor across the board. This means we must solve those issues first.
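
If we do go down that road, those load signals have to be exported somewhere we can alert on them. A minimal sketch of what instrumenting them with the Prometheus Go client could look like (metric names are placeholders; gitserver may already export equivalents under different names):

```go
package main

import (
	"net/http"
	"os/exec"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Placeholder metric names for the two load signals discussed above.
	concurrentExecs = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "gitserver_sketch_concurrent_command_executions",
		Help: "Commands currently running against local clones.",
	})
	echoDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "gitserver_sketch_echo_duration_seconds",
		Help:    "How long a trivial echo takes; a proxy for host overload.",
		Buckets: prometheus.DefBuckets,
	})
)

// runCommand wraps command execution so the concurrency gauge gets recorded.
func runCommand(cmd *exec.Cmd) error {
	concurrentExecs.Inc()
	defer concurrentExecs.Dec()
	return cmd.Run()
}

func main() {
	go func() {
		for range time.Tick(10 * time.Second) {
			start := time.Now()
			_ = runCommand(exec.Command("echo")) // trivial command as a load probe
			echoDuration.Observe(time.Since(start).Seconds())
		}
	}()
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":2112", nil)
}
```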

Next steps:

The following issues on Sourcegraph.com must be addressed first so we can gauge whether or not Sourcegraph.com is actually healthy right now:

Once those are resolved, @slimsag will look at the before-and-after metrics and see if there is a correlation between all three that indicates we should scale gitserver on .com as a special metric.

keegancsmith commented 4 years ago

> Next steps:
>
> The following issues on Sourcegraph.com must be addressed first so we can gauge whether or not Sourcegraph.com is actually healthy right now:
>
> Once those are resolved, @slimsag will look at the before-and-after metrics and see if there is a correlation between all three that indicates we should scale gitserver on .com as a special metric.

So the clearest next step, then, is finding an owner for those issues. I'm going to be narrowly focused on indexing multiple branches to help unblock shipping search contexts in 3.16. I haven't spent as much time as I'd like on those existing issues, but someone should. Do we have any volunteers, or do we punt on this until 3.17?

slimsag commented 4 years ago

https://github.com/sourcegraph/sourcegraph/issues/9355 and https://github.com/sourcegraph/sourcegraph/issues/9926 appear to at least be indicators of when the situation has gotten into a bad state (i.e. where a new gitserver is definitely needed), see https://sourcegraph.slack.com/archives/CMBA8F926/p1593116265154600