sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com

Prevent services from running out of ephemeral storage space and being evicted #9604

Closed (beyang closed this issue 3 years ago)

beyang commented 4 years ago

Set ephemeral storage resource requests and limits in deploy-sourcegraph. A customer noticed their sourcegraph-frontend pods were getting evicted because the node ran out of disk space. sourcegraph-frontend currently requests 0 ephemeral storage and was using ~50 MB; on k8s.sgdev.org it uses ~600 MB.
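For reference, a minimal sketch of what such requests / limits could look like on the sourcegraph-frontend deployment; the container name and the values are illustrative assumptions, not copied from the deploy-sourcegraph manifests:

```yaml
# Fragment of a Deployment spec; values are illustrative and would need tuning per deployment.
spec:
  template:
    spec:
      containers:
        - name: frontend
          resources:
            requests:
              ephemeral-storage: "512Mi" # same ballpark as the ~600 MB observed on k8s.sgdev.org
            limits:
              ephemeral-storage: "2Gi"   # the kubelet evicts the pod once usage exceeds this
```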

uwedeportivo commented 4 years ago

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.15 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly. When in doubt, reach out!

Thank you

uwedeportivo commented 4 years ago

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.16 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly. When in doubt, reach out!

Thank you

beyang commented 4 years ago

@ggilmore please make this a top priority in 3.17 as this just caused a sitewide outage at an important customer (https://sourcegraph.slack.com/archives/CTGBWCKM0/p1590044563076400)

ggilmore commented 4 years ago

It's difficult to have a one-size-fits-all solution for this using ephemeral storage limits / requests alone. Ephemeral storage covers both logs and all scratch space used by the container. My educated guess is that https://github.com/sourcegraph/sourcegraph/issues/8308 is the main culprit here (although this affects every service that needs a local cache for code, e.g. searcher, symbols, lang-go, etc.). It's highly likely that the amount of scratch space each service uses varies with instance size and traffic patterns.

If we pick limits / requests that are too small, we run the risk of service degradation due to constant pod evictions (this can happen if frontend needs to clone a large repository, say more than 1 GB). If we pick limits / requests that are too large, we are in effect enforcing minimum disk space requirements on each node, and node disk size may not be something we have any control over.

In addition, there doesn't seem to be any way to get a real-time reading of the current ephemeral storage usage for a given container.

I think that fixing the unbounded cache issue (tracked in https://github.com/sourcegraph/sourcegraph/issues/8308) for each of these services (perhaps through a shared package) is the correct long-term solution.
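As a rough illustration of what such a shared bounded-cache package could do (a hypothetical sketch, not Sourcegraph's actual cache code): periodically scan the cache directory and delete the least recently used entries until the total size is back under a configured budget.

```go
// Package boundedcache is a hypothetical sketch of a size-bounded on-disk cache
// janitor; it is not Sourcegraph's actual cache code.
package boundedcache

import (
	"os"
	"path/filepath"
	"sort"
)

// EvictToBudget deletes the least recently modified files under dir until the
// total size of the cache drops below maxBytes.
func EvictToBudget(dir string, maxBytes int64) error {
	type entry struct {
		path string
		info os.FileInfo
	}
	var (
		files []entry
		total int64
	)

	// Collect every regular file in the cache along with its size.
	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		files = append(files, entry{path, info})
		total += info.Size()
		return nil
	})
	if err != nil {
		return err
	}

	// Oldest entries first: a rough LRU based on modification time.
	sort.Slice(files, func(i, j int) bool {
		return files[i].info.ModTime().Before(files[j].info.ModTime())
	})

	// Delete old entries until the cache fits within the budget.
	for _, f := range files {
		if total <= maxBytes {
			break
		}
		if err := os.Remove(f.path); err == nil {
			total -= f.info.Size()
		}
	}
	return nil
}
```

A service could call something like EvictToBudget(cacheDir, budget) from a background goroutine; with a budget safely below the pod's ephemeral-storage limit, kubelet evictions should become much rarer.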

aisbaa commented 4 years ago

> It's difficult to have a one-size-fits-all solution for this using ephemeral storage limits / requests alone.

We migrated from nodes with 50-100 GB disks to nodes with 500-1000 GB disks in the hope that this issue would be alleviated, but we still see it happen. Judging from the billing report, disk is not that expensive, and all those gigabytes would be wasted if we don't use them. To sum up, it is okay to have high limits for ephemeral storage.

uwedeportivo commented 4 years ago

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.17 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly. When in doubt, reach out!

Thank you

slimsag commented 4 years ago

See https://github.com/sourcegraph/sourcegraph/issues/8308 for next steps here.

uwedeportivo commented 4 years ago

Got a report from another customer whose symbols pod was getting evicted because the ephemeral-storage limit was reached; it was set to 8Gi.

pecigonzalo commented 3 years ago

@davejrt I saw the linked PR. As far as I recall from the planning meeting, we were also unsure whether this was actually required: the reported issue talks about logs (I suspect logs written to files), and as far as I understand, STDOUT logs do not use ephemeral storage. So it must be some services storing cache or other content outside of the mounted volumes, which I believe is unexpected.

daxmc99 commented 3 years ago

@pecigonzalo I believe local ephemeral storage is also used for node-level container logs. Container engines log stdout & stderr to ephemeral storage, according to https://kubernetes.io/docs/concepts/cluster-administration/logging/#logging-at-the-node-level

pecigonzalo commented 3 years ago

@daxmc99 I believe that is separate from pod ephemeral storage: it's ephemeral in the sense that it's not persisted if the node fails, but I don't think it counts towards the pod-specific ephemeral storage. I'm trying to find some docs to validate/refute that.

pecigonzalo commented 3 years ago

As an additional note: if the problem is logs, IMO we should not allocate ephemeral storage for them by default. Instead, they should be managed by the native container log rotation (daemon config for Docker) or by the end user, as we can't predict the log size given that it depends on user activity.
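For reference, a minimal example of that kind of native rotation, assuming Docker's default json-file log driver; the sizes are illustrative and this would live in /etc/docker/daemon.json on each node:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}
```

With settings like these, each container's stdout/stderr logs are capped at roughly 3 x 100 MB regardless of user activity.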

pecigonzalo commented 3 years ago

Another note :). As an example, we never ran into this problem on sourcegraph.com, and we don't have ephemeral mounts there. Maybe it is used for logs, but the container engine should in most cases rotate those logs, which would keep them in check.

pecigonzalo commented 3 years ago

This https://access.redhat.com/solutions/4367311 validates what you said, that ephemeral storage is indeed used for the STDOUT/STDERR logs, as well as my assumption that the container engine is expected to keep them in check.

daxmc99 commented 3 years ago

According to the GKE docs this is also a fairly new feature, so not every cluster would respect ephemeral storage limits if we set them. On Cloud today we allocate 500 GB boot disks, with about 39 GB currently used for everything under /var/ (where logs are stored). Since we roll pods fairly often on Cloud, I'm not sure that is a good proxy for a long-running customer environment.

pecigonzalo commented 3 years ago

@daxmc99 But shouldn't CRI-O/Docker take care of rotating those and keeping them in check (for logs specifically)? I don't know; I can't see any other public manifest that includes ephemeral storage either.

As said, my understanding here is that if we need scratch space, we should allocate ephemeral storage, probably through emptyDir, but not for logs.
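For illustration, a minimal sketch of that emptyDir approach; the volume name, mount path, and size limit here are hypothetical:

```yaml
# Hypothetical fragment of a pod spec; names, mount path, and sizeLimit are illustrative.
spec:
  containers:
    - name: symbols
      volumeMounts:
        - name: cache
          mountPath: /mnt/cache
  volumes:
    - name: cache
      emptyDir:
        sizeLimit: "10Gi" # the kubelet evicts the pod if the emptyDir grows past this
```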

daxmc99 commented 3 years ago

Agreed. It should be the job of the container runtime to rotate logs and prevent them from growing too large, but ultimately they are still factored into eviction decisions.

Do we know why the frontend needs this cache space?