markan opened this issue 4 years ago
Our bldr service uses an external caching service to help lower latency for our customers. It is generally believed to be helpful, but we don't have good insight into what it is actually doing for us.
A large fraction of our API responses are marked private to prevent caching. On average, about 11% of our request volume (by call count) is actually served from the cache. However, that cached content still amounts to a sizable volume of data.
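As a quick sanity check on what the origin even allows to be cached, the `Cache-Control` response header can be inspected directly. A minimal sketch, assuming Python 3; both URLs are illustrative examples (core/hab is just a stand-in package), and the private/no-store check is a simplification of real caching rules:

```python
# Check which depot responses the origin marks as cacheable.
import urllib.request

ENDPOINTS = [
    "https://bldr.habitat.sh/v1/depot/channels/core/stable/pkgs/core/hab/latest?target=x86_64-linux",
    "https://bldr.habitat.sh/v1/depot/channels/core",
]

for url in ENDPOINTS:
    with urllib.request.urlopen(url, timeout=30) as resp:
        cache_control = resp.headers.get("Cache-Control", "<none>")
    # A response marked "private" or "no-store" will not be stored by the caching service.
    cacheable = "private" not in cache_control and "no-store" not in cache_control
    print(f"cacheable={cacheable!s:<5}  Cache-Control: {cache_control:<30}  {url}")
```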
The objectives of this work are twofold:
- Understand exactly what we do cache, and the impact of the request volume it avoids. Not caching would mean fewer moving parts and make troubleshooting easier. Could we handle the volume without caching? Improvements that reduce the need to cache might also improve scaling for on-prem Builder instances.
- Understand what we do not cache, and the impact of that traffic going through the caching service. Multiple layers of miss resolution appear to occur before a request actually reaches Builder. Is it worthwhile to disable caching entirely in the service? What benefits would streamlining this path bring?
Tasks that likely need to be performed include:
- Devise a measure of our external responsiveness. We have good visibility into our internal API. Getting various packages via https://bldr.habitat.sh/v1/depot/channels/core/stable/pkgs/$pkg/latest?target=x86_64-linux exercises our non-cached path, but we need to make sure we cover both cached and non-cached resources. This should be done from somewhere outside us-west2 (see the probe sketch after this list).
- Experiment with filter rules to turn off caching for various portions of our API. This gives us an incremental knob for exploring the impact of caching, both on external responsiveness and on our internal load. Experiments should include:
  - Use filters to turn off caching for resources that are marked private and are thus not cached anyway. This might speed things up by avoiding miss logic that queries higher-level servers.
  - Use filters to turn off caching for resources that are marked cacheable and should benefit from caching. The assumption is that those APIs will become slower and increase load on the server; we should measure how much impact that actually has. The server has been substantially rewritten since we started caching, and the hot spots have likely changed.
- Measure how much of the traffic is package downloads vs. API calls, both in terms of request count and data volume (see the log-analysis sketch after this list).
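A minimal sketch of the external responsiveness probe, assuming Python 3 and intended to run from a host outside us-west2. The endpoint list, the example package name, and the cache-indicator headers (`X-Cache`, `Age`) are assumptions that depend on which caching service fronts bldr:

```python
# Time several GETs against cached and non-cached bldr endpoints.
import time
import urllib.request

ENDPOINTS = {
    # non-cached "latest" lookup called out above (core/hab is just an example package)
    "latest (uncached)": "https://bldr.habitat.sh/v1/depot/channels/core/stable/pkgs/core/hab/latest?target=x86_64-linux",
    # a listing endpoint, assumed to be served from cache
    "channel listing (cached?)": "https://bldr.habitat.sh/v1/depot/channels/core",
}

def probe(url, samples=5):
    """Time several requests to one URL and report cache-related headers."""
    for _ in range(samples):
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=60) as resp:
            resp.read()  # include full transfer time, not just time-to-first-byte
            elapsed = time.monotonic() - start
            x_cache = resp.headers.get("X-Cache", "-")  # assumed CDN hit/miss header
            age = resp.headers.get("Age", "-")          # Age > 0 usually indicates a cached copy
            status = resp.status
        print(f"{elapsed * 1000:8.1f} ms  status={status}  X-Cache={x_cache}  Age={age}")

for name, url in ENDPOINTS.items():
    print(f"--- {name}: {url}")
    probe(url)
```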
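And a rough sketch of the download-vs-API split, assuming access logs roughly in combined log format are available; the regex and the `/download` path suffix used to identify artifact downloads are assumptions about the log format and the depot API's URL shape:

```python
# Split request traffic into package downloads vs. other API calls.
# Pipe the access log in on stdin, e.g.:  python3 traffic_split.py < access.log
import re
import sys
from collections import defaultdict

LINE_RE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" \d{3} (?P<bytes>\d+|-)')

counts = defaultdict(int)
bytes_sent = defaultdict(int)

for line in sys.stdin:
    m = LINE_RE.search(line)
    if not m:
        continue
    path = m.group("path").split("?", 1)[0]
    # Assumed URL shape: artifact downloads end in "/download"; everything else is an API call.
    kind = "download" if path.endswith("/download") else "api"
    counts[kind] += 1
    if m.group("bytes") != "-":
        bytes_sent[kind] += int(m.group("bytes"))

for kind in ("download", "api"):
    print(f"{kind:>9}: {counts[kind]:>12} requests  {bytes_sent[kind]:>16} bytes")
```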
Aha! Link: https://chef.aha.io/features/APPDL-37