Rewrite DiskCache to allow shared DiskCache Directory. Add heartbeats.

bytesized commented 7 years ago

Some requests were made for Snappy's features. This PR should address them.

@willkg Think you could review all this for me?

willkg commented 7 years ago

I think I can get to this before the end of tomorrow. Is that ok, or do you need it sooner?

bytesized commented 7 years ago

Tomorrow should be fine.

willkg commented 7 years ago

So... I don't understand the premise for the diskcache changes. Seems like you want to get this system working in an ops-managed environment (AWS or something) and that will involve multiple processes (and multiple nodes?). If that's the case, why use a disk cache at all? Why not just do everything in memcache?

bytesized commented 7 years ago

As for why ops wants it to work this way, I will leave that to ops to explain, since they understand it better than I (they have been pointed towards this post). As for why we still want a DiskCache in this case, there are several reasons:

Retrieving files from the DiskCache directory is still much faster than requesting them from the Symbol Server, so the DiskCache is still providing caching.
The DiskCache strips unnecessary info out of the symbol files and sorts the saved data making lookups quicker and the data files smaller.
Using only memcached means one of these options: Adding all symbols in the file to memcached, including the thousands of ones that the request didn't use, or re-downloading the same file every time a new symbol from the same file is used.

bytesized commented 7 years ago

Just as a warning. I have been asked to make one quick addition to this PR. Heartbeat requests will respond with a JSON object noting what is accessible.

bytesized commented 7 years ago

Well, I added some heartbeat information, but I left it undocumented because I realized that it does not return a response on failure. Unfortunately this means that the response will always be exactly the same response, or an HTTP 500. I'm afraid this is the best I can do without larger changes to allow responses to be returned to server errors.

willkg commented 7 years ago

First off, as a reviewer, I find it frustrating to review changes for two different features intertwined with one another when I don't see a good reason for it to be so. It would have been easier on me if you had done a PR for the heartbeat change and a separate PR for the diskcache changes.

I spent an hour reading through the code mulling over what's going on and how it affects the grand scheme of things, but it's difficult. The code lacks unit tests, there's no documentation about how it's supposed to work, there's no explanation in the commit messages or the PR description and the implementation doesn't follow an obvious structure for caching that I'm aware of. Couple that with the choice of using sqlite3 to manage multiple processes managing the disk cache--that's tricky and it makes it really hard to determine whether this implementation is correct.

Backing up a step, what are the requirements? How is this being run? What are the expectations for this code? What's the high level API for how this works? Do the existing integration tests cover all the code paths? Why don't classes, methods and functions have docstrings explaining what they do and how to use them?

Micheletto commented 7 years ago

If that's the case, why use a disk cache at all? Why not just do everything in memcache?

As for why we still want a DiskCache in this case, there are several reasons: ... Using only memcached means one of these options: Adding all symbols in the file to memcached, including the thousands of ones that the request didn't use, or re-downloading the same file every time a new symbol from the same file is used.

I'll preface this response by stating that I have no knowledge of the access patterns for this data. We'll collect that information after the service is running and may come to some different architectural conclusions later. I'm not committed to using DiskCache if it makes sense not to, however it seems like a sane intermediate step.

What are your requirements for service request time?

I expect that running everything directly out of memcached, by populating every symbol every time, will lengthen queue times significantly and require an otherwise larger memcached instance size. But perhaps it would be within your toleration and maybe cheaper or as cost effective after removing the NFS volume from the stack. I don't have the necessary metrics to answer this question right now.

Likewise, I would expect fetching a file from S3 every time a new symbol is needed would lengthen average service times and increase S3 API call fees. Again, perhaps that would still fall within your expected normal average service time and be in the same general cost area.

ddurst commented 7 years ago

I'm a little confused, so bear with me:

We don't yet know the patterns for this either. Scaling is unknown. We've never had instrumentation to measure performance (hence the desire to move this to Ops purview).

The previous version of this quite literally was one machine with some limited caching. It would get files from S3, cache them, and only re-get them from S3 as necessary (if they got expired from the cache, etc).

So ignoring any notion of whether we'd scale by increasing servers that share one cache or by increasing cache (since we don't know how well the server handles response load), is there an advantage to changing DiskCache's setup before we see even how the vanilla version of this behaves?

I think we have two big learning curves: 1) webdev isn't our bag, so all of the dockerflow stuff is new/voodoo to us, 2) performance of this rewrite is unknown, and it is intended to be the first measurable version of symbolapi.m.o.

So perhaps I'm proposing that we get the Docker part verified/approved, and pursue other architectural changes once we have at least baseline numbers. I realize that's not ideal, but it's probably sane.

Thoughts? @Micheletto

Micheletto commented 7 years ago

So ignoring any notion of whether we'd scale by increasing servers that share one cache or by increasing cache (since we don't know how well the server handles response load), is there an advantage to changing DiskCache's setup before we see even how the vanilla version of this behaves?

Let's keep DiskCache until and unless we have performance metrics or cost reasons to do otherwise. Are you asking whether we need to share the DiskCache across servers? Yes we do. That's a best practice that enables everything we do to run a service smoothly.

I think we have two big learning curves: 1) webdev isn't our bag, so all of the dockerflow stuff is new/voodoo to us, 2) performance of this rewrite is unknown, and it is intended to be the first measurable version of symbolapi.m.o.

Once the Dockerflow pieces are in place, everything will just fit into our infrastructure. If it would be helpful there are plenty of pythonic examples I can point to. Also, are there any load test scripts for this service?

Looking forward to getting this up and running and sharing some performance dashboards.

bytesized commented 7 years ago

@Micheletto There are is a fairly simplistic load test script that lives here. Documentation on running tests is here.

bytesized commented 7 years ago

@willkg I'm not sure that I understand all of what you are asking for.

Which requirements are you asking about? The server inputs/outputs have not changed. Ops wanted to add the requirements that DiskCache processes be able to share their cache directory and heartbeat requests should be responded to as specified here.

My understanding for how it will be run is the same as the instructions for Quick Start without Docker. Ops will handle running it in a container themselves.

The high-level API has not changed. It is the same as is documented here.

Existing integration tests do test the DiskCache. See here.

I felt that the classes, methods and functions were sufficiently self-documenting. In a few cases I did add docstrings when I felt something was unclear. Is there additional documentation that you would like me to add?

mozilla / Snappy-Symbolication-Server

Rewrite DiskCache to allow shared DiskCache Directory. Add heartbeats. #67