Closed: ckaran closed this issue 2 years ago
This is actually a really good idea. I think that we need to change to a "builder" pattern though for this, which I've been wanting to do for a while as the current "new" functions are getting a bit out of control 😂
Perhaps if you're restarting with these new values from the previous stats, we could also add some better tuning heuristics from the loaded stats? That way it improves the "self tuning" nature.
What do you think?
This is actually a really good idea. I think that we need to change to a "builder" pattern though for this, which I've been wanting to do for a while as the current "new" functions are getting a bit out of control 😂
The builder pattern is a really good idea! It actually solves a lot of issues with ingesting the stats, like if you decide to deprecate a particular stat, or improve validation of the requested stats before a cache is built. It also makes it simple to create sensible defaults, which is something that a new user probably won't know how to do. I like it!
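To make concrete what I'm picturing, here's a rough sketch of that kind of builder (all names and defaults here are made up for illustration, not the real `ARCacheBuilder` API):

```rust
// Made-up sketch of a builder with sensible defaults and one validation point.
#[derive(Clone)]
pub struct CacheBuilder {
    max_entries: usize,
    reader_cache_size: usize,
}

impl Default for CacheBuilder {
    fn default() -> Self {
        // A new user can just take the defaults and call build().
        CacheBuilder { max_entries: 2048, reader_cache_size: 64 }
    }
}

impl CacheBuilder {
    pub fn max_entries(mut self, n: usize) -> Self {
        self.max_entries = n;
        self
    }

    // All validation of the requested configuration happens here, before any
    // cache is constructed; deprecated options can be mapped or rejected in
    // this one spot.
    pub fn build(&self) -> Result<Cache, &'static str> {
        if self.max_entries == 0 {
            return Err("max_entries must be non-zero");
        }
        Ok(Cache { conf: self.clone() })
    }
}

pub struct Cache {
    conf: CacheBuilder,
}
```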
Perhaps if you're restarting with these new values from the previous stats, we could also add some better tuning heuristics from the loaded stats? That way it improves the "self tuning" nature.
That was pretty much exactly what I was thinking. As an end user though, I have no idea what stats are the most appropriate to use. I'm trusting that you know what you're doing and get it all right! 😉
I saw PR #70, I like it. I do have the following ideas/suggestions though:

- Make the `build` method non-consuming.
- Alter `ARCacheBuilder` so it completely subsumes everything that the cache stats do. That is, `CacheStats` becomes an internal implementation detail of `ARCacheBuilder`.
- Modify `view_stats` so that `view_stats(&self) -> CowCellReadTxn<ARCacheBuilder>`.
- Profit! `let new_cache = old_cache.view_stats().read().build();` (rough sketch of what I mean just below)
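Roughly the flow I'm imagining, as a sketch (every name here is hypothetical, and I've used a plain `RwLock` as a simplified stand-in for the `CowCell` types):

```rust
use std::sync::{Arc, RwLock};

// Hypothetical types only; this is not the real ARCache/ARCacheBuilder API.
#[derive(Clone, Default)]
pub struct Builder {
    max_entries: usize,
    p_weight: f64,
}

impl Builder {
    // Non-consuming build: the same builder can be inspected, tweaked, and
    // reused to construct another cache later.
    pub fn build(&self) -> Cache {
        Cache {
            tuning: Arc::new(RwLock::new(self.clone())),
        }
    }
}

pub struct Cache {
    // The cache keeps its live tuning/stats in builder form.
    tuning: Arc<RwLock<Builder>>,
}

impl Cache {
    // view_stats hands back a snapshot that is itself a builder, so a new
    // cache can be constructed straight from it.
    pub fn view_stats(&self) -> Builder {
        self.tuning.read().unwrap().clone()
    }
}

// Usage: restart with whatever tuning the old cache has learned.
// let new_cache = old_cache.view_stats().build();
```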
This will also let you do some more interesting/weirder things in the future (if you want to). Assume that you've implemented serialization on `ARCacheBuilder`. Now I (as a developer) can spend time running test sets through my code, optimizing the `ARCacheBuilder`. Before I ship my product out, I serialize out all the current `ARCacheBuilder` values. When the end user gets my product, it's already partially tuned. Tuning while running will result in some improvement, but won't require potentially waiting a long time to get it right (the expectation is that the shipped `ARCacheBuilder` is close to optimal even on a different machine).
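As a sketch of that workflow (none of these names come from concread, and `serde_json` is only a stand-in for whatever format you'd actually pick):

```rust
use serde::{Deserialize, Serialize};

// Hypothetical bundle of tuned values shipped alongside the product.
#[derive(Serialize, Deserialize, Clone, Default)]
struct ShippedTuning {
    max_entries: usize,
    p_weight: f64,
}

// At development time: after running test workloads, dump the tuned values
// into a file that ships with the product.
fn export_tuning(t: &ShippedTuning) -> std::io::Result<()> {
    std::fs::write("tuning.json", serde_json::to_string_pretty(t).unwrap())
}

// On the end user's machine: start from the bundled values rather than from
// blank defaults; self-tuning then refines them for the local workload.
fn starting_tuning(path: &str) -> ShippedTuning {
    std::fs::read_to_string(path)
        .ok()
        .and_then(|s| serde_json::from_str(&s).ok())
        .unwrap_or_default()
}
```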
I hope all that makes sense. If not, I can write up some user stories outlining what I'm thinking about.
The builder pattern is a really good idea! It actually solves a lot of issues with ingesting the stats, like if you decide to deprecate a particular stat, or improve validation of the requested stats before a cache is built. It also makes it simple to create sensible defaults, which is something that a new user probably won't know how to do. I like it!
Excellent!
Perhaps if you're restarting with these new values from the previous stats, we could also add some better tuning heuristics from the loaded stats? That way it improves the "self tuning" nature.
That was pretty much exactly what I was thinking. As an end user though, I have no idea what stats are the most appropriate to use. I'm trusting that you know what you're doing and get it all right! 😉
I'm trusting that I know what I'm doing too 🤣 For now I'm only carrying through the p-weight (which is the balance between the frequent and recent lists), but in the future we could actually use it to decide whether per-thread read caches are needed, based on the read hit rates. We may need to extend the stats for this though.
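Purely as an illustration, such a heuristic could be as simple as something like this (invented stat names and an arbitrary threshold, not something that exists today):

```rust
// Illustrative only: the stat names are invented for this sketch.
struct ReadStats {
    reader_hits: u64,
    reader_lookups: u64,
}

// Decide whether per-thread read caches look worthwhile from observed hit
// rates; the 0.25 threshold is arbitrary for the example.
fn wants_thread_local_read_cache(s: &ReadStats) -> bool {
    s.reader_lookups > 0 && (s.reader_hits as f64 / s.reader_lookups as f64) > 0.25
}
```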
I saw PR #70, I like it. I do have the following ideas/suggestions though:
- Make the `build` method non-consuming.

Builds aren't really a common thing; is there a reason to make it non-consuming? Not that I'm super opposed to this either. It's a once-off thing, it's not gonna change much.
- Alter `ARCacheBuilder` so it completely subsumes everything that the cache stats do. That is, `CacheStats` becomes an internal implementation detail of `ARCacheBuilder`.
- Modify `view_stats` so that `view_stats(&self) -> CowCellReadTxn<ARCacheBuilder>`.

Interesting idea, these two. I think I'd rather that the stats stay separate to the builder, but I can kind of see where you're going here. Not everyone will use that pattern, and it's not a long step to just move the stats into the builder as is. But we can also refine it in the future too.
- Profit! `let new_cache = old_cache.view_stats().read().build();`
This will also let you do some more interesting/weirder things in the future (if you want to). Assume that you've implemented serialization on `ARCacheBuilder`. Now I (as a developer) can spend time running test sets through my code, optimizing the `ARCacheBuilder`. Before I ship my product out, I serialize out all the current `ARCacheBuilder` values. When the end user gets my product, it's already partially tuned. Tuning while running will result in some improvement, but won't require potentially waiting a long time to get it right (the expectation is that the shipped `ARCacheBuilder` is close to optimal even on a different machine).
Now I DID think about this one! The issue with serialising the stats is that if we change the stats format, we then have a breaking API change! Whereas currently they are semi-opaque, and we can keep them as an internal detail.
If you wanted to be able to pre-tune things, I think there could be better ways to express that than stats; the idea of pre-tuning is certainly interesting. But part of the idea of this structure is that it "self corrects", and really there are very few things that actually do need tuning. Generally it's the cache size, whether thread-local readers exist, and how big they are. These are all already configurable. Even moving the p-weight across isn't going to be a "big tuning" option, since that will adjust over time on any consumer's deployment based on their actual workload vs your development one.
I hope all that makes sense. If not, I can write up some user stories outlining what I'm thinking about.
It does make sense! I think the main benefit today of the stats being passed in is that you don't lose your metrics during runtime. I think the tuning ideas can grow from here now that we have a builder pattern and can experiment a bit more. :)
If you wanted I can certainly go through with you what the actual parameters are that we can tune so we can work out what can be done here.
Actually, now that I think about it more, carrying the p-weight over is a bad idea, so I might change that for now.
Now I DID think about this one! The issue with serialising the stats is that if we change the stats format, we then have a breaking API change! Whereas currently they are semi-opaque, and we can keep them as an internal detail.
No argument there! The serialization format used would have to include version numbers internally (maybe protocol buffers?), so that newer versions are able to read older saved files. Beyond that, the fallback option is that if something can't be parsed, then we rebuild with whatever the current 'sensible defaults' are. That way the tuning parameters themselves are sort of like cached values; useful to have, but not critical as the 'real' values can be obtained through other means.
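For example, something along these lines (a sketch only; the struct, fields, and use of `serde_json` are all assumptions, not anything concread defines):

```rust
use serde::{Deserialize, Serialize};

const TUNING_FORMAT_VERSION: u32 = 1;

// Hypothetical on-disk tuning format carrying its own version number.
#[derive(Serialize, Deserialize)]
struct TuningFile {
    version: u32,
    max_entries: usize,
    p_weight: f64,
}

impl Default for TuningFile {
    fn default() -> Self {
        // The 'sensible defaults' we rebuild from when parsing fails.
        TuningFile { version: TUNING_FORMAT_VERSION, max_entries: 1024, p_weight: 0.5 }
    }
}

fn load(raw: &str) -> TuningFile {
    match serde_json::from_str::<TuningFile>(raw) {
        Ok(t) if t.version == TUNING_FORMAT_VERSION => t,
        // Unknown version or parse failure: treat the tuning like a stale
        // cached value and fall back to defaults.
        _ => TuningFile::default(),
    }
}
```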
But part of the idea of this structure is that it "self corrects", and really there are very few things that actually do need tuning.
Yup, that's what I'm counting on! 😁 Even if someone passes in utterly horrible tuning statistics, the cache should be able to rapidly correct for this.
If you wanted I can certainly go through with you what the actual parameters are that we can tune so we can work out what can be done here.
Yes, I would be interested in that. You've got me thinking about how to maximize the rate of improvement from horrible to good performance, and about the best way of adapting when something outside of your control changes (e.g., your process is running fine, but the user starts up another process that consumes a large amount of RAM/CPU, which incidentally causes some of your address space to get swapped out; how fast can you recover from what is now 'bad' back to something 'good'?).
Yup, that's what I'm counting on! 😁 Even if someone passes in utterly horrible tuning statistics, the cache should be able to rapidly correct for this.
It already does. And by the nature of the concurrency, it's actually better than otherwise: https://github.com/kanidm/concread/blob/master/CACHE.md
If you wanted I can certainly go through with you what the actual parameters are that we can tune so we can work out what can be done here.
Yes, I would be interested in that. You've got me thinking about how to maximize the rate of improvement from horrible to good performance, and about the best way of adapting when something outside of your control changes (e.g., your process is running fine, but the user starts up another process that consumes a large amount of RAM/CPU, which incidentally causes some of your address space to get swapped out; how fast can you recover from what is now 'bad' back to something 'good'?).
Well, the issue there is that as a process we don't know we've been swapped. Page faults are transparent, so we'd just have a stall. The only way we'd be able to make an informed choice is if we were sent a signal to release memory (I don't remember if that's a thing or not) or if we supported dynamic cache size reduction. So we can't do much here honestly. All we can do is have the hashmap and other parts attempt to be cache aligned and sized to assist the prefetcher (already done), and use SIMD to speed up key comparison (needs an update due to std::simd in nightly).
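To illustrate the alignment part, the general shape is something like this toy example (not the actual layout used here):

```rust
// Force a value onto its own 64-byte cache line so adjacent entries don't
// share a line; this is only the general shape of the alignment trick, not
// the real hashmap layout.
#[repr(align(64))]
struct CacheLinePadded<T>(T);

// e.g. per-thread hit counters that won't false-share a cache line:
struct PerThreadHits {
    counters: Vec<CacheLinePadded<u64>>,
}
```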
Generally though, arcache here is pacman hungry due to how we have to maintain temporal ordering of elements. It's not designed for smaller deployments so much as for a large application with access to a lot of memory. We'll also have issues with high rates of key churn, since we can never release keys, only values, due to how we have to preserve time ordering.
The main way we could actually improve our memory footprint is with hazard pointers and doing clean-ups of haunted cache items (keys that did exist and that we need to pin for temporal ordering reasons). Currently we don't know what point in time reader transactions are at, but with HP we would know exactly that each reader has progressed past point X, which would allow us to remove any haunted key older than that point, potentially letting the hashmap shrink and freeing the older keys.
So I'd say that's the major thing we need if we want to improve this further: hazard pointers.
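As a very rough sketch of the idea (invented names, and a deliberately simplified stand-in for real hazard pointers):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Each reader slot publishes the newest generation it has observed.
struct ReaderSlots {
    slots: Vec<AtomicU64>,
}

impl ReaderSlots {
    fn new(readers: usize) -> Self {
        ReaderSlots {
            slots: (0..readers).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    fn publish(&self, reader: usize, generation: u64) {
        self.slots[reader].store(generation, Ordering::Release);
    }

    // The writer can safely drop haunted keys older than the minimum
    // generation that any reader might still be looking at.
    fn min_generation(&self) -> u64 {
        self.slots
            .iter()
            .map(|s| s.load(Ordering::Acquire))
            .min()
            .unwrap_or(0)
    }
}
```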
A minor optimisation also exists in selecting the size for bounding the reader channels, as if this grows too large it can also cause stalls on commit/quiesce calls.
There are a number of benchmarks in the project, which I intend to expand over time, which could help show some of these behaviours. :)
It already does. And by the nature of the concurrency, it's actually better than otherwise: https://github.com/kanidm/concread/blob/master/CACHE.md
Nice article! It reminds me of what im and contrie do, but yours seems to be more advanced as it uses thread-local storage, whereas im and contrie use atomic pointer operations.
Well, the issue there is that as a process we don't know we've been swapped. Page faults are transparent, so we'd just have a stall. The only way we'd be able to make an informed choice is if we were sent a signal to release memory (I don't remember if that's a thing or not) or if we supported dynamic cache size reduction. So we can't do much here honestly. All we can do is have the hashmap and other parts attempt to be cache aligned and sized to assist the prefetcher (already done), and use SIMD to speed up key comparison (needs an update due to std::simd in nightly).
Yeah, that's why I was thinking about periodically building a new cache. The idea is that you might keep data on how the cache is performing over time, so that whenever it starts to slow down you can build a new cache. This may also make it possible to deal with the haunted caches in a simple manner by having a type that transparently shifts over from the old cache to the new cache, dropping the old cache when it's no longer in use. Hazard pointers are more elegant and likely better in the longer term, but if you create a new type that wraps the caches and hides the implementation details, you could start out with this approach, and then switch to hazard pointers later on.
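Something like this toy wrapper is the shape I have in mind (a plain `HashMap` stands in for `ARCache` here; nothing about this is the real API):

```rust
use std::collections::HashMap;
use std::sync::Arc;

type SimpleCache = HashMap<String, String>;

// Toy wrapper for the "roll over to a fresh cache" idea: serve from the new
// cache, falling back to the old one until it's no longer needed.
struct RollingCache {
    new: SimpleCache,
    // Kept only until callers stop needing values that haven't been re-fetched.
    old: Option<Arc<SimpleCache>>,
}

impl RollingCache {
    fn get(&self, k: &str) -> Option<&String> {
        self.new
            .get(k)
            .or_else(|| self.old.as_deref().and_then(|o| o.get(k)))
    }

    fn retire_old(&mut self) {
        // Once the new cache is warm, drop the old one entirely.
        self.old = None;
    }
}
```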
I just realized that the haunted set and temporal ordering guarantees remind me a LOT of operation-based CRDTs. Maybe some of the research that's been done there could be of help?
It already does. And by the nature of the concurrency, it's actually better than otherwise: https://github.com/kanidm/concread/blob/master/CACHE.md
Nice article! It reminds me of what im and contrie do, but yours seems to be more advanced as it uses thread-local storage, whereas im and contrie use atomic pointer operations.
Exactly - im and contrie are both atomic per operation, whereas everything in concread is transactional over many operations that succeed or fail together. That's the primary difference between the two library approaches, and they have very different use cases!
Well, the issue there is that as a process we don't know we've been swapped. Page faults are transparent, so we'd just have a stall. The only way we'd be able to make an informed choice is if we were sent a signal to release memory (I don't remember if that's a thing or not) or if we supported dynamic cache size reduction. So we can't do much here honestly. All we can do is have the hashmap and other parts attempt to be cache aligned and sized to assist the prefetcher (already done), and use SIMD to speed up key comparison (needs an update due to std::simd in nightly).
Yeah, that's why I was thinking about periodically building a new cache. The idea is that you might keep data on how the cache is performing over time, so that whenever it starts to slow down you can build a new cache. This may also make it possible to deal with the haunted caches in a simple manner by having a type that transparently shifts over from the old cache to the new cache, dropping the old cache when it's no longer in use. Hazard pointers are more elegant and likely better in the longer term, but if you create a new type that wraps the caches and hides the implementation details, you could start out with this approach, and then switch to hazard pointers later on.
The main cost there is "when" do we rebuild? We'd need the user to run a cleanup/housekeeping thread to periodically do this, or they would need to be responsible for calling it. We'd also pay a large cost in mem-copy to move content between the old and new cache, and for a short time we'd likely lose our CPU cache.
I think before I go all in on something like this it'd be good to explore workloads that do show this kind of progressive slowdown over time. And if I was going to commit the effort, I think HP sounds better as a longer-term solution anyway :) I've been thinking this morning about how to make these work a bit more nicely with the haunted sets to make them cheaper, but currently I have a stack of work, so I may not get time to write this for a little while.
EDIT
I just realized that the haunted set and temporal ordering guarantees remind me a LOT of operation-based CRDTs. Maybe some of the research that's been done there could be of help?
Interesting! I'd never heard of these, I'll have to look into it :)
Exactly - im and contrie are both atomic per operation where everything in concread is transactional over many operations that succeed or fail together. That's the primary difference between the two library approaches, and they have very different use cases!
I like concread's approach, it's more flexible than the others. I think it lends itself well to something like what I was talking about in #71, as well as other possible approaches.
The main cost there is "when" do we rebuild? We'd need the user to run a cleanup/housekeeping thread to periodically do this, or they would need to be responsible for calling it. We'd also pay a large cost in mem-copy to move content between the old and new cache, and for a short time we'd likely lose our CPU cache.
I think before I go all in on something like this it'd be good to explore workloads that do show this kind of progressive slowdown over time. And if I was going to commit the effort, I think HP sounds better as a longer-term solution anyway :) I've been thinking this morning about how to make these work a bit more nicely with the haunted sets to make them cheaper, but currently I have a stack of work, so I may not get time to write this for a little while.
100% agree with you on all points! This approach has a lot of drawbacks and only one advantage: it requires no changes to `ARCache` itself. That means that you can concentrate on putting together a public-facing API with the knowledge that the guts will need to be reworked at some point in the future. That said, if you would rather work on the guts first and the API later, that's a good approach too.
Interesting! I'd never heard of these, I'll have to look into it :)
CRDTs in general are a fascinating topic, and (at least from a theoretical point of view) pair well with immutable data structures. Before I started looking at concread, I was actually using the ideas from them to develop a concurrently writable database that was written to directly by each process (you don't connect to a database engine; every process interacts with the database's contents directly, without locking and without messing up other processes; I never got beyond the sketching phase). That said, I think I'm starting to diverge from this issue's topic, and into #71 a bit, so I'll put comments regarding design over there. 😉
I like concread's approach, it's more flexible than the others. I think it lends itself well to something like what I was talking about in #71, as well as other possible approaches.
Yeah, it's just different is all. There are valid use cases for both im/concread, but I'm glad you appreciate concread existing!
100% agree with you on all points! This approach has a lot of drawbacks and only one advantage: it requires no changes to `ARCache` itself. That means that you can concentrate on putting together a public-facing API with the knowledge that the guts will need to be reworked at some point in the future. That said, if you would rather work on the guts first and the API later, that's a good approach too.
I'm pretty burnt by "quick fix, we'll do it right later" people, because they never fix it later. So I tend to be a "let's get it right and do our best" type these days, so I'd rather change the internals. The HP approach also needs no extra changes to the external API, so there is that ....
Interesting! I'd never heard of these, I'll have to look into it :)
CRDTs in general are a fascinating topic, and (at least from a theoretical point of view) pair well with immutable data structures. Before I started looking at concread, I was actually using the ideas from them to develop a concurrently writable database that was written to directly by each process (you don't connect to a database engine; every process interacts with the database's contents directly, without locking and without messing up other processes; I never got beyond the sketching phase). That said, I think I'm starting to diverge from this issue's topic, and into #71 a bit, so I'll put comments regarding design over there. 😉
Still cool. Always happy to email and chat OT if you want. william at blackhats dot net dot au. :)
Yeah, it's just different is all. There are valid use cases for both im/concread, but I'm glad you appreciate concread existing!
Agreed, but the power of transactions is what makes concread magical. If you want to get the same results with im or contrie, you have to do a lot more work. BUT, as you say, they each have their own uses, you just need to know what those uses are and use them correctly.
I'm pretty burnt by "quick fix, we'll do it right later" people, because they never fix it later. So I tend to be a "let's get it right and do our best" type these days, so I'd rather change the internals. The HP approach also needs no extra changes to the external API, so there is that ....
I agree and I understand. The reason for working on the APIs first is that once public APIs are published they tend to congeal. As a result, even if you're still pre-1.0, people will start to expect that certain behavior is adhered to. So now you get a choice: figure out the API first, or figure out the guts first. There is no right answer here; a good API and horrible guts leads to a horrible user experience and a bad reputation, as people remember what your performance used to be like. A bad API and great guts leads to very little adoption because your API is hard to use. Moreover, if you realize that there are better APIs out there (like when you switched from using 'new...()' to using the builder pattern), you have to spend the time deprecating old APIs and educating users as to what the new APIs are.
Regardless of the choice of what you work on first, if you ultimately choose a 'quick fix' route and never fix it, you should be sentenced to working the support desk for your product for the rest of your life...
Still cool. Always happy to email and chat OT if you want. william at blackhats dot net dot au. :)
Likewise! cfkaran2 at gmail dot com... One word of warning though, I tend to be pretty slow answering my emails, I have a young son and he keeps me VERY busy in my non-work time...
This is a feature request for `ARCache`: please add a new associated function for building a cache from previously gathered stats. The idea is that when we have long running processes we can periodically restart our cache with better values by doing something like `let new_cache = ARCache::new_from_stats(old_cache.view_stats())` and slowly rolling over into the new cache. I haven't dug into the internals of `ARCache` enough to know if this would be a good idea or not though, so if it's a bad idea, please ignore this request.
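Roughly the shape I have in mind (the types here are stand-ins for illustration, not the real concread types):

```rust
// Hypothetical sketch of the requested constructor; `Cache` and `CacheStats`
// stand in for the real ARCache types.
#[derive(Clone, Default)]
pub struct CacheStats {
    pub max_entries: usize,
    pub p_weight: f64,
}

pub struct Cache {
    max_entries: usize,
    p_weight: f64,
}

impl Cache {
    pub fn view_stats(&self) -> CacheStats {
        CacheStats { max_entries: self.max_entries, p_weight: self.p_weight }
    }

    // The requested associated function: seed a new cache from the stats of
    // an old one instead of from fixed defaults.
    pub fn new_from_stats(stats: CacheStats) -> Self {
        Cache { max_entries: stats.max_entries, p_weight: stats.p_weight }
    }
}

// let new_cache = Cache::new_from_stats(old_cache.view_stats());
```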