Thanks for the review, all sounds sensible. Pushed some changes already but some are pending still - I'll ping you when done.
I think I'm having trouble understanding the separation of concerns you want me to achieve.
> Remove all get functions from `Operations`. Support telemetry in `Operations.fetch` just like you did with `get`. Implement `get` & co in `ConCache` on top of `fetch`.
This I think I've done, but "implement `get` & co in `ConCache`" is a bit unclear. Whilst I understand that you can now do:

```elixir
# defmodule ConCache
@spec get(t, key) :: value
def get(cache_id, key) do
  case fetch(cache_id, key) do
    :error -> nil
    {:ok, value} -> value
  end
end
```
Now, by "& co" I assume you mean `get_or_store`? So does that also mean we should get rid of `Operations.get_or_store`, so that `Operations` exposes only `isolated_get_or_store`? That way the telemetry event emission would have to be done in `ConCache` - or am I completely missing the point? Sorry, it's been a long day 😵‍💫
> I assume you mean `get_or_store`?
Yeah, I was thinking about all versions of get (including the dirty ones). If we have `fetch` (& co) in place, then all versions of get could use those functions, and just convert the result to `value | nil`, right?
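To make the conversion concrete, something like this is what I mean (a sketch only; `to_value/1` is a made-up helper name, and the dirty variants would follow the same pattern, assuming matching dirty fetch functions exist):

```elixir
# Sketch: every get variant delegates to its fetch counterpart and converts
# the result to value | nil. to_value/1 is an illustrative name, not real API.
defp to_value({:ok, value}), do: value
defp to_value(:error), do: nil

def get(cache_id, key), do: cache_id |> fetch(key) |> to_value()
```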
Sorry, totally confused. What do we keep in `Operations` then?

If `get_or_store` is implemented in `ConCache`, using local `fetch` (that delegates to `Operations`), and `dirty_get_or_store` is also implemented in `ConCache` (the actual logic this time), what's left in `Operations`? Just `value/1` and `isolated/3`? What about checking ETS type validity?
> That way `Operations` exposes a minimal low-level API for interacting with the cache which is sufficient to implement the full functionality in the `ConCache` module.

Basically, this is the boundary I'm having a hard time recognizing, I guess.
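For concreteness, here's roughly how I currently picture that boundary (just a sketch - the exact `Operations.isolated/3` arguments are my assumption, not the real signature):

```elixir
# In ConCache: get_or_store built from the local fetch/2 (which delegates to
# Operations.fetch/2) plus Operations.isolated/3 for the locked slow path.
def get_or_store(cache_id, key, store_fun) do
  case fetch(cache_id, key) do
    {:ok, value} ->
      value

    :error ->
      Operations.isolated(cache_id, key, fn ->
        # Re-check under the lock so a concurrent store isn't overwritten.
        case fetch(cache_id, key) do
          {:ok, value} ->
            value

          :error ->
            value = store_fun.()
            put(cache_id, key, value)
            value
        end
      end)
  end
end
```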
Another concern with all this is that removing `fetch_or_store` from `Operations` will effectively be a breaking change. Is that fine?
> What do we keep in `Operations` then?
Now that you ask, and looking at the code, I see little value in having this module :-D
I wasn't asking you to do such a merge, but I was proposing a part of that change, which in hindsight was needless and caused this confusion and PR bloat.
So let's instead avoid these refactorings and just focus on getting telemetry in. I can unify these two modules once this is merged.
So I propose reverting the last two commits, and then just making sure that telemetry is emitted for the fetch operations.
OK?
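Concretely, something along these lines in the fetch path (a sketch only - `do_fetch/2` stands in for the existing lookup code, and the event names are placeholders, not a decided contract):

```elixir
def fetch(cache, key) do
  case do_fetch(cache, key) do
    {:ok, _value} = hit ->
      :telemetry.execute([:con_cache, :stats, :hit], %{}, %{cache: cache})
      hit

    :error ->
      :telemetry.execute([:con_cache, :stats, :miss], %{}, %{cache: cache})
      :error
  end
end
```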
> Another concern with all this is that removing `fetch_or_store` from `Operations` will effectively be a breaking change. Is that fine?
`Operations` is not a part of the ConCache public API. It's explicitly marked with `@moduledoc false`, which makes it internal.
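In other words, the module looks roughly like this (illustrative only, contents elided):

```elixir
defmodule ConCache.Operations do
  # Excluded from the generated docs, so no compatibility guarantees apply.
  @moduledoc false
end
```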
> So I propose reverting the last two commits, and then just making sure that telemetry is emitted for the fetch operations.
@sasa1977 done, sorry for the delay - life got in my way. Please hold off with merging until I confirm that this branch works as expected on my end.
BTW, not entirely related, but just wanted to mention it: there must be some bug in global TTL eviction. We've observed an uncontrolled size growth on one of the application nodes for a cache with the following settings:
```elixir
ttl_check_interval: :timer.seconds(1),
global_ttl: :timer.minutes(30)
```
The app nodes are identical and the load balancing was verified to distribute requests evenly, so it must be something in ConCache internals.
The growth continued indefinitely - one node had allocated around 68 GiB of memory, while the other two stayed in the 3 GiB ballpark. Unfortunately, I did not manage to inspect any specific process info before the node was simply restarted. I'll provide more information if/when this happens again.
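For completeness, the cache is started roughly like this (the cache name here is illustrative, not the real one):

```elixir
# Sketch of the cache setup with the settings above.
ConCache.start_link(
  name: :my_cache,
  ttl_check_interval: :timer.seconds(1),
  global_ttl: :timer.minutes(30)
)
```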
> Please hold off with merging until I confirm that this branch works as expected on my end.
I guess what's not really ideal with this approach is that the telemetry arrives with `%{cache: %ConCache{}}` metadata and the `cache_id` is nowhere to be found, since it's effectively discarded at the `ConCache` module level. On the metrics collection end, we're mainly interested in being able to associate the stats with `cache_id`.
Another slight inconvenience is that currently it's the `Operations` module that defines the telemetry event shape - via the `Operations.telemetry_hit()` and `Operations.telemetry_miss()` functions. As you've pointed out, `@moduledoc false` makes that API effectively private. I think it would be good to have those functions extracted to a separate module, such as `ConCache.Telemetry` - so that clients can subscribe to events without having to manually duplicate the event lists or worry about breaking changes later on.
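Something along these lines (just a sketch - the event names below are placeholders for whatever this PR ends up emitting):

```elixir
defmodule ConCache.Telemetry do
  @moduledoc "Telemetry events emitted by ConCache."

  # Placeholder event names, to be replaced by the actual events.
  def hit, do: [:con_cache, :stats, :hit]
  def miss, do: [:con_cache, :stats, :miss]

  @doc "All events, convenient for :telemetry.attach_many/4."
  def events, do: [hit(), miss()]
end
```

Clients could then subscribe with `:telemetry.attach_many/4` using `ConCache.Telemetry.events()` instead of hardcoding the lists.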
I recall Oban changing the telemetry event shape under a minor release at some point. That's what I'd like to avoid from the client perspective here - the telemetry is in fact an integration contract.
WDYT?
> I guess what's not really ideal with this approach is that the telemetry arrives with `%{cache: %ConCache{}}` metadata and the `cache_id` is nowhere to be found, since it's effectively discarded at the `ConCache` module level. On the metrics collection end, we're mainly interested in being able to associate the stats with `cache_id`.
Yeah, good point! We should include cache id too, which I think can be done if we propagate the id to the lower-level operations.
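i.e. something like this (a sketch, assuming the id gets threaded down to `Operations`):

```elixir
# Metadata carries both the cache struct and its id.
:telemetry.execute(event, %{}, %{cache: cache, cache_id: cache_id})
```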
> I think it would be good to have those functions extracted to a separate module, such as `ConCache.Telemetry` - so that clients can subscribe to events without having to manually duplicate the event lists or worry about breaking changes later on.
You can do that as a part of this PR, to keep it focused. But my goal immediately after is to merge everything inside `ConCache`. There's just too little stuff here to justify the split, and the fact that we have this split now forces us to add more "micromodules".
> there must be some bug in global TTL eviction. We've observed an uncontrolled size growth on one of the application nodes
This is worrying. A few possible explanations that come to mind:
1 or 2 would be my first suspects. Just glancing at the code, I see one possible issue: the owner process performs isolated (i.e. locked) deletes for every item that needs to be expired. In a large cache this might be too slow, causing the owner to lag increasingly.
> Yeah, good point! We should include cache id too, which I think can be done if we propagate the id to the lower-level operations.
Let me know if you approve the attempt at https://github.com/sasa1977/con_cache/pull/76/commits/3ccb5d14565cb4ffb50e30ae78facb3afa1d41d1
> There's just too little stuff here to justify the split, and the fact that we have this split now forces us to add more "micromodules".
Fair enough, thanks! :+1:
How many items do you have in the cache?
On a normal day it peaks close to 600K predictably - keys are `{int(), int()}` tuples, values are structs like:
On that unlucky node over several days, it peaked at 200M, despite the traffic being as usual. I might be misinterpreting it, but I don't see any evictions - the growth is constant. And we do have quieter periods over the course of the day - so I assume the Owner, even if overloaded, should be able to perform some deletions at least?
Anyway, I'll share more detailed info if/when this happens again.
> I might be misinterpreting it, but I don't see any evictions - the growth is constant.
Just to rule out the obvious cause, are you inserting plain values, or are you using the `ConCache.Item` struct? In the former case, there's a gotcha with ttl, see https://github.com/sasa1977/con_cache/issues/73.
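For reference, the two insert styles I mean (cache name and ttl value are illustrative):

```elixir
# Plain value - expiry is driven by the cache-wide global_ttl:
ConCache.put(:my_cache, key, value)

# Explicit item - the ttl travels with the entry itself:
ConCache.put(:my_cache, key, %ConCache.Item{value: value, ttl: :timer.minutes(30)})
```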
> Let me know if you approve the attempt at 3ccb5d1
Looks good to me. Do you plan to make other changes, or are you done?
> Just to rule out the obvious cause, are you inserting plain values, or are you using the `ConCache.Item` struct? In the former case, there's a gotcha with ttl, see https://github.com/sasa1977/con_cache/issues/73.
I'm using plain values with `global_ttl`, and the other app nodes were expiring just as expected (all are identical).
> Looks good to me. Do you plan to make other changes, or are you done?
I'm done. Made a test integration branch on my end and all is well. Thanks for being so responsive on this.
Hey, thanks for merging. Any chance for a hexpm release?
published 1.1.0
Tentative first pass on #75.

Since `get` might effectively be called twice by `get_or_store`, we need a way to suppress emitting telemetry events on plain `get`. I'll add documentation upon implementation approval.
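Roughly what I have in mind (the `:emit_telemetry` option name is made up for this sketch, not the actual implementation):

```elixir
# The outer get_or_store silences the event on its own probing get, so a
# miss followed by a store isn't double-counted. :emit_telemetry is a
# hypothetical option name used only for illustration.
def get_or_store(cache_id, key, store_fun) do
  case get(cache_id, key, emit_telemetry: false) do
    nil -> isolated_get_or_store(cache_id, key, store_fun)
    value -> value
  end
end
```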