sasa1977 / con_cache

ets based key/value cache with row level isolated writes and ttl support
MIT License

Emit telemetry events for hit and miss #76

Closed · aerosol closed this 7 months ago

aerosol commented 7 months ago

Tentative first pass on #75

Since get might be effectively called twice by get_or_store, we need a way to suppress emitting telemetry events on plain get.
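For illustration, the shape I have in mind is roughly this fragment (the event names, the suppress option, and the local fetch/2 are placeholders, not the final implementation):

```elixir
# Placeholder sketch - the event names and the suppression option are
# illustrative, not the final shape of the change.
defp fetch_with_telemetry(cache, key, opts) do
  result = fetch(cache, key)

  unless opts[:suppress_telemetry] do
    case result do
      {:ok, _value} -> :telemetry.execute([:con_cache, :hit], %{}, %{cache: cache})
      :error -> :telemetry.execute([:con_cache, :miss], %{}, %{cache: cache})
    end
  end

  result
end
```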

I'll add documentation upon implementation approval.

aerosol commented 7 months ago

Thanks for the review, all sounds sensible. Pushed some changes already but some are pending still - I'll ping you when done.

aerosol commented 7 months ago

I think I'm having trouble understanding the separation of concerns you want me to achieve.

Remove all get functions from Operations. Support telemetry in Operations.fetch just like you did with get. Implement get & co in ConCache on top of fetch.

This I think I've done, but "implement get & co in ConCache" is a bit unclear. Whilst I understand that you can now do:

```elixir
# defmodule ConCache
@spec get(t, key) :: value
def get(cache_id, key) do
  case fetch(cache_id, key) do
    :error -> nil
    {:ok, value} -> value
  end
end
```

Now, by "& co" I assume you mean get_or_store? So does that also mean we should get rid of Operations.get_or_store, so that Operations exposes only isolated_get_or_store? That way the telemetry event emission would have to be done in ConCache - or am I completely missing the point? Sorry, it's been a long day 😵‍💫

sasa1977 commented 7 months ago

I assume you mean get_or_store?

Yeah, I was thinking about all versions of get (including the dirty ones). If we have fetch (& co) in place, then all versions of get could use those functions, and just convert the result to value | nil, right?
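For example, the dirty variant would follow the same pattern (dirty_fetch here is just assumed for illustration; the actual function names may end up different):

```elixir
# Illustrative only - assumes a dirty_fetch/2 counterpart in the same module.
@spec dirty_get(t, key) :: value | nil
def dirty_get(cache_id, key) do
  case dirty_fetch(cache_id, key) do
    :error -> nil
    {:ok, value} -> value
  end
end
```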

aerosol commented 7 months ago

Sorry, totally confused. What do we keep in Operations then?

If get_or_store is implemented in ConCache, using local fetch (that delegates to Operations), and dirty_get_or_store is also implemented in ConCache (the actual logic this time), what's left in Operations? Just value/1 and isolated/3? What about checking ets type validity?

That way Operations exposes a minimal low-level API for interacting with the cache which is sufficient to implement the full functionality in the ConCache module.

Basically, this is the boundary I'm having a hard time recognizing, I guess.

aerosol commented 7 months ago

Another concern with all this is that removing fetch_or_store from Operations will effectively be a breaking change. Is that fine?

sasa1977 commented 7 months ago

What do we keep in Operations then?

Now that you ask, and looking at the code, I see little value in having this module :-D

I wasn't asking you to do such a merge, but I was proposing a part of that change, which in hindsight was needless and caused this confusion and PR bloat.

So let's instead avoid these refactorings and just focus on getting telemetry in. I can unify these two modules once this is merged.

So I propose reverting the last two commits, and then just making sure that telemetry is emitted for the fetch operations.

OK?

Another concern with all this is that removing fetch_or_store from Operations will effectively be a breaking change. Is that fine?

Operations is not a part of the ConCache public API. It's explicitly marked with @moduledoc false, which makes it internal.

aerosol commented 7 months ago

So I propose reverting the last two commits, and then just making sure that telemetry is emitted for the fetch operations.

@sasa1977 done, sorry for the delay - life got in my way. Please hold off with merging until I confirm that this branch works as expected on my end.

BTW, not entirely related, but just wanted to mention it: there must be some bug in global TTL eviction. We've observed an uncontrolled size growth on one of the application nodes for a cache with the following settings:

```elixir
ttl_check_interval: :timer.seconds(1),
global_ttl: :timer.minutes(30)
```
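For context, this is roughly how the cache is wired into our supervision tree (the cache name here is illustrative):

```elixir
# Roughly our setup - the cache name is illustrative.
children = [
  {ConCache,
   [
     name: :sessions_cache,
     ttl_check_interval: :timer.seconds(1),
     global_ttl: :timer.minutes(30)
   ]}
]

Supervisor.start_link(children, strategy: :one_for_one)
```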


The app nodes are identical and the load balancer was verified to distribute requests evenly, so it must be something in the ConCache internals.

The growth continued indefinitely - one node had allocated around 68 GiB of memory, while the other two stayed in the 3 GiB ballpark. Unfortunately, I did not manage to inspect any specific process info before the node was restarted. I'll provide more information if/when this happens again.

aerosol commented 7 months ago

Please hold off with merging until I confirm that this branch works as expected on my end.

I guess what's not really ideal with this approach is that the telemetry arrives with %{cache: %ConCache{}} metadata and the cache_id is nowhere to be found, since it's effectively discarded at the ConCache module level. On the metrics collection end, we're mainly interested in being able to associate the stats with the cache_id.

Another slight inconvenience is that currently it's the Operations module that defines the telemetry event shape - via the Operations.telemetry_hit() and Operations.telemetry_miss() functions. As you've pointed out, @moduledoc false makes the API effectively private. I think it would be good to have those functions extracted to a separate module, such as ConCache.Telemetry - so that clients can subscribe to events without having to manually duplicate the event lists or worry about breaking changes later on.
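Roughly what I have in mind - the module and event names are only a strawman, not a final proposal:

```elixir
# Strawman sketch - module name and event names are illustrative only.
defmodule ConCache.Telemetry do
  @moduledoc "Single source of truth for the telemetry events ConCache emits."

  def hit, do: [:con_cache, :stats, :hit]
  def miss, do: [:con_cache, :stats, :miss]

  # Convenient for :telemetry.attach_many/4 on the consumer side.
  def events, do: [hit(), miss()]
end
```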

I recall Oban changing the telemetry event shape in a minor release at some point. That's what I'd like to avoid from the client perspective here - the telemetry is, in fact, an integration contract.

WDYT?

sasa1977 commented 7 months ago

I guess what's not really ideal with this approach is that the telemetry arrives with %{cache: %ConCache{}} metadata and the cache_id is nowhere to be found, since it's effectively discarded at the ConCache module level. On the metrics collection end, we're mainly interested in being able to associate the stats with the cache_id.

Yeah, good point! We should include cache id too, which I think can be done if we propagate the id to the lower-level operations.
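Then a consumer could key its stats on the cache id directly, something along these lines (the event names, the metadata shape, and the metrics call are illustrative):

```elixir
# Illustrative handler - event names, metadata shape and MyApp.Metrics
# are placeholders, not the actual ConCache contract.
:telemetry.attach_many(
  "con-cache-hit-miss",
  [[:con_cache, :stats, :hit], [:con_cache, :stats, :miss]],
  fn [:con_cache, :stats, outcome], _measurements, %{cache_id: cache_id}, _config ->
    MyApp.Metrics.increment([:cache, outcome], tags: %{cache_id: cache_id})
  end,
  nil
)
```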

I think it would be good to have those functions extracted to a separate module, such as ConCache.Telemetry - so that clients can subscribe to events without having to manually duplicate the event lists or worry about breaking changes later on.

You can do that as a part of this PR, to keep it focused. But my goal immediately after is to merge everything inside ConCache. There's just too little stuff here to justify the split, and the fact that we have this split now forces us to add more "micromodules".

there must be some bug in global TTL eviction. We've observed an uncontrolled size growth on one of the application nodes

This is worrying. A few possible explanations that come to mind:

  1. Normal behavior - ttl doesn't always guard against cache growth. It only ensures that items are purged some time after the last write (or read, if you turn on touch on read). If you frequently extend an item's ttl, the cache may grow indefinitely (see the sketch below).
  2. Owner overload - expiry is done by the owner process (https://github.com/sasa1977/con_cache/blob/1.0.0/lib/con_cache/owner.ex). If the cache size is large, the owner might not be able to keep up. How many items do you have in the cache?
  3. Some actual bug in the expiry logic. This is always a possibility, but it wouldn't be my first bet, given that the library has been used in production, and it hasn't changed much in the past 10 years.

1 or 2 would be my first suspects. Just glancing at the code, I see one possible issue: the owner process performs isolated (i.e. locked) deletes for every item that needs to be expired. In a large cache this might be too slow, causing the owner to lag increasingly.
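To illustrate point 1, something like this keeps a key alive indefinitely even though a global_ttl is configured (the cache name and timings are arbitrary):

```elixir
# Every write refreshes the item's expiry, so a frequently written key
# never ages out - only the read/write pattern bounds the cache size.
{:ok, _pid} =
  ConCache.start_link(
    name: :demo_cache,
    ttl_check_interval: :timer.seconds(1),
    global_ttl: :timer.seconds(5)
  )

for _ <- 1..100 do
  # Writes arrive more often than every 5 seconds, so the item's ttl is
  # extended before the owner process ever gets to purge it.
  ConCache.put(:demo_cache, :hot_key, :rand.uniform())
  Process.sleep(1_000)
end
```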

aerosol commented 7 months ago

Yeah, good point! We should include cache id too, which I think can be done if we propagate the id to the lower-level operations.

Let me know if you approve the attempt at https://github.com/sasa1977/con_cache/pull/76/commits/3ccb5d14565cb4ffb50e30ae78facb3afa1d41d1

There's just too little stuff here to justify the split, and the fact that we have this split now forces us to add more "micromodules".

Fair enough, thanks! :+1:

How many items do you have in the cache?


On a normal day it peaks close to 600K predictably - keys are {int(), int()} tuples, values are structs like:

```
%Plausible.ClickhouseSessionV2{
  __meta__: #Ecto.Schema.Metadata<:built, "sessions_v2">,
  hostname: "redacted.example.com",
  site_id: 0,
  user_id: 11111111111111,
  session_id: 11111111111111,
  start: ~N[2024-01-04 06:46:20],
  duration: 605,
  is_bounce: false,
  entry_page: "/",
  exit_page: "/",
  pageviews: 2,
  events: 2,
  sign: 1,
  "entry_meta.key": [],
  "entry_meta.value": [],
  utm_medium: nil,
  utm_source: nil,
  utm_campaign: nil,
  utm_content: nil,
  utm_term: nil,
  referrer: nil,
  referrer_source: nil,
  country_code: "XX",
  subdivision1_code: "XX-10",
  subdivision2_code: nil,
  city_geoname_id: 1111111,
  screen_size: "Desktop",
  operating_system: "Mac",
  operating_system_version: "10.15",
  browser: "Firefox",
  browser_version: "124.0",
  timestamp: ~N[2024-01-04 06:56:25],
  transferred_from: nil
}
```

On that unlucky node, over several days it peaked at 200M, despite the traffic being normal. I might be misinterpreting it, but I don't see any evictions - the growth is constant. And we do have quieter periods over the course of the day, so I assume the Owner, even if overloaded, should be able to perform at least some deletions?


Anyway, I'll share more detailed info if/when this happens again.

sasa1977 commented 7 months ago

I might be misinterpreting it, but I don't see any evictions - the growth is constant.

Just to rule out the obvious cause, are you inserting plain values, or are you using ConCache.Item struct? In the former case, there's a gotcha with ttl, see https://github.com/sasa1977/con_cache/issues/73.
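That is, the difference between these two forms, schematically (the ttl-related details of the gotcha are in the linked issue):

```elixir
# Plain value - relies entirely on the cache-wide global_ttl behavior.
ConCache.put(cache, key, value)

# Explicit item - the ttl is spelled out per entry.
ConCache.put(cache, key, %ConCache.Item{value: value, ttl: :timer.minutes(30)})
```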

sasa1977 commented 7 months ago

Let me know if you approve the attempt at 3ccb5d1

Looks good to me. Do you plan to make other changes, or are you done?

aerosol commented 7 months ago

Just to rule out the obvious cause, are you inserting plain values, or are you using ConCache.Item struct? In the former case, there's a gotcha with ttl, see https://github.com/sasa1977/con_cache/issues/73.

I'm using plain values with global_ttl, and the other app nodes were expiring entries just as expected (all nodes are identical).

aerosol commented 7 months ago

Looks good to me. Do you plan to make other changes, or are you done?

I'm done. Made a test integration branch on my end and all is well. Thanks for being so responsive on this.

aerosol commented 6 months ago

Hey, thanks for merging. Any chance for a hexpm release?

sasa1977 commented 6 months ago

published 1.1.0