pool telemetry: available workers

aymanosman commented 2 years ago

Correct me if I'm wrong, but it seems there is no way of querying finch for a pool's current state, for example, number of available workers.

:poolboy exposes this information via :poolboy.status/1. I know the pooling implementation differs significantly between the two libraries, so do you think it is possible to get the same sort of information from finch?

oliveigah commented 2 years ago

There are 2 pool implementations in finch one for http1 that is nimble pool and other for http2 made by finch.

Regarding your question, IMO it could be implemented in nimble_pool and exposed by finch for http1 pools. AFAIK there is no way to access this internal data from nimble_pool, but I need to check, if it exists there is only a matter of expose it on finch API.

For http2 usually the pool is simpler and I think this feature may not be needed.

Can you tell us more about what is your use case that requires to read the pool status @aymanosman? What are you trying to accomplish? I feel like you shouldn't need to have access to this internals and maybe we can solve this differently! 😄

aymanosman commented 2 years ago

What I am saying definitely only applies to the HTTP1 pool (as that is the only one we are using).

Because Finch implements the NimblePool behaviour, it has access to all of the "nimble pool state".

Playing around with the nimble pool that powers finch, I can try and guess a way of calculating the number of "available workers".

{:ok, _} = Finch.start_link(name: HTTP, pools: %{"https://httpbin.org" => [size: 5]})
{pool, _} = Finch.PoolManager.lookup_pool(HTTP, {:https, "httpbin.org", 443})

req = Finch.build(:get, "https://httpbin.org")

state0 = :sys.get_state(pool)
for _ <- 0..7 do spawn(fn -> Finch.request(req, HTTP) end) end
Process.sleep(1) # give the spawned processes a chance
state1 = :sys.get_state(pool)
# wait a while
state2 = :sys.get_state(pool)

stats = fn state ->
  %{available_workers: state.lazy + :queue.len(state.resources), queued_requests: :queue.len(state.queue)}
end

stats.(state0) # %{available_workers: 5, queued_requests: 0}
stats.(state1) # %{available_workers: 0, queued_requests: 3}
stats.(state2) # %{available_workers: 5, queued_requests: 0}

My guess is that state.lazy + :queue.len(state.resource) is the number I'm looking for. Maybe you are right that this is best exposed by nimble pool itself (in case the implementation changes).

aymanosman commented 2 years ago

I just realised I didn't answer your question.

We use this information to help us decide whether to scale up certain services.

oliveigah commented 2 years ago

This workaround using lookup_pool and get_state might work but as you mentioned is very fragile since you are coupling on the internals of the nimble pool.

Personally I've never used :sys.get_state beyond debugging so I don't know if there is some other side effect of using it in production.

Probably the best solution would be add some function on nimble_pool and use that in combination with finch's lookup_pool instead of :sys.get_state

Anyway I think this issue may be better suited in nimble_pool instead of finch, so we can get input from their maintainers. What you think?

aymanosman commented 2 years ago

Yea, the use of :sys.get_state was just to illustrate the point that the data looks to be easily available.

I do think that this is probably best dealt with in nimble_pool, as state.lazy and state.resources are private bits of state.

oliveigah commented 2 years ago

Great! I can work on this next week after fix my computer! 😅

If you want to open an issue on nimble pool would be good, so we can get input from theirs maintainers about the feature.

After it's done on nimble_pool we might expose the info through some finch API.

Makes sense to you @aymanosman?

Would be good to also gather some input from @sneako! What's your thoughts on this Nico?

sneako commented 2 years ago

Thanks for starting this discussion!

I tend to rely more on system level metrics like load and latency for making decisions around scaling, but I do think that exposing these pool stats could help users configure their pool sizes and counts so I am interested in exploring this further.

I don't see any reason to expose this data for HTTP1, but not HTTP2 despite the fact that the available stats may vary slightly. For HTTP1 pools we have a configurable size which does not apply to HTTP2, while HTTP2 pools (connections really), have a state (:connected, :disconnected, :connected_read_only, etc).

I could imagine this being implemented via the existing Telemetry events together with an ETS table for example, but it is probably better to go directly to the source of truth as has been suggested.

So we would need to update both NimblePool and Finch.HTTP2.Pool in similar fashions to expose this information.

Let's start thinking about API proposals for this, particularly what the data we return would look like. Poolboy just returns a 4 element tuple, but I think a struct could be nicer, what do you think?

iex> Finch.pool_status(url, MyFinch)
{:ok, struct_or_list_of_structs}

oliveigah commented 2 years ago

I don't see any reason to expose this data for HTTP1, but not HTTP2 despite the fact that the available stats may vary slightly.

I agree with you! Since we moved from the narrower case of available workers (which does not make much sense in HTTP2 AFAIK) to a broader case regarding not only available workers but the pool status as a whole in order to improve pool configuration tuning, I think it makes sense to implement it for both HTTP1 and HTTP2.

I could imagine this being implemented via the existing Telemetry events together with an ETS table for example, but it is probably better to go directly to the source of truth as has been suggested.

That could be another workaround that may work for you @aymanosman while the final version is not ready. 😄

Let's start thinking about API proposals for this, particularly what the data we return would look like. Poolboy just returns a 4 element tuple, but I think a struct could be nicer, what do you think?

I think we should have a different structs for HTTP1 and HTTP2 pools because as you mentioned the stats may vary between both implementations.

It is still not 100% clear to me what kind of information we must expose, maybe I could use some help understanding the use case behind this, as I am understanding now this pool_status function will be used mainly to identify performance bottlenecks that could be addressed by changing configuration options on the pool, is that correct?

ettomatic commented 2 years ago

maybe I could use some help understanding the use case behind this

Our application essentially proxies many origins and so we use multiple pools, it currently uses MachineGun/Poolboy.
We are using the pools data to track the health of our origins and in general to better understand the pressure on our proxy which usually experiences high traffic, including unexpected spikes against some of the origins. Metrics indicating that pools are close to exhaustion flip our internal circuit breaker. We also scale based on a number of metrics including pools saturation, we were also considering dynamically adapting the pool size to the current load. The ultimate goal is to isolate a subset of the traffic in case of overload. If Finch/Nimble Pool could provide similar instrumentation we could keep our current approach. Maybe there is another way you would advise to check for pool saturation which doesn't involve exposing internal details, but different options/advice is greatly appreciated.

Hope this can give you a bit more context, thanks for helping!

mattbaker commented 1 year ago

So we're in a similar boat, we really would like to know the state of the pool. We're focused entirely on HTTP1, no HTTP2 for us currently.

I could imagine this being implemented via the existing Telemetry events together with an ETS table for example, but it is probably better to go directly to the source of truth as has been suggested.

We've actually tried to do this and we're struggling a bit to get accurate numbers. I suspect this may boil down to using the wrong telemetry events to try and increment or decrement the number of "in use" connections in the connection pool.

@sneako if you were using existing telemetry events to try and track the free/used state of a pool which events would you use? I see our current implementation treats [:finch, :queue, :stop] as a check-out (-1 connection free in the pool) and [:finch, :connect, :stop] as the check-in (+1 connection free in the pool) but I have my doubts that this is what we should be doing 🤔

The main question we want to be able to answer is "right now what percentage of our pool is free vs occupied"? We're trying to tune pool sizes because we're reducing the # of running instances of our proxy, but fewer instances means more pressure on each instance's connection pool. Without a good sense of how our pool usage looks right now it's hard to know how much we should increase pool sizes by as we reduce running instances of our proxy elixir app.

If this belongs in an elixir forum post let me know :) But tl;dr — we would really benefit from a way to report the state of the pools to our New Relic instrumentation! We'd love to have dashboards, etc. reflecting current pool state.

mattbaker commented 1 year ago

Let's start thinking about API proposals for this, particularly what the data we return would look like. Poolboy just returns a 4 element tuple, but I think a struct could be nicer, what do you think?

Love the idea of a struct! Easy to grow over time if we decide more things should be in there.

Finch.pool_status(url, MyFinch) also seems reasonable to me :)

mattbaker commented 1 year ago

Another thought — as a stop-gap, telemetry events for checkin and checkout would make it easier to maintain a running total of connections in use (the ETS table idea). I could try to mock that up if it makes sense.

Obviously a snapshot of NimblePool's state would be the absolute ideal, but it looks like we may need a change in NimblePool as well as Finch both to support it?

Sorry for three comments in a row 😂 I promise it's my last!

aselder commented 1 year ago

this would also be really useful for us. let me know if there is anything I can do to help.

oliveigah commented 1 year ago

Sorry for the big delay! Gonna take a look on this problem this weekend. I think no change on the nimble_pool side will be nedded, because I think we can generated all metrics using existing callbacks.

I propose some struct like this:

%Finch.HTTP1.Pool.Metrics{
available_resources: 7, # how many connections are available on the pool
in_use_resources: 3, # how many connections have been checked out and did not checked in again yet
average_checkout_time: 200, # the average of how long it took to checkout a connection
max_checkout_time: 600, # the longest it took to checkout a connection
average_usage_time: 50, # the average of how long it took to return a connection to the pool after a checkout
max_usage_time: 500 # the longets it took to return a connection to the pool after a checkout
}

All time metrics will be measured in microseconds.

What you think about this metrics? @aselder @mattbaker @ettomatic @aymanosman @sneako

sneako commented 1 year ago

Sorry for the big delay! Gonna take a look on this problem this weekend. I think no change on the nimble_pool side will be nedded, because I think we can generated all metrics using existing callbacks.

I propose some struct like this:
%Finch.HTTP1.Pool.Metrics{

available_resources: 7, # how many connections are available on the pool

in_use_resources: 3, # how many connections have been checked out and did not checked in again yet

average_checkout_time: 200, # the average of how long it took to checkout a connection

max_checkout_time: 600, # the longest it took to checkout a connection

average_usage_time: 50, # the average of how long it took to return a connection to the pool after a checkout

max_usage_time: 500 # the longets it took to return a connection to the pool after a checkout

}
All time metrics will be measured in microseconds.

What you think about this metrics? @aselder @mattbaker @ettomatic @aymanosman @sneako

Thank you for looking in to this! I have mostly been relying on the queue duration metrics to estimate how busy/loaded a pool is, but something like this would be extremely useful.

Where do you propose this state would be stored?

oliveigah commented 1 year ago

Where do you propose this state would be stored?

Regarding performance in order to avoid ETS updates I was thinking to use atomic counters to increase values and calculate the measurements only when the pool_status is called.

That said, I think the reference to such atomics might be stored on :persistent_term on the pool startup. like this:

available_resources_ref = :atomics.new(1, [])
:atomics.put(available_resources_ref, 1, queue_size)
:persistent_term.put({Finch, finch_name, :available_resources},available_resources_ref)

and on then on the callbacks we can do something like:

# on checkout
available_resources_ref = :persistent_term.get({Finch, finch_name, :available_resources})
:atomics.sub(available_resources_ref, 1, 1)

# on checkin
available_resources_ref = :persistent_term.get({Finch, finch_name, :available_resources})
:atomics.add(available_resources_ref, 1, 1)

and on the new pool_status function:

available_resources_ref = :persistent_term.get({Finch, finch_name, :available_resources})
# other refs here
%{
available_resources: :atomics.get(available_resources_ref, 1),
in_use_resources: :atomics.get(in_use_resources_ref, 1),
average_checkout: :atomics.get(total_checkout_time_ref, 1) / :atomics.get(total_checkouts_qty_ref, 1)
}

The idea is this. But gonna see if it fits in at the code.

Also, gonna probally use only one atomics and use predefined indexes for each metric in order to call :persistent_term.get({Finch, finch_name, :available_resources}) only once and do all the atomic operations on different indexes.

What you think about this?

sneako commented 1 year ago

Atomics should be good, but don't we want metrics to be per pool rather than per Finch instance?

oliveigah commented 1 year ago

You are right! I think we should use shp tuple instead of finch_name. This solve it right?

sneako commented 1 year ago

Good idea! Finch name plus shp should be good, in case two different finch instances end up with pools for the same shp

sneako commented 1 year ago

Good idea! Finch name plus shp should be good, in case two different finch instances end up with pools for the same shp

oliveigah commented 1 year ago

Great! If you don't have any initial problem with the proposed implementation and interface. Gonna tackle it this weekend and hope to have a PR soon. (:

oliveigah commented 1 year ago

First draft already available @sneako! https://github.com/sneako/finch/pull/248

Please let me know if there is any major problem that you want to be addressed before I move to a more extensive documentation and HTTP2 implementation.

In the meantime gonna work on the pool termination callback on the nimble_pool side. We probally gonna need it here.

sneako / finch

pool telemetry: available workers #183