sneako / finch

Elixir HTTP client, focused on performance

Implement pool telemetry #248

Closed oliveigah closed 6 months ago

oliveigah commented 8 months ago

Second proposal to solve https://github.com/sneako/finch/issues/183

After gathering some feedback, I settled on these two structs for the metrics:

   %Finch.HTTP1.PoolMetrics{
     pool_index: 1,
     pool_size: 50,
     available_connections: 40,
     in_use_connections: 10
   }

   %Finch.HTTP2.PoolMetrics{
      pool_index: 1,
      in_flight_requests: 4
   }

The Finch.get_pool_status/2 interface also changed a bit. It now returns a list of metrics structs, one for each pool started, as defined by the count option. Like this:

Finch.get_pool_status(MyFinch, "https://somehost.com:4321")

{:ok,
 [
   %Finch.HTTP1.PoolMetrics{
     pool_index: 1,
     pool_size: 50,
     available_connections: 40,
     in_use_connections: 10
   },
   %Finch.HTTP1.PoolMetrics{
     pool_index: 2,
     pool_size: 50,
     available_connections: 40,
     in_use_connections: 10
   }
 ]}

I've also added some fields to the pool state (such as finch_name and pool_index) to make future changes easier.

oliveigah commented 8 months ago

I agree with all suggestions! I'm currently on a business trip, so I probably won't be able to work on this until Friday.

Regarding the averages and maxes, I think the potential margin of error is very low when we consider a large number of samples (which should be most cases).

The problem is when the measurements are taken without enough samples; in that case the margin of error may be more relevant.

One possible workaround would be to only serve these measurements once there are enough samples, say 100. In that scenario, even if the base metrics are slightly out of sync, the real impact on the average/max measurements should not be relevant.
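
Roughly the kind of guard I mean (a minimal sketch; the field names are just illustrative running counters, not anything Finch exposes):

    defmodule PoolMetricsGuardSketch do
      @min_samples 100

      # Not enough data yet: refuse to serve a potentially misleading average/max.
      def derived_metrics(%{samples: samples}) when samples < @min_samples do
        {:error, :not_enough_samples}
      end

      def derived_metrics(%{samples: samples, total_in_use: total, max_in_use: max}) do
        {:ok, %{avg_in_use_connections: total / samples, max_in_use_connections: max}}
      end
    end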

Anyway, I think it's better to move on as you suggested, with only available and in-use connections, and we can think more about this problem later.

Thanks for the feedback! (:

oliveigah commented 7 months ago

I'm working on the pool metrics implementation for HTTP2 and would like to know which metrics you think would be good to keep track of.

I'm thinking about in-flight requests instead of available connections because it makes more sense in a multiplexed scenario like HTTP2. Is there another metric you think would be good, @sneako?
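
For reference, a rough sketch of how an in-flight counter could be kept with Erlang's :atomics module (the wiring into the pool is omitted and the names are illustrative, not the actual implementation):

    # One lock-free counter slot for the pool's in-flight requests.
    ref = :atomics.new(1, signed: false)

    # On request start:
    :atomics.add(ref, 1, 1)

    # On request completion (success or failure):
    :atomics.sub(ref, 1, 1)

    # When building %Finch.HTTP2.PoolMetrics{}:
    in_flight = :atomics.get(ref, 1)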

oliveigah commented 7 months ago

Regarding finch configurations with a pool count greater than one: since the multiple pools are not exposed to the user in any way, I think it makes sense to aggregate metrics from multiple pools under a single counter and generate metrics as if there were only one pool.

An alternative would be to keep separate metrics for each started pool using some index strategy and just show them as a unified metric when the user calls Finch.get_pool_status (maybe even with an option to show them separately). This would enable different forms of load balancing, such as least_conn, when there is more than one pool, but the implementation may be harder.
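
Roughly what I mean by unifying at read time (the aggregate/1 helper below is hypothetical, assuming the HTTP1 struct shape from the first comment):

    defmodule AggregateSketch do
      # Collapses per-pool %Finch.HTTP1.PoolMetrics{} structs into one view.
      def aggregate(pool_metrics) do
        Enum.reduce(
          pool_metrics,
          %{pool_count: 0, available_connections: 0, in_use_connections: 0},
          fn m, acc ->
            %{
              pool_count: acc.pool_count + 1,
              available_connections: acc.available_connections + m.available_connections,
              in_use_connections: acc.in_use_connections + m.in_use_connections
            }
          end
        )
      end
    end

    {:ok, metrics} = Finch.get_pool_status(MyFinch, "https://somehost.com:4321")
    AggregateSketch.aggregate(metrics)
    #=> %{pool_count: 2, available_connections: 80, in_use_connections: 20}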

What do you think? I'm good with both. Just let me know the approach that better suits your vision for the library, @sneako! (:

sneako commented 7 months ago

Thanks @oliveigah !

I agree, in-flight requests make more sense for h2 :+1:

I think individual pool metrics will be more useful than only exposing the aggregates. I definitely like the idea of enabling new load balancing strategies. Individual metrics will also be more useful when users are configuring their pool sizes & counts. Any aggregation, I assume, will take place somewhere else in the observability stack (i.e. Datadog or similar).

oliveigah commented 7 months ago

I think the implementation is done. Gonna start to work on the docs now.

In summary:

Let me know what you think @sneako. (:

oliveigah commented 7 months ago
  • No more persistent_term call is needed when adding to the metrics; the atomics references are stored inside the pool state

I've found a bug with this implementation. When the dynamic supervisor restarts the pool, the PoolManager references are not updated, leading to unexpected behaviour.

Gonna work on a fix for this; I'll probably need to store the references in a persistent_term and read them N times when the user calls Finch.get_pool_status/2.

To add and subtract from the atomics, I still think no persistent_term call will be needed, but Finch.get_pool_status/2 will need to execute N persistent_term calls instead of just one.

Maybe an ETS table would be a better fit for what Finch.get_pool_status/2 needs. Since its use case is far less performance sensitive than the add and subtract, it may work. Gonna explore it a bit.
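
Roughly the lookup pattern I have in mind (the key shape and helper names are guesses, not the final code): writes keep hitting the atomics ref held in the pool state, while reads resolve one reference per pool via persistent_term (or, alternatively, an ETS table):

    defmodule MetricsLookupSketch do
      # Called from each pool's init, so a restarted pool republishes its ref
      # and readers never hold a stale reference.
      def register(finch_name, shp, pool_index, atomics_ref) do
        :persistent_term.put({__MODULE__, finch_name, shp, pool_index}, atomics_ref)
      end

      # Called from Finch.get_pool_status/2: one persistent_term read per pool.
      def fetch_all(finch_name, shp, pool_count) do
        for pool_index <- 1..pool_count do
          :persistent_term.get({__MODULE__, finch_name, shp, pool_index})
        end
      end
    end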

oliveigah commented 7 months ago

I've found a bug with this implementation. When the dynamic supervisor restarts the pool, the PoolManager references are not updated, leading to unexpected behaviour.

Solved!

I think the PR is ready for review now! I've made some changes to the test functions to address problems with flaky tests.

Mainly, I now check for available ports dynamically and reuse the same server for all tests.
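
For context, a common way to pick a free port dynamically with :gen_tcp looks like this (a simplified sketch, not necessarily what the PR's helper does):

    defmodule FreePortSketch do
      # Bind to port 0 so the OS assigns a free port, read it back, then
      # release the socket so the test server can bind to that port.
      def get_free_port do
        {:ok, socket} = :gen_tcp.listen(0, [])
        {:ok, port} = :inet.port(socket)
        :ok = :gen_tcp.close(socket)
        port
      end
    end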

Let me know your thoughts! (:

oliveigah commented 6 months ago

Done @sneako! (:

Let me know if you see any other improvement to the proposed implementation.