sneako / finch

Elixir HTTP client, focused on performance
MIT License

Decrement in_use_connections when a connection is not reusable #267

Closed · pcorey closed 3 months ago

pcorey commented 4 months ago

Fixes #257

sneako commented 3 months ago

Thanks for looking at this @pcorey ! @oliveigah do you mind taking a quick look at this? I'm surprised this seems to fix the issue. To me it looks like this could cause the values to become negative, but maybe I'm not thinking this through enough :thinking:

oliveigah commented 3 months ago

> Thanks for looking at this @pcorey ! @oliveigah do you mind taking a quick look at this? I'm surprised this seems to fix the issue. To me it looks like this could cause the values to become negative, but maybe I'm not thinking this through enough :thinking:

Sure! Gonna take a look at this today after work. When I have more info I will post it here.

sneako commented 3 months ago

Thank you so much! No rush at all :pray:

oliveigah commented 3 months ago

After gathering some context I agree with @sneako! It's not clear to me that this change solves the problem; in fact, I did a test considering idle connection termination and this change breaks it. The test is in this gist: https://gist.github.com/oliveigah/cbe7fa1ecb711a6006af5dd05c6803e7

The assert that breaks is the third call to get_pool_status:

assert {:ok,
        [
          %PoolMetrics{
            pool_index: 1,
            pool_size: 10,
            available_connections: 2,
            in_use_connections: 8
          }
        ]} = Finch.get_pool_status(finch_name, shp)

The result with the proposed changes is:

{:ok, [%Finch.HTTP1.PoolMetrics{available_connections: 10, in_use_connections: 0, pool_index: 1, pool_size: 10}]}

This happens because when the connection is idle, every checkout the user makes results in 2 calls to PoolMetrics.maybe_add/2, one with -1 and one with 1. The metric is indeed temporarily negative, and becomes permanently negative after the checkin.

{:ok, [%Finch.HTTP1.PoolMetrics{available_connections: 18, in_use_connections: -8, pool_index: 1, pool_size: 10}]}
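
A minimal sketch of that accounting drift, using a bare :atomics counter to stand in for the pool metrics (this is not Finch's internal code):

# toy counter standing in for in_use_connections
ref = :atomics.new(1, signed: true)
add = fn delta -> :atomics.add(ref, 1, delta) end

# checkout of an idle connection, with the proposed change applied:
add.(-1) # extra decrement because the connection is considered not reusable
add.(1)  # regular "connection now in use" increment

# checkin:
add.(-1)

:atomics.get(ref, 1) # => -1, and it drifts further down on every such cycle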

In summary, I think this change does not address the real problem we might have here.

About the root cause of the problem, my thoughts are:

Gonna try to emulate some other scenarios to see if I can find any other possible root cause. @pcorey did this pool have a high throughput? I'm also suspecting some weird interaction between metrics and pool termination; can you check if you have pool termination events in this pool?

pcorey commented 3 months ago

Thanks for looking into this @oliveigah! To add to the confusion, my "fix", as you guessed, didn't actually work. I was conflating metrics from three different tenants. Two of those tenants are behaving normally; available_connections stays around 50 and in_use_connections stays around 0.

The third tenant is the one having issues. Here's a plot of its available_connections (blue) vs in_use_connections (orange) in the 24 hours after a fresh deployment with mainline finch (0.18.0):

[Screenshot 2024-03-18 at 6:07:36 AM]

This tenant's Finch pool does experience higher load than the other two. The current request pattern is ~1 hour of 30-50 requests/minute followed by ~20 minutes of 1-5 requests/minute:

[Screenshot 2024-03-18 at 6:16:02 AM]

> all of them are errors and might be noisy on your logs

I'm not seeing any strange or unexpected errors in my logs. The only thing that may possibly be related is that all three of my tenants get connection closed errors from the upstream host (%Mint.TransportError{reason: :closed}).

> can you check if you have pool termination events in this pool?

I haven't had any pool termination events ([:finch, :pool_max_idle_time_exceeded]) in the time range we're looking at.
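
For reference, a throwaway telemetry handler along these lines is enough to confirm whether that event ever fires (the handler id and output here are just illustrative):

:telemetry.attach(
  "tmp-finch-idle-termination-probe",
  [:finch, :pool_max_idle_time_exceeded],
  fn _event, measurements, metadata, _config ->
    # just dump whatever the event carries to the console
    IO.inspect({measurements, metadata}, label: "pool_max_idle_time_exceeded")
  end,
  nil
)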

oliveigah commented 3 months ago

Thanks for the details @pcorey!

The only problem I can see is some unexpected error in the Mint.HTTP.request/5 call in Finch.HTTP1.Conn: https://github.com/sneako/finch/blob/fb52686f9901ef7e1e04be03dd7b9ecb1549b686/lib/finch/http1/conn.ex#L115-L122

I've done a test that artificially raises when calling it, and I got the same behaviour you described! The test is here: https://gist.github.com/oliveigah/a7cf1c7931440ba6c647866fd3e99d74

The best way to solve this is to add a new callback to nimble_pool for when the request is canceled: https://github.com/dashbitco/nimble_pool/blob/deae8ffdb52ce5265a68299eea3e5074f1e8d5a7/lib/nimble_pool.ex#L443-L444.

I've implemented a naive version of this new callback locally and it solves the problem. Gonna open an issue on nimble_pool in order to gather some thoughts about this new callback.

Then we can use it here and see if it solves your problem! Is that ok with you, @sneako?
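
Just to make the idea concrete, here is a toy sketch of the shape I have in mind. This is not the real Finch or nimble_pool code; the callback name, its :queued / :checked_out context values and the :ok return are assumptions about what the new nimble_pool callback will look like:

# Hypothetical worker module: it keeps an in_use counter like the pool
# metrics discussed above and undoes the increment when the caller dies
# between checkout and checkin.
defmodule DemoPool do
  @behaviour NimblePool

  @impl NimblePool
  def init_pool(_arg) do
    # one signed atomic slot standing in for in_use_connections
    {:ok, %{in_use: :atomics.new(1, signed: true)}}
  end

  @impl NimblePool
  def init_worker(pool_state), do: {:ok, :fake_conn, pool_state}

  @impl NimblePool
  def handle_checkout(:checkout, _from, conn, pool_state) do
    :atomics.add(pool_state.in_use, 1, 1)
    {:ok, conn, conn, pool_state}
  end

  @impl NimblePool
  def handle_checkin(_client_state, _from, conn, pool_state) do
    :atomics.add(pool_state.in_use, 1, -1)
    {:ok, conn, pool_state}
  end

  # the new callback: the checkout was cancelled (e.g. the caller raised),
  # so no checkin will ever run; undo the accounting here instead of
  # leaking a permanently "in use" slot
  @impl NimblePool
  def handle_cancelled(:checked_out, pool_state) do
    :atomics.add(pool_state.in_use, 1, -1)
    :ok
  end

  def handle_cancelled(:queued, _pool_state), do: :ok
end

The point is simply that the pool gets a hook to undo its in-use accounting when the caller dies between checkout and checkin.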

oliveigah commented 3 months ago

The PR was merged on nimble_pool: https://github.com/dashbitco/nimble_pool/pull/44

Since the maintainer asked us to try this fix before releasing a new version of nimble_pool, I've implemented it on my fork https://github.com/oliveigah/finch, using nimble_pool from master instead of Hex.

All of my previously broken tests are fixed with the changes, but it would be awesome if you could test it on your system and report the results, @pcorey! Let me know if I can help with anything.

pcorey commented 3 months ago

@oliveigah I'm deploying with your fork now. I'll let you know how things look first thing Monday morning. Thanks for doing all of this!

pcorey commented 3 months ago

@oliveigah Everything is looking great in my metrics. It looks like your fix worked! Should I close this PR and you can submit your own?

[Screenshot 2024-03-25 at 9:36:35 AM]

oliveigah commented 3 months ago

Awesome @pcorey, thanks for the feedback!

Let's wait for a new nimble_pool release and then I can submit the PR with the handle_cancelled/2 implementation and the added tests.

oliveigah commented 3 months ago

I've opened the new PR with the implementation of the new callback and added tests. I think we can continue this conversation over there: https://github.com/sneako/finch/pull/268