sneako / finch

Elixir HTTP client, focused on performance
MIT License

Remove a connection if it receives any messages while in the pool #273

Open keathley opened 1 month ago

keathley commented 1 month ago

Inspired by #272, I started looking into issues where idle connections are closed but not removed from the pool. Deep down I still feel like we need a packet capture to really suss out what's going on: even if the pool hadn't processed a message telling us the TCP socket was closed, we should get an :einval error when we attempt to put the connection into passive mode (at least in my testing on macOS).
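To make that passive-mode check concrete, here is a hedged sketch, assuming the pooled worker state is a Mint.HTTP connection; the module and function names are hypothetical, and the exact error reason varies by OS:

defmodule IdleCheck do
  # Try to flip an idle connection into passive mode. A healthy socket
  # accepts the mode change; a socket the peer has already closed should
  # error out (e.g. with reason :einval on macOS), in which case the
  # connection should be discarded rather than reused.
  def healthy?(conn) do
    case Mint.HTTP.set_mode(conn, :passive) do
      {:ok, _conn} -> true
      {:error, _conn, %Mint.TransportError{}} -> false
    end
  end
end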

That said, it occurred to me that a connection sitting in the pool should not be receiving data (since it has finished the requests it needed to make and is back in the pool). Thus we can treat any message intended for that connection as an error and remove it from the pool, as sketched below.
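A minimal sketch of how that idea could map onto NimblePool, which Finch's HTTP/1 pools are built on. This is illustrative rather than the actual PR diff: the module name is hypothetical, the worker state is assumed to be the connection itself, and the other required NimblePool callbacks (init_worker/1, handle_checkout/4, and so on) are omitted.

defmodule IdleRemovalPool do
  @behaviour NimblePool

  # NimblePool invokes handle_info/2 for each idle worker when the pool
  # process receives a message. An idle connection has no request in
  # flight, so any socket message aimed at it means the server sent data
  # or closed the socket while the connection sat in the pool; either
  # way the connection is no longer safe to reuse, so remove it.
  @impl NimblePool
  def handle_info(message, conn) do
    case message do
      {tag, _socket, _data} when tag in [:tcp, :ssl] -> {:remove, :closed}
      {tag, _socket} when tag in [:tcp_closed, :ssl_closed] -> {:remove, :closed}
      {tag, _socket, _reason} when tag in [:tcp_error, :ssl_error] -> {:remove, :closed}
      _other -> {:ok, conn}
    end
  end
end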

Let me know what you think @sneako or if I missed something here.

sneako commented 1 month ago

Hey @keathley ! Thank you so much for taking a look at this. That definitely makes a lot of sense to me and seems pretty obvious in retrospect.

I can deploy this PR to a production environment that makes extremely heavy use of Finch H1 pools and test out this hypothesis.

I'll report back with results ASAP!

sneako commented 1 month ago

I deployed this to a single node in a cluster where I see occasional :closed errors.

The good news is that I am not noticing any negative side effects of this change.

On the other hand, there doesn't seem to be any difference in the rate of :closed errors. Below are two charts, one showing the :closed error count for the entire cluster, and the other showing the error count just on the test node.

The vertical gray bar just after 13:00 marks when I deployed this change to the test node.

[charts omitted: :closed error counts for the full cluster and for the test node]

Now, each of these nodes was making > 7k Finch H1 requests per second during this test, so this :closed error rate is quite negligible for me. But I assume the error rates some users are experiencing are much higher, so maybe this change will have more of an impact for those users.

sneako commented 1 month ago

Definitely interested to see if this change helps anyone else who is seeing a higher rate of :closed errors. If anyone experiencing this issue could test out this PR and report back that would be greatly appreciated. As stated above, I can attest that no negative side effects were observed when I deployed this change to prod, so it should be a completely safe test to run for anyone interested.

elijahkim commented 1 month ago

Hi! Coming from https://github.com/wojtekmach/req/issues/312#issuecomment-2145270023. I tried running this branch in a preview environment: I made an API request, waited a bit, and made another one, which resulted in the error below. I'm going to obfuscate some stuff, but this is what I did.

Interactive Elixir (1.16.1) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> API.print("ID")
{:ok, <<37, 80,  ...>>}
iex(2)> NaiveDateTime.utc_now()
~N[2024-06-04 22:52:06.582967]
iex(3)> API.print("ID")

22:59:12.668 [error] retry: got exception, will retry in 1000ms, 3 attempts left

22:59:12.668 [error] ** (Req.TransportError) socket closed
{:ok, <<37, 80,  ...>>}
iex(4)> NaiveDateTime.utc_now()
~N[2024-06-04 22:59:30.035642]
iex(5)> API.print("ID")
{:ok, <<37, 80, ...>>}
iex(6)> NaiveDateTime.utc_now()
~N[2024-06-04 23:03:43.808728]
iex(7)> API.print("ID")

23:10:13.518 [error] retry: got exception, will retry in 1000ms, 3 attempts left

23:10:13.518 [error] ** (Req.TransportError) socket closed
{:ok, <<37, 80,  ...>>}
iex(8)> NaiveDateTime.utc_now()
~N[2024-06-04 23:10:21.185392]
iex(9)> API.print("ID")
{:ok, <<37, 80,  ...>>}
iex(10)> Hiive.Persona.Inquiries.print("ID")
{:ok, <<37, 80,  ...>>}

Subsequent calls made without waiting don't fail.
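For anyone trying to reproduce this outside the app, here is a hedged sketch: the "retry: got exception" lines above come from Req's retry step, so disabling retries surfaces the raw transport error instead. The URL and the sleep duration are illustrative; the relevant idle timeout depends on the server.

# First request checks a fresh connection into the pool.
{:ok, _resp} = Req.get("https://api.example.com/print", retry: false)

# Wait longer than the server's keep-alive/idle timeout so it closes the
# pooled socket on its side while the connection sits idle.
Process.sleep(:timer.minutes(6))

# With retries disabled, a stale pooled connection shows up directly as
# {:error, %Req.TransportError{reason: :closed}} instead of being retried.
Req.get("https://api.example.com/print", retry: false)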