PragTob opened this issue 4 months ago
Hi Toby! Thanks for the kind words and for starting this discussion. I agree we should try to provide some more information around this.
The way I think about my HTTP1 pool sizes is this:
For the best connection reuse, you want to use the lowest pool count possible.
The size isn't that important since connections are established lazily. Size is more of an upper bound to make sure you don't end up opening more connections than you want to. I usually set it to somewhere around 100.
When your application is highly concurrent, with a single pool you will start to see queue timeouts or message queue pile-ups in the pool process (mostly depending on how you configure pool_timeout), as the pool process is no longer able to hand out connections quickly enough. This is the point at which I start to increase the pool count, to relieve the pressure that high concurrency puts on the individual pool processes.
Unfortunately, I have not found a useful formula to calculate the optimal size & count, but it would make sense that you could apply Little's law here. I think you would probably also have to account for how long it takes to establish connections and the specific server's keep-alive configuration.
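For reference, this is roughly how those two knobs show up in a Finch config (MyApp.Finch and the exact numbers are just placeholders, not a recommendation):

```elixir
# In your application's supervision tree.
children = [
  {Finch,
   name: MyApp.Finch,
   pools: %{
     # :size caps connections per pool process (connections open lazily),
     # :count sets how many independent pool processes are started.
     default: [size: 100, count: 4]
   }}
]

Supervisor.start_link(children, strategy: :one_for_one)
```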
@sneako thanks a ton for your quick response! :green_heart:
OK, that's roughly what I gathered: more pools help when the management work may become too much for a single pool process. That's how I'd summarize it right now.
Funnily enough, I did bump size to 100 and count to 4 yesterday. I'm expecting 3x traffic today so will bump again, but probably only the count then :thinking:
I'll look into activating telemetry metrics in the coming week to report this more accurately and find a good config/equilibrium.
I might dive into the code to understand usage better. Do you have a high-level view of how the different pools are used for scheduling? Say I have 4 HTTP pools - which one is used for the next HTTP call? Does it schedule round-robin, or select the one with the fewest active connections? :thinking:
Again, thanks a ton!
You could try to use the new pool metrics feature to get an idea of how many connections you are actually using in each pool: https://hexdocs.pm/finch/0.18.0/Finch.HTTP1.PoolMetrics.html
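Something along these lines should work - going from the linked docs here, so please verify the exact option and function names against them:

```elixir
# Sketch based on the PoolMetrics docs linked above; names are from memory.
children = [
  {Finch,
   name: MyApp.Finch,
   pools: %{
     default: [size: 100, count: 4, start_pool_metrics?: true]
   }}
]

# Then at runtime, per destination host:
{:ok, metrics} = Finch.get_pool_status(MyApp.Finch, "https://example.com")
# `metrics` should be a list with one %Finch.HTTP1.PoolMetrics{} per pool,
# showing in-use vs available connections for each of them.
```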
Right now, the pool is chosen randomly.
I did implement round-robin pool selection a while back, but in my testing it did not perform better than the current strategy. Maybe it is worth revisiting at some point: https://github.com/sneako/finch/pull/45
Selecting the pool with the fewest active connections is also more feasible now, thanks to the pool metrics; however, I fear that the bookkeeping overhead might outweigh the benefits.
It is also totally possible that these smarter pool selection strategies might help more in certain cases, for example when connections are very expensive/slow to open. My tests did not cover that scenario.
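To illustrate the difference between the strategies (purely a sketch, not Finch's actual internals):

```elixir
# Illustrative only. Given `count` pool processes addressed by index 1..count:
count = 4

# Random selection (the current strategy mentioned above):
_random_index = :rand.uniform(count)

# Round-robin (roughly what PR #45 experimented with), using an atomic
# counter shared between callers:
counter = :atomics.new(1, [])
_round_robin_index = rem(:atomics.add_get(counter, 1, 1), count) + 1
```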
Yeah I was looking at the pool metrics to find out more and shall hopefully implement that next week :)
Thanks for the peek behind the curtain. As a Monte Carlo Tree Search enjoyer, I love randomness :grin: But I can see other strategies helping, especially if a system is constantly at or around capacity. For instance, if the chosen pool is too full you could redraw, or always pick a pool below a given utilization threshold (if one is available). But yeah, that's all overhead.
Or theoretically I guess, pools could be started dynamically once pools get too "full".
Well, that's all just me theorizing - I haven't built something like this yet. Thanks for your insight and work!
Ah yes "autoscaling" pool size & count is another feature on my wish list haha.
I just shared a google sheet with the results of the benchmarks mentioned, with the email listed on your profile. Let me know if I should share it with a different email.
These were performed with https://github.com/sneako/httpc_bench/tree/finch-only (possibly a different branch on this repo).
httpc_bench is quite old at this point and I believe would need some updates to run on modern OTP versions.
:wave: Hello there, and thank you for finch and all the work around it! :green_heart: I'd love some more information/documentation around the pool configuration options.
In particular my question is about pool size vs. pool count on HTTP1.
Like, what's the difference between having a pool size of 200 vs. a pool size of 50 and a pool count of 4 (which would also be 200 total connections)? It's unclear to me how to optimize/balance these, and I'm looking for some insight - ideally captured by updating the docs.
Looking at NimblePool, I can glean this from its listed downsides - meaning, the management of resources/checkouts may take too long if one pool is too big. Of course, it's hard to say what "too big" means.
I'd also be interested in knowing how a scenario with multiple pools is handled. Like, how is it decided which pool is used? Do we go round robin around the pools?
I think that information in the docs would be really appreciated, at least by me but also by others. Happy to try & write it myself - but for that I'd need to know about it :sweat_smile:
Background
I don't want or need you to solve my problem (for that I'd go to the Elixir Forum or Slack), but I thought it might add some color and context to what I'm looking for. Feel free to ignore :)
I'm running essentially an "intelligent" proxy and so make a good chunk of HTTP requests: roughly 180 requests/second. Requests take between 50 ms and ~2000 ms (a small number, around 10, time out after 15 seconds). Roughly a quarter take longer than 300 ms.
I'm using Finch via Req. At first, on this traffic, Finch (predictably) failed with:
a couple of hundred times. Using Little's Law I estimated a required pool size of ~72, so I configured the pool size to 80 and the count to 3 (over-provisioning quite a lot, as that should give me 240).
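For concreteness, here's the back-of-the-envelope Little's law calculation behind that ~72 (the 0.4 s average latency is an assumption inferred from the numbers above):

```elixir
# Little's law: concurrency = arrival rate x average latency.
arrival_rate = 180   # requests per second
avg_latency = 0.4    # seconds, assumed average across the 50 ms..2000 ms spread

concurrent_requests = arrival_rate * avg_latency
# => 72.0 connections in use on average; size 80 x count 3 = 240 leaves headroom
```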
I still got ~10 of the above errors (all at the same time) after that change, which surprised me, as I thought I had sufficiently over-provisioned the pools for the current traffic (I actually thought I had configured it safely for ~2x the current traffic).
FWIW this was happening on the tiniest of EC2 instances (1 CPU). CPU averaged ~35% (max 71%), and memory stayed below 25%.
I'll bump the numbers further, but I think a better understanding of how pool size, count, and scheduling work would help me. Also, yes, I'm working on getting telemetry metrics to see what's going wrong :sweat_smile:
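For what it's worth, the kind of telemetry handler I have in mind looks roughly like this (the `[:finch, :queue, :stop]` event and `:duration` measurement are from the Finch telemetry docs as I remember them, so worth double-checking; the handler id is arbitrary):

```elixir
# Rough sketch: log whenever waiting for a pool checkout takes too long.
:telemetry.attach(
  "finch-queue-time",
  [:finch, :queue, :stop],
  fn _event, %{duration: duration}, _metadata, _config ->
    # Time spent waiting to check a connection out of the pool
    ms = System.convert_time_unit(duration, :native, :millisecond)
    if ms > 50, do: IO.puts("slow pool checkout: #{ms}ms")
  end,
  nil
)
```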
Thank you
Thanks again! :green_heart: