graphiti-api / graphiti

Stylish Graph APIs
https://www.graphiti.dev/
MIT License

Add thread pool and concurrency_max_threads configuration option #470

Closed MattFenelon closed 3 months ago

MattFenelon commented 3 months ago

This option allows limiting the maximum number of resources that can be sideloaded concurrently. With a properly configured connection pool, this ensures that ActiveRecord's connection pool is not exhausted by the sideloading process.

The thread pool configuration is based on ActiveRecord's global_thread_pool_async_query_executor.

Closes #469

See: https://github.com/rails/rails/blob/94faed4dd4c915829afc2698e055a0265d9b5773/activerecord/lib/active_record.rb#L279-L287

MattFenelon commented 3 months ago

Is it possible the test failure is flaky? It doesn't look related to the changes.

jkeen commented 3 months ago

@MattFenelon yep, I re-ran it and the tests pass now.

I've run into some thread pool issues on the app I use graphiti with, but never traced it back to the sideloads… so this is an interesting find!

It looks like the new code sets the default at 4? I'm not seeing where that default was set before your change, and just want to ensure that this doesn't do anything unexpected for people who are running just fine. In your configuration you set the value at 1?

MattFenelon commented 3 months ago

> @MattFenelon yep, I re-ran it and the tests pass now.

Nice. Thank you!

> It looks like the new code sets the default at 4? I'm not seeing where that default was set before your change, and just want to ensure that this doesn't do anything unexpected for people who are running just fine. In your configuration you set the value at 1?

That's a little tricky to answer :) The short answer is that the code was using Concurrent::Promise and that uses concurrent-ruby's global io thread pool, which has a max size equal to Java's max int: https://github.com/ruby-concurrency/concurrent-ruby/blob/8b9b0da4a37585ce5eb71516aca55e93bde39115/lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb#L15.

The problem with the default is that people will likely see these threading issues pop up, so I think the default needs to be set lower. However, lowering the default means people may see less concurrency than before. I went with 4 as that's what Rails uses as its default pool size for async queries, but it's fairly arbitrary.

I have tested it configured to 1 and that works happily because the fallback option :caller_runs just runs the work on the calling thread. That also means no one should see errors from this change, because anything that goes beyond the queue/pool size will simply run on the calling thread. It's also why I didn't want to set the default to 1: concurrency would be limited to 1 thread / 1 sideload at a time.

Perhaps it would be worth updating the docs to tell people they need to size their db connection pool appropriately; otherwise, even with this change, they may see threading issues.

wdyt?

jkeen commented 3 months ago

@MattFenelon Thanks for the details.

> The problem with the default is that people will likely see these threading issues pop up so I think the default needs to be set lower. However, lowering the default will mean that people may see less concurrency happening than they were before. I went with 4 as that's what Rails uses as its default pool size for async queries. But it's fairly arbitrary.

But your theory is that if they're seeing those errors with the new explicit default set at 4, they'd have been seeing them with the implicit default being at 4 currently?

I only recently became a maintainer on this repo and don't yet have access to editing docs. @richmolj's time is totally consumed by a tsunami of life so I'm not sure when he'll get me plugged into that side of things. https://www.graphiti.dev/guides/concepts/resources#concurrency

If I'm understanding the issue correctly though: with concurrency = true (the default in Rails) and the new concurrency_pool_max_size option set to 1, is that effectively the same as having concurrency = false?

MattFenelon commented 3 months ago

> If I'm understanding the issue correctly though: with concurrency = true (the default in Rails) and the new concurrency_pool_max_size option set to 1, is that effectively the same as having concurrency = false?

Sorry, I could have been clearer. Setting it to 1 gives the pool a single thread to use for sideloading. The max queue size would be 1 * 4 = 4, so any sideloads above that will just be loaded on the calling thread.

From the concurrent-ruby thread pool docs (https://ruby-concurrency.github.io/concurrent-ruby/master/file.thread_pools.html):

> If you want to provide a maximum queue size, you may also consider the fallback_policy which defines what will happen if work is posted to a pool when the queue of waiting work has reached the maximum size and no new threads can be created. Available policies: ... caller_runs: The work will be executed in the thread of the caller, instead of being given to another thread in the pool.

> But your theory is that if they're seeing those errors with the new explicit default set at 4, they'd have been seeing them with the implicit default being at 4 currently?

That's it exactly. There's basically no safe default without configuring the database pool correctly. This just lowers the likelihood, as there'll only ever be 4 threads sideloading at once rather than DEFAULT_MAX_POOL_SIZE. But I think defaulting concurrency to false, while safer, would probably be a more surprising change at this point. What do you think?

We could set the default to 1, but even then there's still a chance the pool becomes exhausted. For example, with Puma set to 5 threads and the pool set to 5, if the system is processing 5 requests at the same time the connection pool is already exhausted, and sideloading any resources at that point will raise the error. 1 does lower the risk, at the expense of performance. I'm not sure what the right balance is.
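As a back-of-envelope illustration of the arithmetic above (this is an assumption about worst-case connection use, not official sizing guidance): if each request thread holds one connection and each of its sideload threads may check out another, the pool needs roughly:

```ruby
# Hypothetical worst-case sizing: every Puma thread is mid-request and
# every sideload thread has checked out its own connection.
puma_threads = 5
concurrency_max_threads = 4

required_pool = puma_threads * (1 + concurrency_max_threads)
```

With the values above that works out to 25 connections, which is why a default pool of 5 exhausts so easily once sideloads run on their own threads.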

jkeen commented 3 months ago

I think your logic is sound. I'm not sure what the best default is to hit that mark of best performance without issues. I feel a tiny bit out of my depth when it comes to thread-related issues, but one thing that jumps out at me: if this new setting is just for concurrency relating to sideloads, I think naming the configuration option to reflect that more clearly would be helpful. Graphiti.config.sideload_max_threads, perhaps?

MattFenelon commented 3 months ago

> I think naming the configuration option to more clearly reflect that would be helpful

Sounds good. I went with concurrency_max_threads just because the concurrency setting already says it's for sideloads, and I think it's nice that the settings sit next to each other alphabetically.

https://github.com/graphiti-api/graphiti/blob/f68b61ff09ec61ecf23acc5bc37d0accba14aeed/lib/graphiti/configuration.rb#L7-L9
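For reference, usage would look something like this hypothetical initializer (the option name matches the linked configuration.rb; the value of 4 is the default discussed above):

```ruby
# config/initializers/graphiti.rb (illustrative; file location is an
# assumption for a Rails app)
Graphiti.configure do |config|
  config.concurrency = true            # existing sideload concurrency flag
  config.concurrency_max_threads = 4   # new cap on sideload threads
end
```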

Thanks for the reviews on this!

MattFenelon commented 3 months ago

I just realised that the executor initialisation wasn't thread-safe so I've fixed that.

jkeen commented 3 months ago

@MattFenelon Oh didn't realize that re: the concurrency flag. Your naming makes perfect sense then—nice work, and good catch on the thread-safety issue.

I fixed a silly lint error only the CI robots care about, but the rest looks good to me and I'll cut a minor release with this change. Thanks for the contribution!

MattFenelon commented 3 months ago

> I fixed a silly lint error only the CI robots care about, but the rest looks good to me and I'll cut a minor release with this change. Thanks for the contribution!

Ha thanks. My editor really wants to auto format a lot of that code 😅

Looking forward to the release!

github-actions[bot] commented 3 months ago

:tada: This PR is included in version 1.6.0 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

jkeen commented 3 months ago

@MattFenelon I had to make a small change to the thread-safety logic (1.6.1) after running into some major lockups in production (as in, the web server would stop responding) with the change as it was. When trying it out locally, the change I made to how the mutex was set up made all the difference, but redeploying just now still causes a lock-up in production when concurrency is enabled, and I don't know why. I'm at the end of a long day here, but if you see this in the morning, see if anything jumps out at you?

MattFenelon commented 3 months ago

Right, ok. I think it might be because the mutex variable is not actually global but an instance variable on the class, and it needs to be shared globally to work properly. We could make it a constant or a class variable @@mutex. I just can't remember all of the inheritance gotchas with constants and class variables off the top of my head. I'm not at a computer right now, but I can raise a PR in a bit.

MattFenelon commented 3 months ago

@jkeen I've raised a new PR to fix the issue. I can't say for certain that it was what caused the issue you saw but it seems likely.

MattFenelon commented 3 months ago

Thanks again for the merge @jkeen. Apologies for the muck up there.

jkeen commented 3 months ago

@MattFenelon No problem at all, thanks for the quick turnaround!