steveklabnik / request_store

Per-request global storage for Rack.
https://github.com/steveklabnik/request_store
MIT License

Sporadic nil values from RequestStore after upgrade to 1.6.0 #96

Closed · mattbrictson closed this issue 4 months ago

mattbrictson commented 6 months ago

I recently upgraded a Rails app to request_store 1.6.0. Immediately after deploying, I started seeing sporadic errors in Sidekiq jobs that use RequestStore.

Occasionally (< 0.5% of jobs), a previous stored value will "disappear" and return nil.

RequestStore.store[:key] = value
# sometime later in the sidekiq job, inside an Active Record `after_save` callback...
RequestStore.store[:key] # => nil

I am using:

ruby 3.2.3 (2024-01-18 revision 52bb2ac0a6) +YJIT [x86_64-linux]
rails 7.1.3.2
request_store 1.6.0
request_store-sidekiq 0.1.0
sidekiq 7.2.2

I am 99% certain this is due to request_store 1.6.0, because rolling back to request_store 1.5.1 (with no other Gemfile.lock changes) makes the problem go away.

I realize this is a vague bug report and doesn't pinpoint a cause. If you have any troubleshooting tips I'd appreciate them. Thanks!

MattFenelon commented 6 months ago

It might be because of the change to use Fiber storage in https://github.com/steveklabnik/request_store/pull/90.

mattbrictson commented 6 months ago

Yes, it looks like #90 is the only significant change going from 1.5.1 to 1.6.0. However, my understanding is that Thread.current[] is already fiber-local, so I am having trouble wrapping my head around why a switch to Fiber[] would make a difference.

Perhaps it is because Fiber[] storage is shallow-copied to child fibers, so a fiber could be holding onto a copied reference to the Fiber[:request_store] hash for longer than it should?

My code never uses Fiber.new (nor do any of my gem dependencies, to my knowledge), but I am aware that Ruby internals apparently use fibers for things like lazy enumerators.
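
For example, here is a hypothetical sketch of how an internal fiber could drop a write under the new semantics (this assumes the enumerator's internal fiber inherits storage the same way a plain Fiber.new does, and that the store has not yet been touched in the parent fiber):

enum = Enumerator.new do |yielder|
  # first access creates Fiber[:request_store] inside the enumerator's own fiber
  RequestStore.store[:seen] = true
  yielder << 1
end

enum.next                 # external iteration runs the block in a child fiber
RequestStore.store[:seen] # => nil back in the parent; the child's storage is discarded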

I am still confused as to exactly what is going on here.

steveklabnik commented 6 months ago

Hrm. Yeah I have tried to keep this library very stable, so this is confusing to me.

Unfortunately I don't really write much Ruby any more, so I don't have any good tips for figuring out what's going on. If someone figures it out, very happy to take a patch.

MattFenelon commented 6 months ago

Yeah, it could be the shallow copying of Fiber storage to child Fibers. I believe that can cause an issue: if the request store isn't initialised in the parent Fiber and is first used in a child Fiber, the child Fiber's storage won't be copied back to the parent, so the value is lost after the Fiber completes. Is it possible you're setting something in the request store inside an iterator?

e.g.

Fiber.new do
  RequestStore.store[:test] = 1
end.resume

RequestStore.store[:test] # => nil

MattFenelon commented 6 months ago

I fixed a similar issue by calling RequestStore.store at the start of a request/job to initialise the parent Fiber's storage.
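
For a Sidekiq job, that could look something like this hypothetical server middleware (the class name is illustrative and not part of request_store or request_store-sidekiq):

class InitializeRequestStore
  def call(_worker, _job, _queue)
    RequestStore.store # touch the store so the job's parent fiber owns the hash
    yield
  ensure
    RequestStore.clear!
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add InitializeRequestStore
  end
end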

MattFenelon commented 6 months ago

On further thinking, you can ignore those comments. The previous use of Thread.current[] had the same behaviour because it is Fiber-local.

steveklabnik commented 6 months ago

I reached out on bluesky to see if anyone had ideas, and I got one reply:

Oh that's a fun one. Saw similar issues in CurrentAttributes usage during sidekiq tests, but it ended up being related to how rack middleware behaves (a sentence that haunts me still).

... given this is rack middleware, that sounds scary...

mattbrictson commented 6 months ago

OK, turns out this is pretty serious and also easy to reproduce.

In 1.6.0, the same underlying storage hash is being shared across threads. This can be confirmed by inspecting the object_id of the hash. You can see in this example that the main irb thread and the new thread both see the same hash with ID 15160.

irb(main):001> require "request_store"
=> true
irb(main):002> RequestStore::VERSION
=> "1.6.0"
irb(main):003> RequestStore.store.object_id
=> 15160
irb(main):004> Thread.new { RequestStore.store.object_id }.value
=> 15160

In version 1.5.1, each thread gets a distinct storage hash, as evidenced by the different object_id in the irb thread (38440) versus the new thread (44880):

irb(main):001> gem "request_store", "1.5.1"
=> true
irb(main):002> require "request_store"
=> true
irb(main):003> RequestStore::VERSION
=> "1.5.1"
irb(main):004> RequestStore.store.object_id
=> 38440
irb(main):005> Thread.new { RequestStore.store.object_id }.value
=> 44880

Unless I am misinterpreting this, it means that something is fundamentally broken with thread isolation of data in 1.6.0.

A more direct example that shows a value leaking across threads:

irb(main):001> require "request_store"
=> true
irb(main):002> RequestStore.store[:foo] = "bar"
=> "bar"
irb(main):003> Thread.new { RequestStore.store[:foo] }.value
=> "bar"

steveklabnik commented 6 months ago

Okay, I think this is good enough reason to revert that PR and cut a 1.7.0 with the old semantics. What do you think?

mattbrictson commented 6 months ago

In practice, RequestStore.clear! called by the Rails middleware (or in the case of Sidekiq, the https://github.com/madebylotus/request_store-sidekiq middleware) should ensure that distinct hashes are used for each thread.

However, I found that clear! is called only at the end of each request/job.

That means that when the Sidekiq process is first started and it creates its processor threads, the first jobs that they process will share the same request storage. I confirmed this by printing the thread ID, fiber ID, and request store object ID at the beginning of the job. This output shows that 2 different processor threads initially share the same request storage (object ID 20320):

#<Thread:0x0000000110411598@sidekiq.secondary/processor run> #<Fiber:0x0000000110392a40 (resumed)> 20320
#<Thread:0x0000000110415648@sidekiq.default/processor run> #<Fiber:0x00000001103962f8 (resumed)> 20320

After the first jobs are processed, the clear! call in the middleware ensures that the processor threads subsequently diverge and use different request storage.

So this bug seems to be limited to the first request/job, before the middleware does its clear! cleanup operation.

A possible fix would be to modify the middleware to call clear! before each request, in addition to after.
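
Roughly, as a simplified sketch (not the gem's actual middleware code, and the class name is illustrative):

class ClearBeforeAndAfter
  def initialize(app)
    @app = app
  end

  def call(env)
    RequestStore.clear! # proposed addition: reset any inherited storage up front
    @app.call(env)
  ensure
    RequestStore.clear! # existing behaviour: reset at the end of the request
  end
end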

mattbrictson commented 6 months ago

Okay, I think this is good enough reason to revert that PR and cut a 1.7.0 with the old semantics. What do you think?

The fact that request storage is inherited by newly created threads in 1.6.0 is a big and surprising change to me. The bug report in my original issue description could be fixed by tweaking the middleware as I mentioned, but it still stands that the inheritance behavior is quite different in 1.6.0. It feels like a 2.0 level of change. If this gem is in maintenance mode and trying to avoid big changes, then my suggestion would be to roll it back.

MattFenelon commented 6 months ago

I wasn't able to reproduce what you saw with sidekiq.

class TestJob < ApplicationJob
  def perform(**_other)
    puts "thread id: #{Thread.current.object_id} request_store id: #{RequestStore.store.object_id}"
  end
end

5.times { TestJob.perform_later }

outputs

thread id: 49000 request_store id: 49280
thread id: 49080 request_store id: 49340
thread id: 49100 request_store id: 49380
thread id: 48920 request_store id: 49420
thread id: 49060 request_store id: 49460

Is it possible you accessed the request_store in the main thread before running the jobs?

The RequestStore shouldn't be shared between threads if it hasn't been initialised in the main thread before the child threads are created. As far as I know, Rails (with Puma) and Sidekiq create threads for each request or job, so the likelihood of having a shared request store seems low to me.

Creating a thread after accessing the store, both threads see the same store object:

RequestStore.store.object_id # => 48460
Thread.new { puts RequestStore.store.object_id }.join
# 48460

Creating the thread before accessing the store, the stores are different objects:

Thread.new { puts RequestStore.store.object_id }.join
# 48440
RequestStore.store.object_id # => 48460

I'd actually like to put a vote in for the new behaviour because it enables some extra functionality that wasn't possible before #90 was merged. Fiber[] allows child threads or fibers within a single request or job to access the request store, which to my mind is in keeping with the idea that the store is accessible within a single request. Some examples of where this may be useful:

  1. Enumerator uses a Fiber internally for external iteration. With this change, the request store can be accessed within the iterator.
  2. Running code concurrently using fibers or threads within a single request/job. With this change, the child threads or fibers can access the request store. E.g. making multiple API requests asynchronously within a single controller action.

One thing I've had to work around to get that working is to initialise the request store before creating any child threads or fibers; otherwise the store isn't there to be copied into their fiber storage (see the sketch below).
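
A minimal sketch of that pattern under the 1.6.0 (Fiber[]-backed) behaviour, with an illustrative :request_id key:

RequestStore.store[:request_id] = "abc123" # initialises the store in the parent request/job

values = 3.times.map do
  # each child thread (or fiber) inherits a reference to the same storage hash
  Thread.new { RequestStore.store[:request_id] }
end.map(&:value)

values # => ["abc123", "abc123", "abc123"]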

mattbrictson commented 6 months ago

Is it possible you accessed the request_store in the main thread before running the jobs?

Yes, I am using the audited gem, which accesses the request store when Active Record model classes are eagerly loaded during app initialization. This is a pretty widely used gem, with 24M downloads.

I'd actually like to put a vote in for the new behaviour because it enables some extra functionality that wasn't possible before

I agree this new behavior is useful, but I think it warrants a 2.0 version bump, an upgrade guide, and a warning in the release notes. Right now, a simple mistake of accessing the request store "too early" (all too easy to do, especially if this happens inside gems without your control) can lead to leaking data across threads.

MattFenelon commented 6 months ago

Is it possible you accessed the request_store in the main thread before running the jobs?

Yes, I am using the audited gem, which accesses the request store when Active Record model classes are eagerly loaded during app initialization. This is a pretty widely used gem, with 24M downloads.

Ah ok. That goes against the idea of it being a request store but it is what it is 😅

I'd actually like to put a vote in for the new behaviour because it enables some extra functionality that wasn't possible before

I agree this new behavior is useful, but I think it warrants a 2.0 version bump, an upgrade guide, and a warning in the release notes. Right now, a simple mistake of accessing the request store "too early" (all too easy to do, especially if this happens inside gems without your control) can lead to leaking data across threads.

That would work.

smudge commented 4 months ago

Just got burned by this as well, following something similar to the "Usage with Sidekiq" recommendation in request_store's README. (Not with Sidekiq, but a similar background job use case where multithreading is involved, and where RequestStore had been enforcing global state isolation across job boundaries.)

The way request store state is inherited by child threads (and from the root/parent thread) has changed, which feels like a big enough breaking change to merit a 2.0.0 release... As such, I humbly request that y'all yank 1.6.0 and cut a 1.7.0 release with the prior implementation, per @steveklabnik's comment above:

Okay, I think this is good enough reason to revert that PR and cut a 1.7.0 with the old semantics. What do you think?

(Please let me know if I can help in any way! I'd be happy to rework my use case for a 2.0.0 Fiber-ready release! 🙂)

professor commented 4 months ago

Let's make life easier for @steveklabnik -- @smudge @MattFenelon @mattbrictson -- please review https://github.com/steveklabnik/request_store/pull/99 and add a comment on it if it looks good to you.

steveklabnik commented 4 months ago

Hi there folks! Welcome to my "I need to get to that" thing that I somehow have never gotten to. I want to ship this fix in the next few days at worst, though.

Basically, I have been terrified of releasing something that ends up introducing more problems. My initial thought was to just do what #99 is doing: revert the breakage. More confirmations that this fixes issues would make me more confident I'm not gonna cause more issues for you all, so if I hear back that stuff is good, I will merge and release promptly. And if not, well, it was what I had before, so at least any bugs will be the same as bugs before, heh.

Thank you and sorry again. The xz stuff happening around the same time as this happened was not exactly good for my mental health around this stuff, heh.

professor commented 4 months ago

@steveklabnik I'm biased -- with @smudge's positive review for #99 -- I say ship it.

I don't think you did anything wrong, you are trusting the community to help support this gem. It isn't always going to be perfect, and we fail forward.

Thank you for still being involved.

steveklabnik commented 4 months ago

Sounds good to me; given this is basically a revert, I am worried but not that worried.

And thanks for the kind words, and also the PRs, it does help quite a bit.

smudge commented 4 months ago

Seconding @professor! Appreciate everything you do @steveklabnik (as well as everything that this community does), especially when it comes to keeping 10+ year old rubygems like this alive for the countless apps that still rely on them. It's no small task.

ioquatix commented 4 months ago

It was probably a good call to revert if people are having major problems.

The correct solution is to delineate requests properly, e.g.

irb(main):007> Fiber[:request_store] = {x: 10}
=> {:x=>10}
irb(main):008* Fiber.new(storage: nil) do
irb(main):009*   puts Fiber.current.storage.inspect
irb(main):010> end.resume
nil
=> nil
irb(main):011* Fiber.new do
irb(main):012*   puts Fiber.current.storage.inspect
irb(main):013> end.resume
{:request_store=>{:x=>10}}
=> nil

Right now, the alternative (the original approach) is using thread locals or fiber locals, which can leak between requests if you use a thread pool and which have no reset mechanism. So, unfortunately, we are between a rock and a hard place - I'm of the opinion that Fiber storage is the correct path forward, but there will be challenges while we work through the technical debt that exists.

I'm working to address this in Puma: https://github.com/puma/puma/pull/3101 and similar changes would be required to other application servers, things which model discrete, isolated requests in a shared memory system.

ioquatix commented 4 months ago

However, I found that clear! is called only at the end of each request/job.

IMHO, I think it makes more sense to call clear! at the start of the request, and optionally at the end. I suppose it depends on what you expect the lifetime of the "request" to be. It's a tricky problem. Fiber.new(storage: nil) do ... is kind of equivalent to calling reset at the start.

Regarding Fiber storage inheritance, that's also a tricky design decision. However, without such a mechanism, it's extremely hard to guarantee consistency. For example, if you have a request come in, and you assign a request ID which you want to emit in all logs, then create a bunch of threads and/or fibers to do work, do you want them to log under the same request ID?
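
For example (a sketch with an illustrative key name), the request ID set at the top of the request stays visible in workers spawned for it:

Fiber[:request_id] = "req-123" # set once when the request is accepted

Thread.new do
  # a worker thread spawned for this request inherits the fiber storage
  puts "fetching data request_id=#{Fiber[:request_id]}" # prints request_id=req-123
end.join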

Just to provide a little background, we have:

GLOBALS = {...}                             # process-wide global state
Thread.attr_accessor :per_thread_variables  # custom per-thread attribute
Fiber.attr_accessor :per_fiber_variables    # custom per-fiber attribute

Thread.current[:x]                          # fiber-local, despite the name
Thread.current.thread_variable_set(:x, 1)   # truly thread-local

Fiber[:x] # "request" scoped, including any child fibers or threads.

One of the goals of Fiber storage is to provide a standard mechanism for handling per-request state, removing the need for bespoke, library-specific things like RequestStore.clear! (which is tricky to use, considering that things like the Sidekiq middleware need to exist, etc.).

steveklabnik commented 4 months ago

Glad you're working on making this sort of thing easy to do upstream; ideally this library wouldn't need to exist. I just wrote it to fix a bug I was seeing in Draper a decade ago, but having a true first-class concept sounds great :)