superfly / fly-ruby

Ruby gem for handling requests within a Fly.io multiregion database setup
BSD 3-Clause "New" or "Revised" License
86 stars 6 forks source link

X-CSRF issue #11

Closed Matt-Yorkley closed 3 years ago

Matt-Yorkley commented 3 years ago

Hey @jsierles :wave:

I've been trying out some deployments with a Rails app and I've set up fly-ruby but I'm getting issues with X-CSRF errors on all requests that are redirected to the primary region:

ActionController::InvalidAuthenticityToken (ActionController::InvalidAuthenticityToken)

...which makes sense, right? :thinking:

I can see in the logs that the request is correctly redirected from the current region to the primary region (LHR -> IAD in this case).

Have you encountered this before?

Matt-Yorkley commented 3 years ago

Relevant backtrace:

2021-09-22T18:44:08.395986509Z app[bad4fff7] iad [info] F, [2021-09-22T18:44:08.395656 #515] FATAL -- : [c338be64-fa0a-4c4f-8dda-bae1e71b8a16]
2021-09-22T18:44:08.396009642Z app[bad4fff7] iad [info] [c338be64-fa0a-4c4f-8dda-bae1e71b8a16] ActionController::InvalidAuthenticityToken (ActionController::InvalidAuthenticityToken):
2021-09-22T18:44:08.396013540Z app[bad4fff7] iad [info] [c338be64-fa0a-4c4f-8dda-bae1e71b8a16]
2021-09-22T18:44:08.396016605Z app[bad4fff7] iad [info] [c338be64-fa0a-4c4f-8dda-bae1e71b8a16] actionpack (6.1.4.1) lib/action_controller/metal/request_forgery_protection.rb:211:in `handle_unverified_request'
2021-09-22T18:44:08.396018248Z app[bad4fff7] iad [info] [c338be64-fa0a-4c4f-8dda-bae1e71b8a16] actionpack (6.1.4.1) lib/action_controller/metal/request_forgery_protection.rb:243:in `handle_unverified_request'
2021-09-22T18:44:08.396022687Z app[bad4fff7] iad [info] [c338be64-fa0a-4c4f-8dda-bae1e71b8a16] devise (25124d5db77d) lib/devise/controllers/helpers.rb:255:in `handle_unverified_request'
2021-09-22T18:44:08.396024560Z app[bad4fff7] iad [info] [c338be64-fa0a-4c4f-8dda-bae1e71b8a16] actionpack (6.1.4.1) lib/action_controller/metal/request_forgery_protection.rb:238:in `verify_authenticity_token'
...
Matt-Yorkley commented 3 years ago

Since CSRF is session-dependent there's no way it can be valid when it's suddenly switched from LHR to IAD..? I guess there's two ways around this...

jsierles commented 3 years ago

Hey there! This should not be an issue with cookie-based sessions. Do you have SECRET_KEY_BASE set in Fly secrets? Also, are you using the default behavior of all non-GET requests being replayed?

Matt-Yorkley commented 3 years ago

Do you have SECRET_KEY_BASE set in Fly secrets?

Yep :+1:

cookie-based sessions

Aha, I'm not using cookies for sessions. Part of the reason is that I need access to sessions in Websockets, which I think don't play nice with cookies.

jsierles commented 3 years ago

I see. What are you using for a session store?

Matt-Yorkley commented 3 years ago

Redis, but currently isolated per-region.

jsierles commented 3 years ago

So cookie-based sessions should work with ActionCable.  See https://guides.rubyonrails.org/action_cable_overview.html#connection-setup. If you're doing websockets another way, I think the key is for the cookie to identify the user at connection time, before the connection is upgraded to websockets.

Redis-based sessions could work with global Redis, but they would add latency to every request. We have been experimenting with split a read/write Redis client but have not reached any conclusions yet.

Matt-Yorkley commented 3 years ago

Part of the issue with Websockets is they can be running bits of code that are outside of the traditional HTTP request cycle. I guess it's a similar thing for Jobs as well thought, right? What happens when a Job tries to write something to the database? There isn't a "request" there which can be replayed. :thinking:

I'm wondering if there's a lower-level option here in response to a PG::ReadOnlySqlTransaction. Would it be possible to directly retry the transaction on the primary database without going through HTTP..?

jsierles commented 3 years ago

Yes, so there are few things in play here. You can make requests to the primary database independently from anywhere, but they will be slower from secondary regions. Rails multiple database support can help with this setup.

You could avoid this problem with websockets by always connecting them to the primary region, and only use the secondary regions for standard GET requests.

For background jobs, the best solution is to place your job runners in the primary region only. This can be accomplished with Fly multiprocess support. We'll be working on more comprehensive guides on these topics.

jsierles commented 3 years ago

One other thing to consider - using region-local session stores could cause problems if users are sent to another region for any reason. This could happen if one region is failing, or someone gets on a plane to another region.

Matt-Yorkley commented 3 years ago

You can make requests to the primary database independently from anywhere, but they will be slower from secondary regions.

A cross-region database transaction would be significantly faster than a cross-region HTTP request that hits the entire stack though, right?

Rails multiple database support can help with this setup.

That's where I'll start looking next, thanks :heart:

Matt-Yorkley commented 3 years ago

Wowza! According to the edge guide I can just do this:

# app/config/database.yml
production:
  primary:
    <<: *default
    database: <%= ENV["DATABASE_NAME"] %>
    host: <%= "#{ENV['PRIMARY_REGION']}.#{ENV['DATABASE_HOST']}" %>
    port: 5432
  primary_replica:
    <<: *default
    database: <%= ENV["DATABASE_NAME"] %>
    host: <%= "#{ENV['FLY_REGION']}.#{ENV['DATABASE_HOST']}" %>
    port: 5433
    replica: true
    database_tasks: false
# app/models/application_model.rb
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  connects_to database: {
    writing: :primary,
    reading: :primary_replica
  }
end

:partying_face:

leastbad commented 3 years ago

Despite what might be presented in the ActionCable guide, there's no way for a websocket to write a secure (session) cookie. While we're experimenting with some [hacky, HTTP fetch request-based] workarounds, this is the primary reason StimulusReflex strongly recommends that people use Redis for sessions: if you attempt to write to the session, it looks like it worked, but the changes don't stick - and it's silent!

As to the blasphemous :wink: suggestion that we set up our apps to run all of our websockets through a single AZ:

  1. No!
  2. Our apps run almost entirely through WS
  3. The whole reason we're here is to run our WS in multiple AZs

The header-based mechanisms that Fly offers for request-based architectures are really fucking exciting but @Matt-Yorkley touches on an incredibly motivating point: whether you're working in StimulusReflex or LiveView, we're trying to kick off a reactive, server rendered revolution. This means we all need to spend time thinking about asyncronous processes that happen outside of the request/response cycle first... the same way tech press had a collective fever dream a decade ago about how everyone needed to start building "mobile first" apps.

Thing is, we aren't just demanding: we're willing to roll up our sleeves and help build infrastructure. For example, I am willing to add attributes to support multi-region to StimulusReflex, in the same way that SR is now aware of one session with multiple tabs.

I don't think people hopping around the globe is a constant, all-of-the-time ordeal. It's an important edge case to support, sure. People travel. I have a vague intention to write and publish a gem that will make Devise multi-AZ aware. It will allow people to change their "home" AZ, in the same way that many BitTorrent download client sites do. Work to be done, but cool, exciting work.

It seems like what we're ultimately looking for is what Redis describes as CRDB: https://redis.com/redis-enterprise/technology/active-active-geo-distribution/ - an active-active, multi-AZ solution. What's not-at-all clear is how much of that is possible for those of us not running Redis Enterprise. Update: @Matt-Yorkley just alerted me to the existence of KeyDB which appears to hit the spot? Optimism returns!

The lack of active/active without $$$ is the same stressor that has made us lose optimism for Postgres and start looking at Yugabyte, which at least appears to be wire compatible with PG and does profess "free" active/active replication.

Getting to the point where it's as easy to deploy KeyDB and Yugabyte in active/active multi-AZ on Fly should open a portal to a higher dimension, opportunity-wise. I'm pumped.

jsierles commented 3 years ago

A cross-region database transaction would be significantly faster than a cross-region HTTP request that hits the entire stack though, right?

This depends a lot on what your stack does. If your default behavior is to replay all POST requests - or a subset - in middleware, before hitting your app, it will be a lot faster since the request is replayed at the Fly router level, backhauled over the Fly network.

The problem with cross-region database transactions is that the latency is compounded by the size of the query and its results. Any non-trivial query will appear a lot slower.

However, this may also not be an issue for work you can put into the background - work that doesn't need to be synchronous. Like refreshing an auth token, or updated User#last_visiited_at. You could even perform this kind of work in a background thread - something Discourse does alongside its home-grown websocket message bus implementation.

jsierles commented 3 years ago

Getting to the point where it's as easy to deploy KeyDB and Yugabyte in active/active multi-AZ on Fly should open a portal to a higher dimension, opportunity-wise. I'm pumped. 

People are experimenting with Yugabyte. There's an interesting thread on our community forum about this.

KeyDB deploys on Fly with some minimal setup but we have not done extensive testing. This sounds like a good use case! I'll give it a spin and report back here.

Is there anything else specific to SR that's worth baking into this library?

leastbad commented 3 years ago

That is an excellent question which shall receive some quality shower time in the near future.

I mentioned the potential of adding [something like] a region_id attribute to the package of metadata that accompanies each Reflex action. That said, it's not clear if this is actually necessary as Reflex classes have access to the request which created the WS via the ActionCable Connection... which means that we should be able to pick up the Fly region header already, without needing to send it with each action.

I think it's more relevant to consider how Fly and AnyCable/Pro could play together in an optimized way. @palkan I think you've been talking to the Fly folks already, right?

ActionCable itself really is the weakest, most bottleneck-y part of the conversation.

jsierles commented 3 years ago

We have been discussing AnyCable internally today. We'd love to be able to support it out-of-the-box as a standard component to accompany Rails, and mostly have the tools to do it.

With multi-process apps, you can scale and place a second process group independently. So now, I believe we could run a single deployment with 3 processes: normal Rails, Anycable Go, and Anycable RPC, all sharing a Redis instance (keydb or otherwise). The only caveat now is that the websocket URL would have to run on a non-standard port, but we may introduce path-based routing down the line.

I'll give this a try and see how it goes!

jsierles commented 3 years ago

One issue that may come up here is that KeyDB pubsub messages may not arrive in order on distant replicas, and latency will vary. However, I think those who are close to each other should see consistent results. Does StimulusReflex deal with ordering?

leastbad commented 3 years ago

No, sir.

One for the documentation.

Matt-Yorkley commented 3 years ago

Looking through the source code of Rails' multiple database handling I think there's a hell of a lot of juice in there that's begging to be extracted. I think I'll try hacking in that direction...

jsierles commented 3 years ago

It's a thorny topic because the multiple database support is designed for spreading load across nearby replicas - not for querying across regions. So your region-local replica queries will run fast, but writes will slow everything down. Doing stuff like find_or_create_by will run across these two regions. But, I do believe in a future where you could ask this driver to make the writes async, ad-hoc, like:

Fly.async_query do
  Post.create!
end 

You can follow some of this logic in long-form here: https://fly.io/blog/run-ordinary-rails-apps-globally/

palkan commented 3 years ago

Hey everyone! I'm pretty new to Fly, but I'll try to do some guessing.

The only caveat now is that the websocket URL would have to run on a non-standard port, but we may introduce path-based routing down the line.

Can we use an HTTP Proxy (say, NGINX) as an entrypoint which would route requests into a Private Network? Like:

server {
  listen 8080;
  listen [::]:8080;

  server_name example.com;

  location /cable {
    proxy_pass https://anycable-go.internal;
    proxy_http_version 1.1;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header X-Forwarded-Ssl on;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
  }

  location / {
    proxy_pass https://rails.internal/;
    proxy_set_header X-Forwarded-Host $http_host;
  }
}
jsierles commented 3 years ago

Yes, this is possible, but it adds another layer to your stack. Right now, I have a demo working that runs the cable service on port 8443, which I think is good enough for proving this out for now.

jsierles commented 3 years ago

OK, this demo app is running OK at the moment. It just broadcasts some messages, but it works globally :) KeyDB is also working on a Raft-based consensus, but this should work today for applications that do not have consistency or ordering requirements.

App code with setup docs: https://github.com/superfly/anycable-rails KeyDB deployment with docs: https://github.com/fly-apps/keydb Deployed at: https://su-anycable-rails.fly.dev

Matt-Yorkley commented 3 years ago

If you're interested @jsierles, I've been using this example app for testing various things on Fly: https://github.com/mra-clubhouse/clubhouse alongside this slightly customised Redis setup (one instance per region): https://github.com/mra-clubhouse/fly-redis/tree/custom-redis

I ended up using a multi-database configuration with explicit connection-switching in the end, mostly due to the way StimulusReflex interacts with controllers for rendering (it's a bit unusual).

jsierles commented 3 years ago

Cool, I'll take a look at that. I think among these various approaches we can find a good balance and build some solid guides. Maybe we can continue the convo on https://community.fly.io?

jsierles commented 3 years ago

Just to come back around to the original topic - is CSRF still an issue for you because of region-local session stores?

Matt-Yorkley commented 3 years ago

Nope, it's working now :ok_hand: I stuck with region-local Redis for sessions but I dropped fly-ruby to avoid replaying the request in non-local regions. It feels like I've thrown the baby out with the bathwater to some extent, maybe I'll hack on it another time and see if there's a middle ground somewhere.

jsierles commented 3 years ago

OK, cool. It all depends on what your app is doing really, and I think for websocket-heavy situations, it's worth considering whether a similar read/write split could be done at the RPC level.

Also, a multimaster keydb might still be useful for the sessions (and for cache), since you would still get the local speed but perhaps deal with CSRF. I'll give this a try.

Matt-Yorkley commented 3 years ago

a multimaster keydb might still be useful for the sessions

That's definitely on my TODO list of things to play around with.

Matt-Yorkley commented 3 years ago

After going down the explicit connection-switching route I extracted this concern for some nice human-readable method names. It works nicely with around_action callbacks and slots into both Controller and Reflex classes: https://github.com/mra-clubhouse/clubhouse/commit/85f933aabe122691c4c44e7d57494913d7a53a64

With Websockets / ActionCable the regular HTTP verbs are irrelevant and it avoids the ActionController stack pretty much entirely, but then with StimulusReflex specifically; it dips back in to controllers and views to grab just enough context for rendering things... and at that point if the controller isn't already being explicit about which database it's connecting to, the Reflex ends up defaulting to the primary for both writes and reads.

jsierles commented 3 years ago

Nice! I'd love to pull that into the gem, and also automate the configuration of the read/write roles.

Matt-Yorkley commented 3 years ago

I'd love to pull that into the gem

Feel free to steal anything you like the look of :ok_hand:

I'm excited by the prospect of a multi-active database though (not just the cache), it would mean all this complexity could be removed from the application code, as these solutions wouldn't be needed in the first place...