ActiveRecord reconnects without refreshing token

haines commented 2 years ago

If the reconnect! method is called after the original token has expired (due to a very long-running query), we don't generate a new token and therefore get an error:

PG::Error: FATAL:  PAM authentication failed for user

eshults35 commented 2 years ago

I also noticed that when running rails db:reset, the db:create step goes through fine, but then I am prompted for a password when it tries to reconnect (Ruby 2.7.1). Do you know offhand whether it uses this reconnect! method when doing so? I'm going to spend some time debugging this, as this is a feature we are interested in having.

haines commented 2 years ago

Hmm, interesting.

This sounds like a different issue to me. If it was due to token expiry, you'd see the PAM authentication failed for user error. Even if it did call reconnect! during the db:reset, that would only be a problem if it happened after holding a connection for more than 15 minutes - the connection would still have its password set to a token, and if it was less than 15 minutes since the token was generated then the token would still be valid.

If you're getting prompted for a password, that probably means the connection somehow doesn't have a password set at all, so perhaps the token generator wasn't even called. I'm not sure how that could happen (especially if the initial db:create works), so please do report back if you find anything out!

eshults35 commented 2 years ago

Will do. Just doing a db:create followed by db:migrate is fine. But I'm curious to see what the db:reset function does specifically and if the issue there may be observed in another use case.

eshults35 commented 2 years ago

Looks like your assumption was correct - likely due to an issue unrelated to this.

Just curious - do you know under what conditions reconnect! is ever called?

cswilliams commented 2 years ago

We've been using this gem for a few months now across about 100 postgres RDS instances and I'd say we run into auth issues in our rails app about once every few weeks. It's infrequent enough that it's been hard to track down and a restart of our rails app always fixes it. I've finally opened an AWS ticket today to try to get more information from them. But I'd be curious if there was any sample code that we could use to catch these errors and retry? It seems like when our rails app hits an IAM auth error, the entire rails app has to be restarted...

I'll also add that when we hit these auth errors, it's almost always when our app is restarting (after a new deploy) or the database is brand new. I am aware that smaller RDS instances like db.t3.micro can get overwhelmed and cause auth failures, however, we've seen auth errors on up to db.t3.large as well. At any rate, I'm hoping AWS may be able to shed more light on whether it's an issue in the library or their service.

haines commented 2 years ago

Hi @cswilliams - yeah, I've had a bunch of issues on AWS's end with underprovisioned database instances (https://github.com/haines/pg-aws_rds_iam/issues/248#issuecomment-901702886). If that's the problem, you should be able to see it as HTTP errors in the RDS error logs. That might also be the case if the database instance is brand new (perhaps the token verification service is not ready?).

To get it to recover without restarting the whole application, you might be able to rescue the PG::Error and then use Active Record's remove_connection to clean up the bad connection.
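A minimal sketch of the rescue-and-retry idea above, not tested against a real RDS instance. The names with_connection_retry and StubAuthError are hypothetical and exist only for illustration. In a Rails app the error class would presumably be PG::Error and the cleanup step something like ActiveRecord::Base.remove_connection followed by re-establishing the connection, but that wiring is an assumption rather than something verified in this thread.

```ruby
# Stand-in for PG::Error so the sketch runs without the pg gem (hypothetical).
class StubAuthError < StandardError; end

# Hypothetical helper: run the block, and if it raises error_class,
# call the cleanup step and try again, up to `attempts` times in total.
def with_connection_retry(error_class:, cleanup:, attempts: 2)
  tries = 0
  begin
    tries += 1
    yield
  rescue error_class
    # Give up and re-raise once the attempt budget is used up.
    raise if tries >= attempts
    cleanup.call # e.g. discard the bad connection so the next try opens a new one
    retry
  end
end

# Usage: the first call fails with an auth error, the second succeeds.
calls = 0
result = with_connection_retry(error_class: StubAuthError, cleanup: -> {}) do
  calls += 1
  raise StubAuthError, "PAM authentication failed" if calls == 1
  :ok
end
```

Capping the number of attempts matters here: if the failure comes from an overloaded instance rather than a stale token, retrying without a limit would only add load.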

eshults35 commented 2 years ago

I encountered this today in a staging environment, so I spent the evening debugging for a solution.

I was able to monkey patch the reconnect! method like so

module ActiveRecord
  module ConnectionAdapters
    class PostgreSQLAdapter < AbstractAdapter
      def reconnect!
        if @config.has_key?(:aws_rds_iam_auth_token_generator)
          disconnect!
          @connection = PG::connect(@connection_parameters)
        end
        @lock.synchronize do
          super
          @connection.reset
          configure_connection
        rescue PG::ConnectionBad
          connect
        end
      end
    end
  end
end

In testing manually with Rails and just forcing a reconnect! call, I was able to force a new connection and thus a new password.

However, I had issues when I tried to apply this by prepending a module instead of reopening the class. I also don't know whether it's necessary to clear the adapter cache if we re-establish the connection altogether. I'm also not clear on how to reproduce the issue in a production-like way, rather than just manually running reconnect! and watching for any fallout from forcing the connection update in that manner. I'm thinking of writing a script that executes pg_sleep() for a certain amount of time, immediately followed by a SELECT 1?
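For reference, a minimal sketch of how a prepended module wraps an existing method, using a stand-in class so it runs without Rails. StubAdapter and FreshConnectionOnReconnect are hypothetical names. With the real adapter the equivalent call would presumably be ActiveRecord::ConnectionAdapters::PostgreSQLAdapter.prepend(...), which is an assumption and has not been tried here.

```ruby
# Stand-in for the real adapter; only the shape of reconnect! matters here.
class StubAdapter
  attr_reader :events

  def initialize
    @events = []
  end

  def disconnect!
    @events << :disconnect
  end

  def reconnect!
    @events << :original_reconnect
  end
end

# Hypothetical patch: do extra work, then fall through to the original method.
module FreshConnectionOnReconnect
  def reconnect!
    disconnect!
    super # reaches StubAdapter#reconnect! because this module is prepended
  end
end

StubAdapter.prepend(FreshConnectionOnReconnect)

adapter = StubAdapter.new
adapter.reconnect!
adapter.events # the patch's disconnect is recorded before the original reconnect
```

Unlike reopening the class and redefining reconnect!, prepending leaves the original method in place, so super calls the adapter's own implementation rather than the parent class's.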

I also prepended ConnectionHandling in order to allow for Route53 support, if you are interested in me sharing.

eshults35 commented 2 years ago

Interestingly, our Bugsnag showed it failing in configure_connection, line 769:

PG::Error
 @connection.set_client_encoding(@config[:encoding])

It doesn't seem to be throwing PG::ConnectionBad in our case, which is the only error reconnect! already rescues. There seems to be a mix of PG::ConnectionBad and PG::Error rescues throughout the PostgreSQLAdapter class, so I'm thinking that this use case has exposed a blind spot. Running connect within the rescue block does the following:

def connect
  @connection = PG.connect(@connection_parameters)
  configure_connection
  add_pg_encoders
  add_pg_decoders
end

That PG.connect should invoke the auth injector, which would force a new connection with a fresh token. So perhaps the cleanest, least impactful fix is just adding PG::Error to the reconnect! rescue block, like so:

def reconnect!
  @lock.synchronize do
    super
    @connection.reset
    configure_connection
  rescue PG::ConnectionBad, PG::Error
    connect
  end
end
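A small illustration of why rescuing only PG::ConnectionBad misses this case: in the pg gem, PG::ConnectionBad is a subclass of PG::Error, so a rescue clause naming the subclass does not catch a plain PG::Error. The sketch uses stand-in classes (StubPG is hypothetical) so that it runs without the pg gem installed.

```ruby
# Stand-ins that mirror the pg gem's hierarchy: ConnectionBad < Error < StandardError.
module StubPG
  class Error < StandardError; end
  class ConnectionBad < Error; end
end

# Mirrors a rescue clause that only names the subclass.
def narrow_rescue(error)
  raise error
rescue StubPG::ConnectionBad
  :rescued
end

# Mirrors a rescue clause that names the parent class.
def broad_rescue(error)
  raise error
rescue StubPG::Error
  :rescued
end

narrow_rescue(StubPG::ConnectionBad.new) # rescued
broad_rescue(StubPG::ConnectionBad.new)  # rescued, the parent class covers the subclass
broad_rescue(StubPG::Error.new)          # rescued
# narrow_rescue(StubPG::Error.new) would propagate the error instead
```

Given that hierarchy, listing both classes in the rescue clause is harmless but redundant: rescuing the parent class alone covers both.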

I'm running a console through RubyMine and debugging and will test issuing a reconnect! 30 minutes or so after initially establishing an RDS IAM auth connection to see if I can reproduce that way to confirm the error being thrown as well as the viability of this fix.

eshults35 commented 2 years ago

So this morning I woke up in a state where I could reproduce the PG::Error on demand by issuing a reconnect.

It seems I was able to fix the issue by moving the rescue block within the do block. The rescue was ignored outside of the do block. After making this change and running reload!, reconnect! successfully hit the rescue block, ran connect, got a new password from AWS, and re-established the connection:

module ActiveRecord
  module ConnectionAdapters
    class PostgreSQLAdapter < AbstractAdapter
      def reconnect!
        @lock.synchronize do
          super
          @connection.reset
          configure_connection
          rescue PG::ConnectionBad, PG::Error
            connect
        end
      end
    end
  end
end

Is this actually a bug with rails? What is the point of the rescue block if it never gets hit considering the only code in reconnect! is all within that do block?

eshults35 commented 2 years ago

Just an update on this: it appears that what I described above regarding reconnect! is a bug in Rails 6. In Rails 7, the rescue block was moved within the do block so that a connect failure is actually caught. Rails 6 also seems to throw a PG::Error instead of a PG::ConnectionBad. I haven't tested Rails 7 to see if it properly throws PG::ConnectionBad. But for Rails 6, the fix is to monkey patch reconnect!, move the rescue block within the do block, and catch PG::Error. Then connect is properly called, which calls the prepended parse_args method and gets a new password. I have done this for our integration and have yet to run into this issue again.

I also found a way to reproduce this error in a Rails console. With an RDS IAM connection:

ActiveRecord::Base.connection.execute("select pg_sleep(6000)")

Walk away. Come back, then run:

ActiveRecord::Base.connection.reconnect!

With the reconnect! monkeypatch, I was also able to successfully go through the above test and get a new connection.

haines commented 2 years ago

I haven't been able to reproduce the error with pg 1.3 or 1.4 - both these versions seem to correctly throw PG::ConnectionBad when calling reset on a connection with a stale auth token, meaning Active Record rescues and creates a new connection.

@eshults35 could you please try again with one of these versions of pg (for 1.4 you'll also need to bump pg-aws_rds_iam to 0.4.0)?

haines commented 2 years ago

I'm going to close this now, but I'm happy to reopen if it can be reproduced.

forever-sumit commented 4 months ago

Normally, if Postgres stops for any reason, Rails tries to reconnect when a new request arrives. If the Postgres service has started again by then, Rails connects to the database using the cached connection settings (it doesn't read database.yml again).

With RDS and IAM authentication, the token is valid for 15 minutes. So if the Postgres service stops after 15 minutes and then restarts, Rails tries to connect to RDS using the cached connection information. At that point the token has expired, so Rails can't connect to RDS Postgres.

So I wanted to check: does this gem regenerate the token if the token has expired?

haines commented 4 months ago

Hi @forever-sumit, this gem injects a fresh auth token into the connection string every time a new connection is created, so yes, it should work just fine.
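To illustrate the difference between a token cached at boot and one generated per connection, here is a minimal sketch with a fake clock. Token, StaleTokenError and ConnectionFactory are hypothetical names; this is not the gem's actual implementation, only a model of the behaviour described above, using the 15-minute validity mentioned in this thread.

```ruby
TOKEN_TTL = 15 * 60 # seconds; token lifetime as stated in this thread

class StaleTokenError < StandardError; end

# Hypothetical token that remembers when it was issued.
Token = Struct.new(:issued_at) do
  def valid_at?(now)
    now - issued_at < TOKEN_TTL
  end
end

# Hypothetical connection factory: asks the generator for a token on every connect.
class ConnectionFactory
  def initialize(&token_generator)
    @token_generator = token_generator
  end

  def connect(now)
    token = @token_generator.call(now)
    raise StaleTokenError, "token expired" unless token.valid_at?(now)
    { connected_at: now, token: token }
  end
end

# Generating a token per connection keeps working long after boot...
fresh = ConnectionFactory.new { |now| Token.new(now) }
fresh.connect(0)
fresh.connect(3600)

# ...whereas reusing a token created once at boot fails after it expires.
boot_token = Token.new(0)
cached = ConnectionFactory.new { |_now| boot_token }
cached.connect(60)      # still within 15 minutes
# cached.connect(3600)  # would raise StaleTokenError
```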

forever-sumit commented 4 months ago

Thanks @haines, this gem solved my 15 days of headache.

haines / pg-aws_rds_iam

ActiveRecord reconnects without refreshing token #290