Replication monitoring exception with Postgres flavor Amazon Auroroa

breiter commented 4 years ago

Summary:

PG::FeatureNotSupported exception is thrown occasionally when using Amazon Aurora as the backend database for Canvas. The problem is that the pg_current_xlog_location() and other system management functions are not provided in the Aurora flavor of Postgresql.

I checked against the master branch and this problem is not addressed. I do have a straightforward solution to detect that the replication monitoring functions are not supported and skip over the pause for replication code. I have provided a patch in the "additional notes" section of this issue.

PG::FeatureNotSupported: ERROR:  Function pg_current_xlog_location() is currently not supported for Aurora: SELECT pg_current_xlog_location() caused by ERROR:  Function pg_current_xlog_location() is currently not supported for Aurora

<Caused>/home/docker/.gem/ruby/2.4.0/gems/activerecord-5.2.3/lib/active_record/connection_adapters/postgresql_adapter.rb:611:in
`exec_params'/home/docker/.gem/ruby/2.4.0/gems/activerecord-5.2.3/lib/active_record/connection_adapters/postgresql_adapter.rb:611:in
`block (2 levels) in exec_no_cache'/home/docker/.gem/ruby/2.4.0/gems/activesupport-5.2.3/lib/active_support/dependencies/interlock.rb:48:in
`block in permit_concurrent_loads'/home/docker/.gem/ruby/2.4.0/gems/activesupport-5.2.3/lib/active_support/concurrency/share_lock.rb:187:in
`yield_shares'/home/docker/.gem/ruby/2.4.0/gems/activesupport-5.2.3/lib/active_support/dependencies/interlock.rb:47:in
`permit_concurrent_loads'/home/docker/.gem/ruby/2.4.0/gems/activerecord-5.2.3/lib/active_record/connection_adapters/postgresql_adapter.rb:610:in
`block in exec_no_cache'/home/docker/.gem/ruby/2.4.0/gems/activerecord-5.2.3/lib/active_record/connection_adapters/abstract_adapter.rb:581:in
`block (2 levels) in log'/usr/lib/ruby/2.4.0/monitor.rb:214:in
`mon_synchronize'/home/docker/.gem/ruby/2.4.0/gems/activerecord-5.2.3/lib/active_record/connection_adapters/abstract_adapter.rb:580:in
`block in log'/home/docker/.gem/ruby/2.4.0/gems/activesupport-5.2.3/lib/active_support/notifications/instrumenter.rb:23:in
`instrument'/usr/lib/ruby/2.4.0/delegate.rb:83:in
`method_missing'/home/docker/.gem/ruby/2.4.0/gems/switchman-1.14.10/lib/switchman/sharded_instrumenter.rb:20:in
`instrument'/home/docker/.gem/ruby/2.4.0/gems/activerecord-5.2.3/lib/active_record/connection_adapters/abstract_adapter.rb:571:in
`log'/home/docker/.gem/ruby/2.4.0/gems/switchman-1.14.10/lib/switchman/active_record/abstract_adapter.rb:37:in
`log'/home/docker/.gem/ruby/2.4.0/gems/activerecord-5.2.3/lib/active_record/connection_adapters/postgresql_adapter.rb:609:in
`exec_no_cache'/home/docker/.gem/ruby/2.4.0/gems/activerecord-5.2.3/lib/active_record/connection_adapters/postgresql_adapter.rb:596:in
`execute_and_clear'/home/docker/.gem/ruby/2.4.0/gems/activerecord-5.2.3/lib/active_record/connection_adapters/postgresql/database_statements.rb:81:in
`exec_query'/home/docker/.gem/ruby/2.4.0/gems/activerecord-5.2.3/lib/active_record/connection_adapters/abstract/database_statements.rb:478:in
`select'/home/docker/.gem/ruby/2.4.0/gems/activerecord-5.2.3/lib/active_record/connection_adapters/abstract/database_statements.rb:70:in
`select_all'/home/docker/.gem/ruby/2.4.0/gems/activerecord-5.2.3/lib/active_record/connection_adapters/abstract/query_cache.rb:104:in
`block in select_all'/home/docker/.gem/ruby/2.4.0/gems/switchman-1.14.10/lib/switchman/active_record/query_cache.rb:116:in
`block in cache_sql'/usr/lib/ruby/2.4.0/monitor.rb:214:in
`mon_synchronize'/home/docker/.gem/ruby/2.4.0/gems/switchman-1.14.10/lib/switchman/active_record/query_cache.rb:99:in
`cache_sql'/home/docker/.gem/ruby/2.4.0/gems/activerecord-5.2.3/lib/active_record/connection_adapters/abstract/query_cache.rb:104:in
`select_all'/home/docker/.gem/ruby/2.4.0/gems/activerecord-5.2.3/lib/active_record/connection_adapters/abstract/database_statements.rb:94:in
`select_rows'/home/docker/.gem/ruby/2.4.0/gems/activerecord-5.2.3/lib/active_record/connection_adapters/abstract/database_statements.rb:82:in
`select_value'/usr/src/app/config/initializers/active_record.rb:632:in
`block (2 levels) in current_xlog_location'/home/docker/.gem/ruby/2.4.0/gems/switchman-1.14.10/lib/switchman/shackles.rb:22:in
`activate'/home/docker/.gem/ruby/2.4.0/gems/switchman-inst-jobs-1.4.1/lib/switchman_inst_jobs/shackles.rb:8:in
`activate'/usr/src/app/config/initializers/active_record.rb:626:in
`block in current_xlog_location'/home/docker/.gem/ruby/2.4.0/gems/switchman-1.14.10/lib/switchman/database_server.rb:107:in
`unshackle'/usr/src/app/config/initializers/active_record.rb:625:in
`current_xlog_location'/usr/src/app/app/models/discussion_topic/materialized_view.rb:111:in
`materialized_view_json'/usr/src/app/app/models/discussion_topic/materialized_view.rb:55:in
`materialized_view_for'/usr/src/app/app/models/discussion_topic.rb:1583:in
`materialized_view'/usr/src/app/app/controllers/discussion_topics_api_controller.rb:132:in
`view'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_controller/metal/basic_implicit_render.rb:6:in
`send_action'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/abstract_controller/base.rb:194:in
`process_action'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_controller/metal/rendering.rb:30:in
`process_action'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/abstract_controller/callbacks.rb:42:in
`block in process_action'/home/docker/.gem/ruby/2.4.0/gems/activesupport-5.2.3/lib/active_support/callbacks.rb:109:in
`block in run_callbacks'/usr/src/app/app/controllers/application_controller.rb:546:in
`report_to_datadog'/home/docker/.gem/ruby/2.4.0/gems/activesupport-5.2.3/lib/active_support/callbacks.rb:118:in
`block in run_callbacks'/home/docker/.gem/ruby/2.4.0/gems/inst_statsd-2.1.6/lib/inst_statsd/statsd.rb:95:in
`batch'/usr/src/app/app/controllers/application_controller.rb:534:in
`batch_statsd'/home/docker/.gem/ruby/2.4.0/gems/activesupport-5.2.3/lib/active_support/callbacks.rb:118:in
`block in run_callbacks'/usr/src/app/lib/temp_cache.rb:28:in
`enable'/usr/src/app/app/controllers/application_controller.rb:530:in
`enable_request_cache'/home/docker/.gem/ruby/2.4.0/gems/activesupport-5.2.3/lib/active_support/callbacks.rb:118:in
`block in run_callbacks'/usr/src/app/app/controllers/application_controller.rb:524:in
`set_locale'/home/docker/.gem/ruby/2.4.0/gems/activesupport-5.2.3/lib/active_support/callbacks.rb:118:in
`block in run_callbacks'/home/docker/.gem/ruby/2.4.0/gems/activesupport-5.2.3/lib/active_support/callbacks.rb:136:in
`run_callbacks'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/abstract_controller/callbacks.rb:41:in
`process_action'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_controller/metal/rescue.rb:22:in
`process_action'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_controller/metal/instrumentation.rb:34:in
`block in process_action'/home/docker/.gem/ruby/2.4.0/gems/activesupport-5.2.3/lib/active_support/notifications.rb:168:in
`block in instrument'/home/docker/.gem/ruby/2.4.0/gems/activesupport-5.2.3/lib/active_support/notifications/instrumenter.rb:23:in
`instrument'/home/docker/.gem/ruby/2.4.0/gems/activesupport-5.2.3/lib/active_support/notifications.rb:168:in
`instrument'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_controller/metal/instrumentation.rb:32:in
`process_action'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_controller/metal/params_wrapper.rb:256:in
`process_action'/home/docker/.gem/ruby/2.4.0/gems/activerecord-5.2.3/lib/active_record/railties/controller_runtime.rb:24:in
`process_action'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/abstract_controller/base.rb:134:in
`process'/home/docker/.gem/ruby/2.4.0/gems/actionview-5.2.3/lib/action_view/rendering.rb:32:in
`process'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_controller/metal.rb:191:in
`dispatch'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_controller/metal.rb:252:in
`dispatch'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_dispatch/routing/route_set.rb:52:in
`dispatch'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_dispatch/routing/route_set.rb:34:in
`serve'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_dispatch/journey/router.rb:52:in
`block in serve'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_dispatch/journey/router.rb:35:in
`each'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_dispatch/journey/router.rb:35:in
`serve'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_dispatch/routing/route_set.rb:840:in
`call'/usr/src/app/gems/plugins/respondus_soap_endpoint/lib/respondus_soap_endpoint/middleware.rb:78:in
`call'/home/docker/.gem/ruby/2.4.0/gems/rack-2.2.2/lib/rack/deflater.rb:44:in
`call'/home/docker/.gem/ruby/2.4.0/gems/rack-2.2.2/lib/rack/chunked.rb:98:in
`call'/home/docker/.gem/ruby/2.4.0/gems/rack-2.2.2/lib/rack/tempfile_reaper.rb:15:in
`call'/home/docker/.gem/ruby/2.4.0/gems/rack-2.2.2/lib/rack/etag.rb:27:in
`call'/home/docker/.gem/ruby/2.4.0/gems/rack-2.2.2/lib/rack/conditional_get.rb:27:in
`call'/home/docker/.gem/ruby/2.4.0/gems/rack-2.2.2/lib/rack/head.rb:12:in
`call'/usr/src/app/app/middleware/request_throttle.rb:63:in
`block in call'/usr/src/app/app/middleware/request_throttle.rb:317:in
`reserve_capacity'/usr/src/app/app/middleware/request_throttle.rb:58:in
`call'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_dispatch/http/content_security_policy.rb:18:in
`call'/usr/src/app/app/middleware/request_context_session.rb:25:in
`call'/home/docker/.gem/ruby/2.4.0/gems/rack-2.2.2/lib/rack/session/abstract/id.rb:266:in
`context'/home/docker/.gem/ruby/2.4.0/gems/rack-2.2.2/lib/rack/session/abstract/id.rb:260:in
`call'/usr/src/app/app/middleware/load_account.rb:29:in
`call'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_dispatch/middleware/cookies.rb:670:in
`call'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_dispatch/middleware/callbacks.rb:28:in
`block in call'/home/docker/.gem/ruby/2.4.0/gems/activesupport-5.2.3/lib/active_support/callbacks.rb:98:in
`run_callbacks'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_dispatch/middleware/callbacks.rb:26:in
`call'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_dispatch/middleware/debug_exceptions.rb:61:in
`call'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_dispatch/middleware/show_exceptions.rb:33:in
`call'/home/docker/.gem/ruby/2.4.0/gems/railties-5.2.3/lib/rails/rack/logger.rb:38:in
`call_app'/home/docker/.gem/ruby/2.4.0/gems/railties-5.2.3/lib/rails/rack/logger.rb:28:in
`call'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_dispatch/middleware/remote_ip.rb:81:in
`call'/home/docker/.gem/ruby/2.4.0/gems/request_store-1.5.0/lib/request_store/middleware.rb:19:in
`call'/usr/src/app/app/middleware/request_context_generator.rb:49:in
`call'/home/docker/.gem/ruby/2.4.0/gems/rack-2.2.2/lib/rack/method_override.rb:24:in
`call'/usr/src/app/app/middleware/prevent_non_multipart_parse.rb:33:in
`call'/home/docker/.gem/ruby/2.4.0/gems/rack-2.2.2/lib/rack/runtime.rb:22:in
`call'/home/docker/.gem/ruby/2.4.0/gems/activesupport-5.2.3/lib/active_support/cache/strategy/local_cache_middleware.rb:29:in
`call'/home/docker/.gem/ruby/2.4.0/gems/actionpack-5.2.3/lib/action_dispatch/middleware/executor.rb:14:in
`call'/home/docker/.gem/ruby/2.4.0/gems/rack-2.2.2/lib/rack/sendfile.rb:110:in
`call'/home/docker/.gem/ruby/2.4.0/gems/railties-5.2.3/lib/rails/engine.rb:524:in
`call'/home/docker/.gem/ruby/2.4.0/gems/railties-5.2.3/lib/rails/railtie.rb:190:in
`public_send'/home/docker/.gem/ruby/2.4.0/gems/railties-5.2.3/lib/rails/railtie.rb:190:in
`method_missing'/usr/lib/ruby/vendor_ruby/phusion_passenger/rack/thread_handler_extension.rb:97:in
`process_request'/usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:149:in
`accept_and_process_next_request'/usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:110:in
`main_loop'/usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler.rb:415:in
`block (3 levels) in start_threads'/usr/lib/ruby/vendor_ruby/phusion_passenger/utils.rb:113:in
`block in create_thread_and_abort_on_exception'

Steps to reproduce:

API throws an exception for example when loading comments on a discussion.

Expected behavior:

Database queries should succeed and not throw exceptions.

Actual behavior:

Queries that use replication monitoring throw an exception PG:FeatureNotSupported.

Additional notes:

I have a solution for this problem to detect PG:FeatureNotSupported and then not use the replication monitoring functions:

diff --git a/config/initializers/active_record.rb b/config/initializers/active_record.rb
index 507bdac229..d9867f861c 100644
--- a/config/initializers/active_record.rb
+++ b/config/initializers/active_record.rb
@@ -621,15 +621,26 @@ class ActiveRecord::Base
     end
   end

+  def self.with_feature_support(default = nil)
+    begin
+      yield
+    rescue ActiveRecord::StatementInvalid => e
+      raise unless e.message.start_with?("PG::FeatureNotSupported:")
+      default
+    end
+  end
+
   def self.current_xlog_location
     Shard.current(shard_category).database_server.unshackle do
       Shackles.activate(:master) do
-        if Rails.env.test? ? self.in_transaction_in_test? : connection.open_transactions > 0
-          raise "don't run current_xlog_location in a transaction"
-        elsif connection.send(:postgresql_version) >= 100000
-          connection.select_value("SELECT pg_current_wal_lsn()")
-        else
-          connection.select_value("SELECT pg_current_xlog_location()")
+        with_feature_support do
+          if Rails.env.test? ? self.in_transaction_in_test? : connection.open_transactions > 0
+            raise "don't run current_xlog_location in a transaction"
+          elsif connection.send(:postgresql_version) >= 100000
+            connection.select_value("SELECT pg_current_wal_lsn()")
+          else
+            connection.select_value("SELECT pg_current_xlog_location()")
+          end
         end
       end
     end
@@ -638,20 +649,23 @@ class ActiveRecord::Base
   def self.wait_for_replication(start: nil, timeout: nil)
     return true unless Shackles.activate(:slave) { connection.readonly? }

-    start ||= current_xlog_location
-    Shackles.activate(:slave) do
-      diff_fn = connection.send(:postgresql_version) >= 100000 ?
-        "pg_wal_lsn_diff" :
-        "pg_xlog_location_diff"
-      fn = connection.send(:postgresql_version) >= 100000 ?
-        "pg_last_wal_replay_lsn()" :
-        "pg_last_xlog_replay_location()"
-      # positive == first value greater, negative == second value greater
-      # SELECT pg_xlog_location_diff(<START>, pg_last_xlog_replay_location())
-      start_time = Time.now
-      while connection.select_value("SELECT #{diff_fn}(#{connection.quote(start)}, #{fn})").to_i >= 0
-        return false if timeout && Time.now > start_time + timeout
-        sleep 0.1
+    loc = current_xlog_location
+    unless loc.nil?
+      start ||= loc
+      Shackles.activate(:slave) do
+        diff_fn = connection.send(:postgresql_version) >= 100000 ?
+          "pg_wal_lsn_diff" :
+          "pg_xlog_location_diff"
+        fn = connection.send(:postgresql_version) >= 100000 ?
+          "pg_last_wal_replay_lsn()" :
+          "pg_last_xlog_replay_location()"
+        # positive == first value greater, negative == second value greater
+        # SELECT pg_xlog_location_diff(<START>, pg_last_xlog_replay_location())
+        start_time = Time.now
+        while connection.select_value("SELECT #{diff_fn}(#{connection.quote(start)}, #{fn})").to_i >= 0
+          return false if timeout && Time.now > start_time + timeout
+          sleep 0.1
+        end
       end
     end
     true

ccutrer commented 4 years ago

can you convert this to a regular pull request please, so that the CLA can be put on file.

also, is there a way to detect aurora Postgres beforehand, to avoid the error? and barring that, cache that a particular feature is not available, so that you don't keep hitting the error over and over?

breiter commented 4 years ago

Hi Cody,

I created the pull request that you asked for.

As to an alternative implementation: I don't know how much wait_for_replication is used or how much overhead there is to discarding an exception in the aurora case.

There is a magic aurora_version() function in Aurora PostgreSQL which will return a version string. However it is not defined in pg_proc so select * from pg_proc where proname = 'aurora_version'; does not work to detect its existence. You could try to call select auorora_version and catch the exception but that would be the general case for non-aurora instances of pgsql. An alternative might be to test for the existence of the log management system functions. Something like:

aurora = connection.select_value("select count(*) from pg_proc where proname in ( 'pg_current_wal_lsn','pg_current_xlog_location')") == 0

unless aurora 
    if Rails.env.test? ? self.in_transaction_in_test? : connection.open_transactions > 0
        raise "don't run current_xlog_location in a transaction"
    elsif connection.send(:postgresql_version) >= 100000
        connection.select_value("SELECT pg_current_wal_lsn()")
    else
        connection.select_value("SELECT pg_current_xlog_location()")
    end
end
    nil

If you really want to optimize this the resulting bool could be cached in a static property so that it only has to be remoted once per canvas instance.

On the other hand, the approach of the with_feature_support method means that the replication monitoring would start working in aurora if Amazon decided to support those pgsql system functions.

I don't have a sense for how much optimization is worthwhile here. Does wait_for_replication get called a lot or just for a few things? Is there very much overhead in just catching the exception and then abandoning the wait? I mean as a practical matter, the whole point of this method seems to be to block returning results for some indefinite period which is at least some set of 100ms sleep periods until the slaves synchronize anyway.

breiter commented 3 years ago

This appears to work with Amazon Aurora provisioned instance with the rds.logical_replication parameter to 1. (This is not the default.)

However, setting this option is ignored in Aurora Serverless. One noticeable effect is that the ajax load of discussions fails. The json endpoint returns an error message that is not surfaced in the UI, but the error is logged in the database:

{"errors":[{"message":"An error occurred.","error_code":"internal_server_error"}],"error_report_id":102}

PG::ObjectNotInPrerequisiteState: 
ERROR:  wal_level must be set to 'logical'
HINT:  WAL control functions cannot be executed when wal_level < logical. 
caused by ERROR:  wal_level must be set to 'logical'
HINT:  WAL control functions cannot be executed when wal_level < logical.

ccutrer commented 3 years ago

can you do a SHOW wal_level on each of those databases? for local development, I see replica, and have no problems executing SELECT pg_current_wal_lsn(). in our production systems I'm getting logical, which is obviously working for me. I'm assuming for your one that you're getting an error, it will be minimal. I'm curious about the one with rds.logical_replication=1 - is the wal_level set to logical, but it still doesn't have the function?

breiter commented 3 years ago

Aurora Provisioned Instances pgsql engine 9.6:

select pgcurrent_wal_level();

function pg_current_wal_lsn() does not exist

show wal_level;

wal_level
replica

Aurora Serverles pgsql engine 10.14

select pgcurrent_wal_level();

wal_level must be set to 'logical'

show wal_level;

wal_level
replica

Aurora Provisioned Instances pgsql engine 10.14:

(with rds.logical_replication parameter=1)

select pgcurrent_wal_level();

pg_current_wal_lsn
0/50CE5B28

show wal_level;

wal_level
logical

ccutrer commented 3 years ago

Hmm, I'm not sure what to do then for the Aurora Serverless instance. It's definitely the case in "real" Postgres that you can call pg_current_wal_lsn when wal_level is set to replica.

breiter commented 3 years ago

And to be clear, you cannot set wal_level to anything other than replica in Aurora Serverless. The option is there but it is ignored and I spoke with someone in the RDS team who told me their internal documentation says it is currently not possible to set wal_level in Aurora Serverless.

LordRatte commented 2 years ago

Has anyone discovered a workaround we can use for this case?

breiter commented 2 years ago

It works with Aurora provisioned but not Serverless.

instructure / canvas-lms