Closed breiter closed 3 years ago
can you convert this to a regular pull request please, so that the CLA can be put on file.
also, is there a way to detect aurora Postgres beforehand, to avoid the error? and barring that, cache that a particular feature is not available, so that you don't keep hitting the error over and over?
Hi Cody,
I created the pull request that you asked for.
As to an alternative implementation: I don't know how much wait_for_replication
is used or how much overhead there is to discarding an exception in the aurora case.
There is a magic aurora_version()
function in Aurora PostgreSQL which will return a version string. However it is not defined in pg_proc
so select * from pg_proc where proname = 'aurora_version';
does not work to detect its existence. You could try to call select auorora_version
and catch the exception but that would be the general case for non-aurora instances of pgsql.
An alternative might be to test for the existence of the log management system functions. Something like:
aurora = connection.select_value("select count(*) from pg_proc where proname in ( 'pg_current_wal_lsn','pg_current_xlog_location')") == 0
unless aurora
if Rails.env.test? ? self.in_transaction_in_test? : connection.open_transactions > 0
raise "don't run current_xlog_location in a transaction"
elsif connection.send(:postgresql_version) >= 100000
connection.select_value("SELECT pg_current_wal_lsn()")
else
connection.select_value("SELECT pg_current_xlog_location()")
end
end
nil
If you really want to optimize this the resulting bool could be cached in a static property so that it only has to be remoted once per canvas instance.
On the other hand, the approach of the with_feature_support
method means that the replication monitoring would start working in aurora if Amazon decided to support those pgsql system functions.
I don't have a sense for how much optimization is worthwhile here. Does wait_for_replication get called a lot or just for a few things? Is there very much overhead in just catching the exception and then abandoning the wait? I mean as a practical matter, the whole point of this method seems to be to block returning results for some indefinite period which is at least some set of 100ms sleep periods until the slaves synchronize anyway.
This appears to work with Amazon Aurora provisioned instance with the rds.logical_replication parameter to 1. (This is not the default.)
However, setting this option is ignored in Aurora Serverless. One noticeable effect is that the ajax load of discussions fails. The json endpoint returns an error message that is not surfaced in the UI, but the error is logged in the database:
{"errors":[{"message":"An error occurred.","error_code":"internal_server_error"}],"error_report_id":102}
PG::ObjectNotInPrerequisiteState:
ERROR: wal_level must be set to 'logical'
HINT: WAL control functions cannot be executed when wal_level < logical.
caused by ERROR: wal_level must be set to 'logical'
HINT: WAL control functions cannot be executed when wal_level < logical.
can you do a SHOW wal_level
on each of those databases? for local development, I see replica
, and have no problems executing SELECT pg_current_wal_lsn()
. in our production systems I'm getting logical
, which is obviously working for me. I'm assuming for your one that you're getting an error, it will be minimal
. I'm curious about the one with rds.logical_replication=1
- is the wal_level
set to logical
, but it still doesn't have the function?
select pgcurrent_wal_level();
function pg_current_wal_lsn() does not exist
show wal_level;
wal_level
replica
select pgcurrent_wal_level();
wal_level must be set to 'logical'
show wal_level;
wal_level
replica
(with rds.logical_replication parameter=1
)
select pgcurrent_wal_level();
pg_current_wal_lsn
0/50CE5B28
show wal_level;
wal_level
logical
Hmm, I'm not sure what to do then for the Aurora Serverless instance. It's definitely the case in "real" Postgres that you can call pg_current_wal_lsn when wal_level is set to replica.
And to be clear, you cannot set wal_level to anything other than replica in Aurora Serverless. The option is there but it is ignored and I spoke with someone in the RDS team who told me their internal documentation says it is currently not possible to set wal_level in Aurora Serverless.
Has anyone discovered a workaround we can use for this case?
It works with Aurora provisioned but not Serverless.
Summary:
PG::FeatureNotSupported exception is thrown occasionally when using Amazon Aurora as the backend database for Canvas. The problem is that the pg_current_xlog_location() and other system management functions are not provided in the Aurora flavor of Postgresql.
I checked against the master branch and this problem is not addressed. I do have a straightforward solution to detect that the replication monitoring functions are not supported and skip over the pause for replication code. I have provided a patch in the "additional notes" section of this issue.
Steps to reproduce:
API throws an exception for example when loading comments on a discussion.
Expected behavior:
Database queries should succeed and not throw exceptions.
Actual behavior:
Queries that use replication monitoring throw an exception PG:FeatureNotSupported.
Additional notes:
I have a solution for this problem to detect PG:FeatureNotSupported and then not use the replication monitoring functions: