I reckon for testing this we should add a read replica into the mix, to also test that the FDW details get replicated down to read replicas.
Some initial notes:
fdw
Lists 4 steps:
1. Install the postgres_fdw extension using CREATE EXTENSION.
2. Create a foreign server object, using CREATE SERVER, to represent each remote database you want to connect to. Specify connection information, except user and password, as options of the server object.
3. Create a user mapping, using CREATE USER MAPPING, for each database user you want to allow to access each foreign server. Specify the remote user name and password to use as user and password options of the user mapping.
4. Create a foreign table, using CREATE FOREIGN TABLE or IMPORT FOREIGN SCHEMA, for each remote table you want to access. The columns of the foreign table must match the referenced remote table. You can, however, use table and/or column names different from the remote table's, if you specify the correct remote names as options of the foreign table object.
postgres_fdw
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
This needs to be run by a superuser. Potential issue here?
You can debug which plugins are available with:
SELECT * FROM pg_available_extensions where name='postgres_fdw';
This is available in H and LMS in production.
When you create a server you don't specify the username or password; those are attached to the user:
(specify these in a user mapping, instead, or use a service file)
Service files probably aren't an option, as I think they are on the DB machine itself.
The SQL to create the server in the first place looks a bit like this:
CREATE SERVER IF NOT EXISTS local_server_name
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (
        host 'hostname',
        dbname 'dbname',
        port '5432'
    );
I think we can alter the settings here like this:
ALTER SERVER local_server_name
    OPTIONS (
        SET host 'new_host',
        SET port 'new_port'
    );
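If we want to double-check what ended up on the server object, the options are visible in the pg_foreign_server catalog:
SELECT srvname, srvoptions FROM pg_foreign_server;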
CREATE USER MAPPING IF NOT EXISTS
    FOR local_reporting_user
    SERVER local_server_name
    OPTIONS (
        user 'remote_reporting_user',
        password 'remote_password'
    );
The users will need to exist on both sides: the local user on our side and the remote user on the target database.
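For illustration, something like this would have to exist on the remote side already (the table name here is just a placeholder):
-- on the remote database: a read-only user for the wrapper to log in as
CREATE USER remote_reporting_user WITH PASSWORD 'remote_password';
GRANT SELECT ON some_remote_table TO remote_reporting_user;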
We can alter the user's settings with something like:
ALTER USER MAPPING
    FOR local_reporting_user
    SERVER local_server_name
    OPTIONS (
        SET user 'new_user',
        SET password 'new_password'
    );
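Existing mappings can be inspected via the pg_user_mappings view (the options, including the password, are hidden unless you are the mapped user or a superuser):
SELECT srvname, usename, umoptions FROM pg_user_mappings;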
You appear to be able to specify every column, or import directly from a remote table. It seems like the name you choose is the link between the two.
Specifying every column:
CREATE FOREIGN TABLE films (
    code char(5) NOT NULL,
    title varchar(40) NOT NULL,
    ...
)
SERVER film_server;
This kind of sucks as we'd have to duplicate the entire schema we are targeting. On the plus side, we can pick and choose columns.
A slightly better option, if we're happy to see all columns, is to not specify the exact schema ourselves but pull it in:
IMPORT FOREIGN SCHEMA film_schema
    LIMIT TO (films)
    FROM SERVER film_server
    INTO local_schema;
This will automatically copy it over.
What about table schema changes? It seems you have to blow it away and re-import, or manually tweak it, which seems like a pain. If we have calculated tables or views based on this, we are probably going to have to recalculate them. This would be a plus point for live, rather than materialized, views.
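For the "blow it away and re-import" route, using the example names above, that would look something like:
DROP FOREIGN TABLE IF EXISTS local_schema.films;
IMPORT FOREIGN SCHEMA film_schema
    LIMIT TO (films)
    FROM SERVER film_server
    INTO local_schema;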
We probably want to look at setting updatable=False on our tables. If our user is read-only we will fail anyway, but this means the error would be raised on our side before we try.
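A minimal sketch of that, again with the example names above:
ALTER FOREIGN TABLE local_schema.films OPTIONS (ADD updatable 'false');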
I have been looking into modifications needed to our current topology to enable an inter-region connection, leaving compliance to one side for the moment. A solution could be found using VPC Peering, a technology that enables IP routing between subnets hosted in different VPCs.
Here is a rough diagram detailing the change from the perspective of a Hypothesis region.
TLDR summary:
My biggest concern is around using the replica to run BI queries (which tend to be on the slow side).
We should consider querying the primary, or accepting that there's a limit on how long queries can take and that the limit differs depending on the activity on the primary server. I find that limit hard to reason about, and the steps to increase/control it all seem to take a toll on the primary's performance, which might be one of the reasons to pick the replica in the first place.
Some test Python code, with comments, showing findings around Postgres FDW.
Setting up connections to H and LMS.
import os
import uuid
import subprocess
import psycopg2
databases = {
    "h": {
        "host": os.environ.get("H_DB_HOST", "localhost"),
        "port": os.environ.get("H_DB_PORT", 5432),
        "name": os.environ.get("H_DB_NAME", "postgres"),
        "user": os.environ.get("H_DB_USER", "postgres"),
        "password": os.environ.get("H_DB_PASS", "postgres"),
    },
    "lms": {
        "host": os.environ.get("LMS_DB_HOST", "localhost"),
        "port": os.environ.get("LMS_DB_PORT", 5433),
        "name": os.environ.get("LMS_DB_NAME", "postgres"),
        "user": os.environ.get("LMS_DB_USER", "postgres"),
        "password": os.environ.get("LMS_DB_PASS", "postgres"),
    },
}


def db_connection(host, port, name, user, password):
    return psycopg2.connect(
        host=host, port=port, database=name, user=user, password=password
    )
h_conn = db_connection(**databases["h"])
lms_conn = db_connection(**databases["lms"])
h_cur = h_conn.cursor()
lms_cur = lms_conn.cursor()
We won't use the existing user in the H database but just one that has access to only selected tables.
Here we are creating this restricted user in H first.
FDW_USER = "fdw_user"
FDW_PASSWORD = "fdw_password"
try:
    h_cur.execute(f"CREATE USER {FDW_USER} WITH PASSWORD %s", (FDW_PASSWORD,))
except psycopg2.errors.DuplicateObject:
    print(f"Role {FDW_USER} already exists.")
    h_conn.rollback()
We are only granting this user access to a couple of tables.
H_TABLES = ["annotation", "document"]
for table_name in H_TABLES:
    h_cur.execute(f"GRANT SELECT ON {table_name} TO {FDW_USER}")
h_conn.commit()
Granting could, for example, happen at the schema level (e.g. all of a hypothetical reporting schema).
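A sketch of what those schema-level grants could look like, assuming a hypothetical reporting schema:
-- 'reporting' is a hypothetical schema name
GRANT USAGE ON SCHEMA reporting TO fdw_user;
GRANT SELECT ON ALL TABLES IN SCHEMA reporting TO fdw_user;
-- also cover tables created later in that schema
ALTER DEFAULT PRIVILEGES IN SCHEMA reporting GRANT SELECT ON TABLES TO fdw_user;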
In production we'd handle this user as we do any other RDS users, probably outside the application code, ideally still in some form of "code", i.e. we can make an Ansible/Terraform PR to add extra permissions.
These statements require RDS superuser permissions.
They only need to be run once and they don't contain any parameters/settings.
We should run this as part of the RDS initialization process (if there's a script to do that), and maybe just manually for existing DBs (LMS).
Other alternatives could involve giving existing users superuser privileges, but that's best avoided.
https://www.postgresql.org/docs/current/sql-createextension.html https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.PostgreSQL.CommonDBATasks.html
Create the extension as admin and grant its usage to the user that the LMS app runs with.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
GRANT USAGE ON FOREIGN DATA WRAPPER postgres_fdw TO "lms-prod";
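To sanity-check that the extension and the grant are in place, we can look at the catalogs:
SELECT extname, extversion FROM pg_extension WHERE extname = 'postgres_fdw';
SELECT fdwname, fdwacl FROM pg_foreign_data_wrapper;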
All the following steps could be run on app init like:
https://github.com/hypothesis/h/blob/main/h/db/__init__.py
Point FDW to H (from LMS); here we'd take the connection details from env variables (the location of the H database).
https://www.postgresql.org/docs/current/sql-createserver.html
FDW_NAME = "h"
lms_cur.execute(
    f"CREATE SERVER IF NOT EXISTS {FDW_NAME} FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname %s, host %s, port %s)",
    (databases["h"]["name"], databases["h"]["host"], str(databases["h"]["port"])),
)
Now create a user mapping; this makes using a local user in LMS behave like the remote user in H.
Again, these values could be read from environment variables.
https://www.postgresql.org/docs/current/sql-createusermapping.html
lms_cur.execute(
    f"""
    CREATE USER MAPPING IF NOT EXISTS FOR {databases['lms']['user']}
    SERVER {FDW_NAME}
    OPTIONS (user %s, password %s);
    """,
    (FDW_USER, FDW_PASSWORD),
)
Create a schema for the fdw tables, not necessary but tidier
lms_cur.execute("CREATE SCHEMA IF NOT EXISTS h;")
Now import the tables we are interested in into the new schema.
for table_name in H_TABLES:
    try:
        # Import a selection of tables into the recently created schema
        lms_cur.execute(
            f"""
            IMPORT FOREIGN SCHEMA "public" LIMIT TO ({table_name}) FROM SERVER {FDW_NAME} INTO h;
            """
        )
        lms_conn.commit()
    except psycopg2.errors.DuplicateTable:
        print(f"H table {table_name} already imported.")
        lms_conn.rollback()
After that we can query H data normally; we just need to specify the schema.
lms_cur.execute("SELECT count(id) from h.annotation")
print("Querying LMS for H's anno count:", lms_cur.fetchone()[0])
We'll simulate how changes propagate downstream by creating a new column directly in H (not using the FDW).
h_cur.execute("ALTER TABLE annotation ADD COLUMN IF NOT EXISTS new_column TEXT")
h_conn.commit()
Trying to just query the new column will fail
try:
    lms_cur.execute("SELECT new_column from h.annotation")
except psycopg2.errors.UndefinedColumn:
    lms_conn.rollback()
We can still make queries not affected by the change
lms_cur.execute("SELECT count(id) from h.annotation")
print("Querying LMS for H's anno count:", lms_cur.fetchone()[0])
The local foreign table can be altered to match the source one.
lms_cur.execute(
    "ALTER FOREIGN TABLE h.annotation ADD COLUMN IF NOT EXISTS new_column TEXT"
)
lms_conn.commit()
And now we can query the new column
lms_cur.execute("SELECT new_column from h.annotation")
Alter the source table again; this time remove the new column to go back to normal.
h_cur.execute("ALTER TABLE annotation DROP COLUMN new_column")
h_conn.commit()
Dropping the foreign table and re-importing brings the latest schema in.
The same is true at the schema level: cascade-deleting the schema where the foreign tables live and then re-importing them will update the structure of all tables.
lms_cur.execute("DROP SCHEMA IF EXISTS h CASCADE")
# Create a schema for the fdw tables, not necessary but tidier
lms_cur.execute("CREATE SCHEMA IF NOT EXISTS h;")
for table_name in H_TABLES:
    try:
        # Import a selection of tables into the recently created schema
        lms_cur.execute(
            f"""
            IMPORT FOREIGN SCHEMA "public" LIMIT TO ({table_name}) FROM SERVER {FDW_NAME} INTO h;
            """
        )
        lms_conn.commit()
    except psycopg2.errors.DuplicateTable:
        print(f"H table {table_name} already imported.")
        lms_conn.rollback()
This suggests that we should have different approaches available to manage schema changes:
- The "recreate the world" approach above (drop the FDW schema and re-import). This might be slow, as some materialized views will need to be built again; if in practice it is "fast enough" it might be all we need. It could also be useful to run it locally to detect whether any changes in the H/LMS schema break any of the new views.
- We could set up alembic (where all migrations are SQL based, not model based), or a similar mechanism, to surgically alter tables/views, and delete and recreate specific ones, for when the "recreate the world" approach is not suitable.
Write operations are possible through FDW
lms_cur.execute(
    """INSERT INTO h.annotation (id, created, updated, userid, groupid, text, text_rendered, tags, shared, target_uri, target_uri_normalized, target_selectors, "references", extra, deleted, document_id)
    SELECT %s, %s, %s, userid, groupid, text, text_rendered, tags, shared, target_uri, target_uri_normalized, target_selectors, "references", extra, deleted, document_id from h.annotation order by created limit 1""",
    (
        uuid.uuid4().hex,
        "2020-10-13",
        "2020-10-13",
    ),
)
We can make the whole server updatable=False and also change the setting on individual tables.
lms_cur.execute("ALTER FOREIGN TABLE h.annotation OPTIONS (ADD updatable 'false')")
lms_conn.commit()
try:
    lms_cur.execute(
        """INSERT INTO h.annotation (id, created, updated, userid, groupid, text, text_rendered, tags, shared, target_uri, target_uri_normalized, target_selectors, "references", extra, deleted, document_id)
        SELECT %s, %s, %s, userid, groupid, text, text_rendered, tags, shared, target_uri, target_uri_normalized, target_selectors, "references", extra, deleted, document_id from h.annotation order by created limit 1""",
        (
            uuid.uuid4().hex,
            "2020-10-13",
            "2020-10-13",
        ),
    )
except psycopg2.errors.ObjectNotInPrerequisiteState:
    pass
lms_conn.rollback()
In a production environment we should set the whole server as updatable=False and then change the setting in individual tables.
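A sketch of that, using the server created above; the table name is a placeholder for anything we ever genuinely need to write to:
ALTER SERVER h OPTIONS (ADD updatable 'false');
-- re-enable writes only where needed (placeholder table name)
ALTER FOREIGN TABLE h.some_writable_table OPTIONS (ADD updatable 'true');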
Create a Materialized View (MV) from a foreign table (FT).
lms_cur.execute(
    """CREATE MATERIALIZED VIEW fdw_mv_anno
    AS SELECT id, created, updated FROM h.annotation"""
)
lms_cur.execute(
    "REFRESH MATERIALIZED VIEW fdw_mv_anno;"
)
lms_conn.commit()
MV can be queried
lms_cur.execute("SELECT count(*) from fdw_mv_anno")
lms_conn.commit()
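If we want the MV to stay readable while it refreshes, Postgres supports REFRESH MATERIALIZED VIEW CONCURRENTLY, which requires a unique index on the view. A sketch using the MV above:
CREATE UNIQUE INDEX IF NOT EXISTS fdw_mv_anno_id_idx ON fdw_mv_anno (id);
REFRESH MATERIALIZED VIEW CONCURRENTLY fdw_mv_anno;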
We simulate an error in the source DB by stopping the Docker container here.
subprocess.run("docker stop h_postgres_1", shell=True, check=True)
MVs hold their data in the local DB, so there's no need to access the FDW; this works fine.
lms_cur.execute("SELECT count(*) from fdw_mv_anno")
Running any query against foreign tables fails
try:
    lms_cur.execute("SELECT count(*) from h.annotation")
except psycopg2.errors.ConnectionFailure:
    lms_conn.rollback()
Below is some performance testing on the test RDS instances. We also ran some join queries using both foreign and local data.
In this testing I can't see a huge impact/overhead from the FDW usage.
I think the use of a replica is orthogonal to both performance and the use of FDW; however, running long queries on replicas brings a whole new set of complexity, as they can fail when they conflict with incoming replication, which is very likely in the context of BI.
Replication can be tuned to make this less likely, but at the cost of performance and/or memory consumption on the primary.
https://www.postgresql.org/docs/14/hot-standby.html#HOT-STANDBY-CONFLICT
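For reference, these are the standby-side knobs that the doc above discusses; the values here are just examples, and on RDS they would live in the replica's parameter group rather than being set with ALTER SYSTEM:
-- how long the standby lets queries block WAL replay before cancelling them
ALTER SYSTEM SET max_standby_streaming_delay = '30min';
-- or make the primary aware of standby queries (at the cost of extra bloat on the primary)
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();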
We should consider the option to point FDW directly to the H primary:
We believe that the Postgres foreign-data wrapper (FDW) module will be enough to satisfy our need to make cross-DB queries.
In order for this to work we need to validate our assumptions.
Assumptions
- No security concerns in region. Let's have a think that this doesn't cause any problems in region.
Tasks
h