timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

[Bug]: CAGGs refresh policies migration to Cloud fails if the owner is not tsdbadmin #5745

Closed alejandrodnm closed 11 months ago

alejandrodnm commented 1 year ago

What type of bug is this?

Unexpected error

What subsystems and features are affected?

Backup, Continuous aggregate

What happened?

When migrating to TS Cloud, if there's a CAGG refresh policy owned by a role other than tsdbadmin, there's no way to properly migrate the corresponding bgw_job.

Using pg_dump/pg_restore to copy the database, we successfully moved the _timescaledb_config.bgw_job table. The problem is that the jobs keep failing with an ownership error:

2023-06-05 12:18:35 UTC [1152]: ERROR: must be owner of view sample_cagg

It all boils down to tsdbadmin not having enough permissions to change the ownership of objects created in _timescaledb_internal. For example:

ALTER MATERIALIZED VIEW team1.sample_cagg OWNER TO team1;

permission denied for schema _timescaledb_internal

I would've expected jobs to be migrated and running successfully.
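
For reference, a minimal way to inspect the relevant privileges (a hedged diagnostic sketch; has_schema_privilege and pg_roles are standard PostgreSQL, and the role names are the ones used in this issue):

-- Which of the two roles hold USAGE/CREATE on the internal schema?
SELECT r.rolname,
       has_schema_privilege(r.rolname, '_timescaledb_internal', 'USAGE')  AS has_usage,
       has_schema_privilege(r.rolname, '_timescaledb_internal', 'CREATE') AS has_create
  FROM pg_roles r
 WHERE r.rolname IN ('tsdbadmin', 'team1');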

In the "How can we reproduce the bug?" section below I added more context and the steps to reproduce it.

TimescaleDB version affected

2.11.0

PostgreSQL version used

15.3

What operating system did you use?

Timescale Cloud

What installation method did you use?

Not applicable

What platform did you run on?

Timescale Cloud

Relevant log output and stack trace

2023-06-05 12:18:34 UTC [118]: LOG: background worker "TimescaleDB Background Worker Scheduler" (PID 776) exited with exit code 1
2023-06-05 12:18:35 UTC [1152]: ERROR: must be owner of view sample_cagg
2023-06-05 12:18:35 UTC [1152]: LOG: job 1000 threw an error
2023-06-05 12:18:35 UTC [118]: " (PID 1152) exited with exit code 1

How can we reproduce the bug?

To reproduce the issue, let's create a source DB with the following:

CREATE DATABASE team1_db;
\c team1_db;
CREATE EXTENSION timescaledb;
CREATE ROLE team1 WITH LOGIN PASSWORD 'team1';
CREATE SCHEMA team1;
GRANT ALL ON SCHEMA team1 TO team1;

Log in as team1 and create a hypertable with a CAGG and a refresh policy (using SET ROLE team1 won't work because the bgw_job owner won't be set properly):

CREATE table team1.sample_hypertable (
    time timestamptz not null,
    key int not null,
    value float
);

SELECT create_hypertable('team1.sample_hypertable', 'time');

CREATE MATERIALIZED VIEW team1.sample_cagg
  WITH (timescaledb.continuous) AS
  SELECT
    time_bucket('1 day', "time") AS day,
    key,
    max(value) AS high
  FROM team1.sample_hypertable
  GROUP BY day, key;

SELECT add_continuous_aggregate_policy('team1.sample_cagg',
   start_offset => INTERVAL '3 days',
   end_offset => INTERVAL '1 hour',
   schedule_interval => INTERVAL '30 seconds');

This should create the following job:

> select * from timescaledb_information.job_stats s join _timescaledb_config.bgw_job j on j.id = s.job_id where j.id > 999

-[ RECORD 1 ]-------------------------
hypertable_schema      | _timescaledb_internal
hypertable_name        | _materialized_hypertable_2
job_id                 | 1000
last_run_started_at    | 2023-06-05 14:24:15.34038+02
last_successful_finish | 2023-06-05 14:24:15.378182+02
last_run_status        | Success
job_status             | Scheduled
last_run_duration      | 0:00:00.037802
next_start             | 2023-06-05 14:24:45.378182+02
total_runs             | 119
total_successes        | 119
total_failures         | 0
id                     | 1000
application_name       | Refresh Continuous Aggregate Policy [1000]
schedule_interval      | 0:00:30
max_runtime            | 0:00:00
max_retries            | -1
retry_period           | 0:00:30
proc_schema            | _timescaledb_internal
proc_name              | policy_refresh_continuous_aggregate
owner                  | team1
scheduled              | True
fixed_schedule         | False
initial_start          | <null>
hypertable_id          | 2
config                 | {"end_offset": "01:00:00", "start_offset": "3 days", "mat_hypertable_id": 2}
check_schema           | _timescaledb_internal
check_name             | policy_refresh_continuous_aggregate_check
timezone               | <null>

Let's create a dump:

pg_dump -d team1_db --no-owner > dump.sql

We need to prepare our TS Cloud target DB by creating the role and executing the pre-restore function (this avoids a metadata key error):

CREATE ROLE team1 WITH LOGIN;
GRANT team1 TO tsdbadmin; --required to insert in the bgw_job tables in TS Cloud
SELECT timescaledb_pre_restore();

NOTE: Regarding GRANT team1 TO tsdbadmin;, there's a trigger on the _timescaledb_config.bgw_job table called validate_job_role that will reject the insert into bgw_job if the CURRENT_ROLE is not a member of the job's owner role:

\dftS+ public.validate_job_role

...
roleowner = NEW.owner;
...
IF NOT pg_catalog.pg_has_role(current_role, roleowner, 'MEMBER') THEN
  RAISE EXCEPTION 'Cannot % table %.% row ID %', TG_OP, TG_TABLE_SCHEMA, TG_TABLE_NAME, NEW.id USING HINT = 'owner is not a member of the current role',
  ERRCODE = '42501';
END IF;
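
Given that trigger, a quick pre-restore sanity check (a hedged sketch reusing the exact pg_has_role call from the trigger; team1 is the job owner in this example) is:

-- Should return true once GRANT team1 TO tsdbadmin; has been executed
SELECT pg_catalog.pg_has_role(current_role, 'team1', 'MEMBER');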

Restore source into target:

psql -d $TS_CLOUD_TARGET_SERVICE < dump.sql

Run SELECT timescaledb_post_restore(); on the Cloud DB.

Let's inspect the jobs on our Cloud DB:

> select * from timescaledb_information.job_stats s join _timescaledb_config.bgw_job j on j.id = s.job_id where j.id > 999;

-[ RECORD 1 ]-------------------------
hypertable_schema      | _timescaledb_internal
hypertable_name        | _materialized_hypertable_2
job_id                 | 1000
last_run_started_at    | 2023-06-05 12:21:16.227955+00
last_successful_finish | -infinity
last_run_status        | Failed
job_status             | Scheduled
last_run_duration      | 0:00:00.008116
next_start             | 2023-06-05 12:23:39.204821+00
total_runs             | 6
total_successes        | 0
total_failures         | 6
id                     | 1000
application_name       | Refresh Continuous Aggregate Policy [1000]
schedule_interval      | 0:00:30
max_runtime            | 0:00:00
max_retries            | -1
retry_period           | 0:00:30
proc_schema            | _timescaledb_internal
proc_name              | policy_refresh_continuous_aggregate
owner                  | team1
scheduled              | True
fixed_schedule         | False
initial_start          | <null>
hypertable_id          | 2
config                 | {"end_offset": "01:00:00", "start_offset": "3 days", "mat_hypertable_id": 2}
check_schema           | _timescaledb_internal
check_name             | policy_refresh_continuous_aggregate_check
timezone               | <null>

We see that the last run status is Failed. Looking at the service logs we get:

2023-06-05 12:18:34 UTC [118]: LOG: background worker "TimescaleDB Background Worker Scheduler" (PID 776) exited with exit code 1
2023-06-05 12:18:35 UTC [1152]: ERROR: must be owner of view sample_cagg
2023-06-05 12:18:35 UTC [1152]: LOG: job 1000 threw an error
2023-06-05 12:18:35 UTC [118]: " (PID 1152) exited with exit code 1

Since we used --no-owner to generate the dump, all the objects now belong to tsdbadmin. Trying to change ownership of the view back to team1, we get the following error:

> ALTER MATERIALIZED VIEW team1.sample_cagg OWNER TO team1;

permission denied for schema _timescaledb_internal

This is also the reason why we need to use --no-owner to generate the dump. Otherwise, statements like the following, which set the ownership of the hypertable's internal views, fail:

--
-- Name: _direct_view_2; Type: VIEW; Schema: _timescaledb_internal; Owner: team1
--

CREATE VIEW _timescaledb_internal._direct_view_2 AS
 SELECT public.time_bucket('1 day'::interval, sample_hypertable."time") AS day,
    sample_hypertable.key,
    max(sample_hypertable.value) AS high
   FROM team1.sample_hypertable
  GROUP BY (public.time_bucket('1 day'::interval, sample_hypertable."time")), sample_hypertable.key;

-- Setting back ownership to team1 fails with permission denied for schema
-- _timescaledb_internal
ALTER TABLE _timescaledb_internal._direct_view_2 OWNER TO team1;
alejandrodnm commented 1 year ago

This issue also applies to all the other policies:

2023-06-06 10:23:43 UTC [919]: DETAIL: Message: (must be owner of hypertable "compression_hypertable"), Detail: ().
2023-06-06 10:23:43 UTC [919]: WARNING: compressing chunk "_timescaledb_internal._hyper_1_14_chunk" failed when compression policy is executed

We can't set the ownership:

tsdbadmin@d3kzry1isu:tsdb> alter table team1.compression_hypertable owner to team1;
permission denied for schema _timescaledb_internal

NOTE: Should the fact that the job status is still SUCCESS be filed as a separate bug? If there's an ownership constraint, couldn't we check it when the job starts and fail like the other policies do?

sotirissl commented 1 year ago

Hi @alejandrodnm, thank you for reporting this issue and for the detailed reproduction steps. We have an update from our Cloud team: they have fixed the issue and deployed it to production, and they have also updated the permissions on existing services. Please check again and let us know if it is working now. Thank you.

Harkishen-Singh commented 1 year ago

I tried testing this issue on MST and Cloud today, and I still get the same error described in the issue:

2023-06-13 10:35:19 UTC [118]: LOG: background worker "TimescaleDB Background Worker Scheduler" (PID 729) exited with exit code 1
2023-06-13 10:39:30 UTC [121]: LOG: checkpoint complete: wrote 13400 buffers (10.2%); 0 WAL file(s) added, 0 removed, 1 recycled; write=270.020 s, sync=0.004 s, total=270.032 s; sync files=306, longest=0.004 s, average=0.001 s; distance=131607 kB, estimate=131607 kB
2023-06-13 10:40:44 UTC [1116]: ERROR: must be owner of view sample_cagg
2023-06-13 10:40:44 UTC [1116]: LOG: job 1000 threw an error
2023-06-13 10:40:44 UTC [118]: " (PID 1116) exited with exit code 1
2023-06-13 10:41:03 UTC [1112]: STATEMENT: call run_job(1000);
2023-06-13 10:41:03 UTC [1112]: LOG: refreshing continuous aggregate "sample_cagg" in window [ 2023-06-11 00:00:00+00, 2023-06-13 00:00:00+00 ]
2023-06-13 10:41:11 UTC [1151]: ERROR: must be owner of view sample_cagg
2023-06-13 10:41:11 UTC [1151]: LOG: job 1000 threw an error
2023-06-13 10:41:11 UTC [118]: " (PID 1151) exited with exit code 1
Harkishen-Singh commented 1 year ago

UPDATE:

After running alter table team1.sample_cagg owner to team1;, it succeeded, and I then see the CAGG working in the logs.

Note: It seems that I have to run SELECT timescaledb_post_restore(); twice to make the CAGG refresh policy show up in the jobs view. This seems weird.

2023-06-13 10:44:40 UTC [1112]: LOG: statement: alter table team1.sample_cagg owner to team1;
2023-06-13 10:45:00 UTC [121]: LOG: checkpoint starting: time
2023-06-13 10:45:06 UTC [121]: LOG: checkpoint complete: wrote 62 buffers (0.0%); 0 WAL file(s) added, 0 removed, 2 recycled; write=6.114 s, sync=0.003 s, total=6.180 s; sync files=31, longest=0.002 s, average=0.001 s; distance=32232 kB, estimate=121670 kB
2023-06-13 10:45:11 UTC [1112]: STATEMENT: call run_job(1000);
2023-06-13 10:45:11 UTC [1112]: LOG: refreshing continuous aggregate "sample_cagg" in window [ 2023-06-11 00:00:00+00, 2023-06-13 00:00:00+00 ]
2023-06-13 10:45:53 UTC [1504]: LOG: refreshing continuous aggregate "sample_cagg" in window [ 2023-06-11 00:00:00+00, 2023-06-13 00:00:00+00 ]

Fetching the CAGG refresh policy details:

tsdb=> select * from timescaledb_information.job_stats s join _timescaledb_config.bgw_job j on j.id = s.job_id where j.id > 999;
-[ RECORD 1 ]----------+-----------------------------------------------------------------------------
hypertable_schema      | _timescaledb_internal
hypertable_name        | _materialized_hypertable_2
job_id                 | 1000
last_run_started_at    | 2023-06-13 10:50:23.37566+00
last_successful_finish | 2023-06-13 10:50:23.381445+00
last_run_status        | Success
job_status             | Scheduled
last_run_duration      | 00:00:00.005785
next_start             | 2023-06-13 10:50:53.381445+00
total_runs             | 14
total_successes        | 10
total_failures         | 4
id                     | 1000
application_name       | Refresh Continuous Aggregate Policy [1000]
schedule_interval      | 00:00:30
max_runtime            | 00:00:00
max_retries            | -1
retry_period           | 00:00:30
proc_schema            | _timescaledb_internal
proc_name              | policy_refresh_continuous_aggregate
owner                  | team1
scheduled              | t
fixed_schedule         | f
initial_start          | 
hypertable_id          | 2
config                 | {"end_offset": "01:00:00", "start_offset": "3 days", "mat_hypertable_id": 2}
check_schema           | _timescaledb_internal
check_name             | policy_refresh_continuous_aggregate_check
timezone               | 
sotirissl commented 1 year ago

Hi @Harkishen-Singh and @alejandrodnm

Thanks for the update. I saw you mentioned "I see caggs working in the logs". Is the issue still present? Is there anything you need help with from us?

alejandrodnm commented 1 year ago

I'm still getting the permission errors.

tsdbadmin@hqlnafqe94:tsdb> alter table team1.sample_h owner to team1;
permission denied for schema _timescaledb_internal

It only succeeds if the hypertable is empty (presumably because changing the owner of a non-empty hypertable also changes the ownership of its chunks, which live in _timescaledb_internal):

tsdbadmin@hqlnafqe94:tsdb> select count(*) from team1.sample_h2;
╒═══════╕
│ count │
╞═══════╡
│ 0     │
╘═══════╛
SELECT 1
Time: 0.041s
tsdbadmin@hqlnafqe94:tsdb> alter table team1.sample_h2 owner to team1;
ALTER TABLE
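
A quick way to tell whether a hypertable falls into the empty case (a hedged sketch; show_chunks is the standard TimescaleDB helper, and team1.sample_h is the table from above) is to count its chunks, since those are the objects that live in _timescaledb_internal:

-- 0 rows means no chunks exist yet, so the ownership change does not touch _timescaledb_internal
SELECT count(*) AS chunk_count FROM show_chunks('team1.sample_h');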
jnidzwetzki commented 11 months ago

Hello @alejandrodnm,

Thanks for the detailed steps to reproduce the problem. I was able to reproduce it with PostgreSQL 14.9 and TimescaleDB 2.11.2 in our cloud. When changing the owner of the team1.sample_cagg view, I get the same error message:

tsdb=> ALTER MATERIALIZED VIEW team1.sample_cagg OWNER TO team1;
ERROR:  permission denied for schema _timescaledb_internal
alexeyklyukin commented 11 months ago

@alejandrodnm did you do (as tsdbadmin)

tsdb => GRANT USAGE, CREATE ON SCHEMA  _timescaledb_internal TO team1;

beforehand?

jnidzwetzki commented 11 months ago

@alejandrodnm I tested it in our dev cloud and it worked for me. I was able to change the owner of the CAGG afterward.

tsdb=> ALTER MATERIALIZED VIEW team1.sample_cagg OWNER TO team1;
ALTER MATERIALIZED VIEW
jnidzwetzki commented 11 months ago

As discussed on Slack, the issue is solved after executing GRANT USAGE, CREATE ON SCHEMA _timescaledb_internal TO team1;.
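
Putting the thread together, a sketch of the target-side preparation that worked here (role and object names as in the reproduction above; exact steps may vary per service):

-- On the TS Cloud target, as tsdbadmin:
CREATE ROLE team1 WITH LOGIN;
GRANT team1 TO tsdbadmin;              -- required by the validate_job_role trigger on bgw_job
SELECT timescaledb_pre_restore();
-- restore dump.sql here, then:
SELECT timescaledb_post_restore();
GRANT USAGE, CREATE ON SCHEMA _timescaledb_internal TO team1;  -- the fix from this thread
ALTER MATERIALIZED VIEW team1.sample_cagg OWNER TO team1;      -- now succeeds and the refresh job recovers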

polly-utopi commented 7 months ago

Hi! As a follow-up to this, I'm just wondering whether the refresh_continuous_aggregate function should only be executable by the owner of the continuous aggregate. I realise the above was resolved by setting the owner on the continuous aggregate, but we have multiple DB users who should be able to refresh our continuous aggregates and currently cannot. Is this technically a bug, or is it intended?
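
A possible workaround in the meantime, relying on the usual PostgreSQL rule that members of the owning role inherit its ownership privileges (using the team1/sample_cagg names from the reproduction above, with analyst_user as a placeholder for one of the extra users), would be:

-- Make the extra user a member of the role that owns the CAGG
GRANT team1 TO analyst_user;
-- analyst_user should then be able to refresh the CAGG, e.g.:
-- CALL refresh_continuous_aggregate('team1.sample_cagg', now() - INTERVAL '3 days', now());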

pohmelie commented 1 month ago

@polly-utopi we have the same issue right now. Did you solve it somehow?