reorg / pg_repack

Reorganize tables in PostgreSQL databases with minimal locks
BSD 3-Clause "New" or "Revised" License
1.91k stars 177 forks source link

Using repack to prune old data #279

Open fazalmajid opened 3 years ago

fazalmajid commented 3 years ago

At my company we use pg_repack primarily as a tool to delete obsolete historical data or for compliance reasons (e.g. GDPR). Running repack kills two birds with one stone: remove bloat but also remove old data where a long-running DELETE is impractical due to the maintenance window it would entail due to full table locks.

Our current implementation requires no C level changes to the extension, just patching the PL/PGSQL code, using the migration below. Essentially we add a table repack.filters that has a WHERE clause that specified what rows to keep, and optionall a repack.joins table with JOIN clauses if we need information from other tables to make the decision. If there is no matching row in those tables, the existing pg_repack behavior (copy everything) holds.

I think this is a fairly valuable feature and use case for pg_repack, and if there is interest, I'd be happy to package it as a PR for inclusion upstream.

CREATE TABLE IF NOT EXISTS repack.filters (
  tablename TEXT PRIMARY KEY,
  clause    TEXT
);

CREATE TABLE IF NOT EXISTS repack.joins (
  tablename TEXT PRIMARY KEY,
  joins     TEXT NOT NULL
);

DO $$
DECLARE
  _v TEXT;
  _vn TEXT;
BEGIN
  UPDATE pg_proc
  SET prosrc = REPLACE(prosrc, 'quote_ident', '''T.''||quote_ident')
  WHERE proname='get_columns_for_create_as'
    AND pronamespace=(SELECT oid FROM pg_namespace WHERE nspname='repack');

  SELECT definition INTO _v
  FROM pg_views WHERE schemaname='repack' AND viewname='tables';
  IF _v ~ 'repack.filters' THEN
    RAISE EXCEPTION 'repack filters patch already applied';
  END IF;
  EXECUTE FORMAT('CREATE OR REPLACE VIEW repack.tables AS %s', _v);
  _vn := REPLACE(_v, 'LEFT JOIN pg_class t ON ((r.reltoastrelid = t.oid))',
                 'LEFT JOIN pg_class T ON R.reltoastrelid = T.oid
     LEFT JOIN repack.filters F ON R.relname=F.tablename
     LEFT JOIN repack.joins J ON R.relname=J.tablename');
  IF _vn = _v THEN
    RAISE EXCEPTION 'repack.table view definition changed, cannot patch 1';
  END IF;
  _v := _vn;
  _vn := REPLACE(_vn,
    ' AS copy_data',
    ' || '' T'' || COALESCE(J.joins, '' '') || COALESCE('' WHERE '' || F.clause || '' '', '''') AS copy_data');
  IF _vn = _v THEN
    RAISE EXCEPTION 'repack.table view definition changed, cannot patch 2';
  END IF;
  _v := _vn;
  _vn := REPLACE(_vn,
    ' AS create_table_2',
    ' || '' T'' AS create_table_2');
  IF _vn = _v THEN
    RAISE EXCEPTION 'repack.table view definition changed, cannot patch 3';
  END IF;
  EXECUTE FORMAT('CREATE OR REPLACE VIEW repack.tables AS %s', _vn);
END;   
$$;

So for instance if you have the table definitions:

CREATE TABLE contract(
  id SERIAL PRIMARY KEY,
  name TEXT NOT NULL,
  expiration DATE NULL
);
CREATE TABLE employee(
  id SERIAL PRIMARY KEY
  name TEXT NOT NULL
  contract INT NOT NULL REFERENCES contract(id)
);

And you want to repack out employees with expired contracts, you would:

INSERT INTO repack.filters (tablename, clause)
VALUES ('employee', 'COALESCE(contract.expiration, CURRENT_DATE) >= CURRENT_DATE');
INSERT INTO repack.joins (tablename, joins)
VALUES ('employee', 'JOIN contract ON employee.contract=contract.id');
scottcc commented 8 months ago

👍

Ten0 commented 2 months ago

I'm doing something similar here: https://github.com/Ten0/pg_repack/commit/27ddafbd97415acb3310ac08a87458b82e58d10e

I have used that to reorder through joins, but not to filter out lines.

I'm wondering about whether I can use this to filter out lines (for a similar reason as yours). I assume this would cause issues if there were foreign key constraints (they wouldn't be checked), but I would expect that if there aren't it shouldn't. Is that correct, or would swapping the tables after not inserting as many lines as there originally were cause issues?

fazalmajid commented 2 months ago

Like all DBA-level operations, it should be used carefully, of course, and if you have foreign keys, you would need to repack/purge the tables in topological sort order. That said, we've used this in production with fairly large tables (100+GB) with no issues. You will typically need to quiesce traffic to the DB at the very beginning and very end so the DDL locks used to swap the tables can be taken.