Open davidfetter opened 3 years ago
Hmm. Hungarian notation. Never heard of it until now. (Or rather I probably did but forgot.) I am from the 90's though :)
I think I named things that way just to keep things straight for myself while I was writing. At some point, at least, there were argument and declared variable names that were a little too similar, and I was too lazy to think of better ones. Happy to receive a PR making it all less of a sharp stick in the eye to the rest of the world.
I can certainly start working on commenting things. Can you suggest a couple obvious not-immediately-obvious places for me to start? I'm under the gun at my day job ATM; I'll work on it as I can.
BTW, your involvement in this has been very informative to me. Thank you!
Happy to help!
For what it's worth, this is saving me a lot of time for a thing I'm doing at my work, so what I'm doing is somewhat self-serving :)
I'm still staring at the code and comments and I'm still really confused, but I failed to explain what I found so confusing, and for that I apologize.
The package consists of a view, which contains columns never used elsewhere in the package whose names are a teensy bit confusing, along with two functions whose intent I'm a little cloudy on.
=== v_fk_con
===
ctab
, and I'm not sure what they signify that's different from the primary key in the referenced table. What is going on there?pg_class
, and LATERAL
constructs, which are a little easier to isolate as to what they're doing and how. Something along the lines of this, although I'm not quite sure it needs the recursive part, or that other things would benefit from it.
WITH RECURSIVE rd AS (
SELECT
/* Stuff we care about in other things */
co.oid AS constraint_oid,
co.conname AS constraint_name,
/* Ugh.
* In pg_constraint, our point of view is the table being constrained,
* which is to say the referring table, so most, but not all, things with
* 'f' in them refer to the referencing table, not to the referenced table.
*
* In particular,
* - confdeltype refers to the ON DELETE action,
* - confrelid is the referring table,
* - conrelid is the referenced table.
*/
co.confdeltype AS delete_action,
co.confrelid AS pk_table_oid,
co.conrelid AS fk_table_oid,
/* The casts to regclass get schema and table, quoted as needed */
co.confrelid::regclass::text AS pk_table,
co.conrelid::regclass::text AS fk_table,
pk_cols.pk_column_names,
fk_cols.fk_column_names,
/* Stuff we need to keep track */
ARRAY[co.confrelid::regclass::text, co.conrelid::regclass::text] AS "path",
false AS has_cycle
FROM
/* Constraints */
pg_catalog.pg_constraint AS co
CROSS JOIN
/*
* Column name(s) in primary key.
* The UNNEST/JOIN business keeps them in the order they have in the key.
* Yay, arrays/unnest!
*/
LATERAL (
SELECT pg_catalog.array_agg(pa.attname ORDER BY pt.ord)
FROM
UNNEST(co.confkey) WITH ORDINALITY pt(attnum, ord)
JOIN
pg_catalog.pg_attribute pa
ON (
co.confrelid = pa.attrelid AND
pt.attnum = pa.attnum
)
) AS pk_cols(pk_column_names)
CROSS JOIN
/* Column name(s) in foreign key */
LATERAL (
SELECT pg_catalog.array_agg(fa.attname ORDER BY ft.ord)
FROM
UNNEST(co.conkey) WITH ORDINALITY ft(attnum, ord)
JOIN
pg_catalog.pg_attribute fa
ON (
co.conrelid = fa.attrelid AND
ft.attnum = fa.attnum
)
) AS fk_cols(fk_column_names)
WHERE
co.contype = 'f' AND
/* Start with tables that don't have FKs to other tables */
NOT EXISTS (
SELECT 1
FROM pg_catalog.pg_constraint cot
WHERE (
cot.contype = 'f' AND
cot.conrelid = co.confrelid
)
)
UNION ALL
SELECT
/* Stuff we care about in other things */
co.oid AS constraint_oid,
co.conname AS constraint_name,
co.confdeltype AS delete_action,
co.conrelid AS fk_table_oid,
co.confrelid AS pk_table_oid,
/* The casts to regclass get schema and table, quoted as needed */
co.conrelid::regclass::text AS fk_table,
co.confrelid::regclass::text AS pk_table,
pk_cols.pk_column_names,
fk_cols.fk_column_names,
/* Stuff we need to keep track */
rd."path" || co.confrelid::regclass::text AS "path",
co.confrelid::regclass::text = ANY(rd."path") AS has_cycle
FROM
/* Constraints */
pg_catalog.pg_constraint AS co
CROSS JOIN
/*
* Column name(s) in primary key.
* The UNNEST/JOIN business keeps them in the order they have in the key.
* Yay, arrays/unnest!
*/
LATERAL (
SELECT pg_catalog.array_agg(pa.attname ORDER BY pt.ord)
FROM
UNNEST(co.confkey) WITH ORDINALITY pt(attnum, ord)
JOIN
pg_catalog.pg_attribute pa
ON (
co.confrelid = pa.attrelid AND
pt.attnum = pa.attnum
)
) AS pk_cols(pk_column_names)
CROSS JOIN
/* Column name(s) in foreign key */
LATERAL (
SELECT pg_catalog.array_agg(fa.attname ORDER BY ft.ord)
FROM
UNNEST(co.conkey) WITH ORDINALITY ft(attnum, ord)
JOIN
pg_catalog.pg_attribute fa
ON (
co.conrelid = fa.attrelid AND
ft.attnum = fa.attnum
)
) AS fk_cols(fk_column_names)
JOIN
rd
ON (
co.contype = 'f' AND
/* Hook up to the previous table */
co.confrelid = rd.fk_table_oid AND
NOT rd.has_cycle
)
)
SELECT *
FROM rd;
=== _recursively_delete
===
jsonb
structures, and I don't quite get why it's doing those. Are they all necessary?recursively_delete
?WITH RECURSIVE
clause with a short list of parameters that define the top of the recursion rather than a PL/pgsql function that calls itself, passing itself lots of parameters of non-obvious purpose. Maybe I'm just weird that way.=== recursively_delete
===
for_realz
! This is awesome!DECLARE
section. It's kinda intimidating. Some documents of what it's all for would be great should it ever need to be modified.SELECT
statement could get rolled, as far as I can tell, into the view.v_fk_cons
extending the LATERAL
clauses in what I hope is the obvious way, i.e. adding another aggregate with an ORDER BY
to their target lists and making them visible in the view.$CTE_AUX_STMT
is supposed to do. It goes out of its way to return a row at each section of the CTE
, but I don't understand why.IN
to an EXISTS
construct because of the giant fun that IN
can inflict, and rearrange the UNION
to a UNION ALL
in order to avoid the extra scan UNION
forces because it's short for UNION DISTINCT
through what might be a very long list something like this:
%I_del AS (
DELETE FROM %I WHERE EXISTS (
SELECT 1 FROM %s WHERE (%s) IN (%s)
)
RETURNING * /* Should this just be the `PK` columns? */
),
%I AS (
SELECT * FROM %I_del
UNION ALL
SELECT %s /* all those NULLs */
/* Maybe make it optional? Doing that would be doable by adding this `WHERE` clause:
WHERE NOT EXISTS (SELECT 1 FROM %_del)
*/
)
COUNT(*)
together with a UNION ALL
instead of a UNION
.Hey, David, this is great stuff, thank you! I'm tied up until Wednesday, so I'll try to get your questions answered point by point by Thursday.
=== v_fk_cons
===
The name v_fk_cons (short for something like "view of foreign key constraints" as I'm sure you've guessed) is crap, I know; it's a working title from long ago that never got changed.
In v_fk_cons, every row describes a single foreign key constraint, and the view is unique on constraint OID. v_fk_cons is not about primary (or unique) key constraints, except insofar as they relate to foreign key constraints.
The ctab* columns correspond to "child" tables; the ptab* columns to "parent" tables. An FK's child table is where the constraint is defined; an FK's parent table is the table that's referenced. Columns for PKs (and on the parent, unique keys) are given for both parent and child so that the view can be traversed in recursion, linking parents to children (to grandchildren, etc.).
View "public.v_fk_cons"
Column | Type | Collation | Nullable | Default
-------------------+--------+-----------+----------+---------
oid | oid | | |
name | name | | |
delete_action | "char" | | |
ctab_oid | oid | | |
ctab_schema_name | name | | |
ctab_name | name | | |
ctab_pk_col_names | name[] | | |
ctab_pk_col_types | text[] | | |
ctab_fk_col_names | name[] | | |
ctab_fk_col_types | text[] | | |
ptab_oid | oid | | |
ptab_schema_name | name | | |
ptab_name | name | | |
ptab_uk_col_names | name[] | | |
ptab_uk_col_types | text[] | | |
All columns should be considered in relation to a single foreign key constraint:
oid: The FK constraint OID.
name: The FK constraint name.
delete_action: The FK constraint delete action.
ctab_oid: The OID of the table upon which the constraint is defined.
ctab_schema_name: The name of the schema of the table upon which the constraint is defined.
ctab_name: The name of the table upon which the constraint is defined.
ctab_pk_col_names: The name(s) of the column(s) comprising the primary key of the table upon which the constraint is defined. The names are given in definition order (here, and likewise in the other names/types columns).
ctab_pk_col_types: The name(s) of the type(s) of the column(s) comprising the primary key of the table upon which the constraint is defined.
ctab_fk_col_names: The name(s) of the column(s) comprising the foreign key in the table upon which the constraint is defined.
ctab_fk_col_types:: The name(s) of the type(s) of the column(s) comprising the foreign key in the table upon which the constraint is defined.
ptab_oid: The OID of the table that the constraint references.
ptab_schema_name: The name of the schema of the table that the constraint references.
ptab_name: The name of the table that the constraint references.
ptab_uk_col_names: In the table that the constraint references, the name(s) of the column(s) comprising the primary or unique key referenced.
ptab_uk_col_types: In the table that the constraint references, the name(s) of the type(s) of the column(s) comprising the primary or unique key referenced.
Your new view definition is very interesting -- there's some stuff there that I haven't seen before. It'll take more time than I have at the moment to dissect and understand it, but I did run it against some production data. I think there may be a bad linkage in there somewhere.
Once has_cycles flips to true (as the records occur in the output, top to bottom), it remains true through the end of the output.
has_cycles, when it flips, flips for the record after the one it should have.
And once has_cycles flips, other things get weird:
Duplicate constraints can occur.
Where a single constraint occurs twice (same oid, same pk_table_oid, same fk_table_oid), the records differ on path (which isn't possible).
The paths stop making sense.
For example, where the damage preview for a deletion from our production table 'users' looks like this (abridged)...
INFO: 1 users
INFO: 0 a | ad_submissions.["user_id"]
INFO: 0 c | | ad_leads.["ad_submission_id"]
INFO: 0 c | | ad_stats.["ad_submission_id"]
...and the relevant v_fk_cons records look like this...
oid | name | delete_action | ctab_oid | ctab_schema_name | ctab_name | ctab_pk_col_names | ctab_pk_col_types | ctab_fk_col_names | ctab_fk_col_types | ptab_oid | ptab_schema_name | ptab_name | ptab_uk_col_names | ptab_uk_col_types
----------+------------------------------------------------+---------------+----------+------------------+-------------------------------+-------------------+-------------------+-------------------------------+-------------------+----------+------------------+----------------------------+-------------------+-------------------
11087213 | ad_leads_ad_submission_id_fk | c | 10771849 | public | ad_leads | {id} | {integer} | {ad_submission_id} | {integer} | 10771870 | public | ad_submissions | {id} | {integer}
11087218 | ad_stats_ad_submission_id_fk | c | 10771857 | public | ad_stats | {id} | {integer} | {ad_submission_id} | {integer} | 10771870 | public | ad_submissions | {id} | {integer}
11087223 | ad_submissions_user_id_fk | a | 10771870 | public | ad_submissions | {id} | {integer} | {user_id} | {integer} | 10772296 | public | users | {id} | {integer}
...the corresponding records from the new view look like this:
constraint_oid | constraint_name | delete_action | pk_table_oid | fk_table_oid | pk_table | fk_table | pk_column_names | fk_column_names | path | has_cycle
----------------+-----------------------------------------------------------------+---------------+--------------+--------------+----------------------------------------------------+-------------------------------+-----------------+---------------------------------------------+----------------------------------------------------+-----------
11087223 | ad_submissions_user_id_fk | a | 10771870 | 10772296 | ad_submissions | users | {id} | {user_id} | {roles,users,users} | t
11087223 | ad_submissions_user_id_fk | a | 10771870 | 10772296 | ad_submissions | users | {id} | {user_id} | {clients,users,users} | t
The ad_leads_ad_submission_id_fk and ad_stats_ad_submission_id_fk constraints don't occur at all; ad_submissions_user_id_fk occurs twice; and I'm not sure where those paths came from. For example, 'users' does attach to 'roles', but on a completely separate branch having nothing to do with ad_submissions. (Also, has_cycle is true on both records, but there's really no cycle anywhere near there.)
I'm not sure what you mean by, "take advantage of some casts, averting probes into pg_class." I see where you comment, "The casts to regclass get schema and table," but there's no schema that I can see in the final output. Are those columns meant to be cast from REGCLASS to TEXT?
co.conrelid::regclass::text AS fk_table,
co.confrelid::regclass::text AS pk_table,
Anyway, I made some of the choices about v_fk_cons to keep it human-readable and "useful for other stuff, maybe," to paraphrase an earlier comment of yours. That's why the explicit names, schemas, etc. There are some columns in the view that aren't even in use, that are just there for "completeness" or maybe "symmetry."
=== _recursively_delete
===
_recursively_delete, is the recursive bit, yes; it calls itself, and the main function calls it. Another crap name. I couldn't come up with a better one, so I fell back to an old naming kludge I use, figuring a better name would occur to me later (it never does).
Yes, it traverses v_fk_cons to build the "flat graph." PL/pgSQL isn't super-friendly for building tree objects (I could only bend JSONB so far), so I built the graph as an array where each row is a node, in visitation order.
The pulling things in and out of JSON structures -- and most of the other gymnastics you see -- are me trying procedurally to generate one single, clean SQL statement to accomplish recursive deletion (using the paradigms with which I'm most familiar and comfortable). I'm sure there are way better ways to do lots of the things I did. I do feel that the simplicity of the final query is ultimately more important than the simplicity of the code that creates it. But anything that simplifies the code without breaking the query sounds good to me.
Normally I like to have functions advertised as "recursive" be self-contained. Initially I tried for that in this case. But in the end the signatures of the main and recursive calls were too different; and there was value for my brain in separating the "straight-shot" stuff from the recursive stuff.
The ARGs are given in an order that I felt made decent sense in terms of interface. The declared VARs are alphabetized.
I... don't get that bit. You may be very right, and that's a better approach, I don't know. I don't know what that would even look like. You'll have to show me what you're thinking.
I wrote it using procedural recursion because that's a comfortable area for me.
=== recursively_delete
===
for_realz didn't start out defaulting false! I changed that pretty quick, though, after I stabbed myself with it a couple times.
I'll certainly document the declare section (again, alphabetized). It's big, yep. I don't think there's a lot of fat to trim there, though, unless there are some pretty drastic changes.
I'm not sure how you'd do that. Most of the PKs are already there in v_fk_cons -- they're just denormalized on their associated FKs (with each distinct FK serving as the "primary key" of the view).
I think the way to go is probably to normalize the v_fk_cons view, factoring out the PK/UK stuff to a new v_pk_cons view. What do you think?
I don't completely follow, but I think I like the sound of what you suggest. See above, though. I don't think v_fk_cons will help much, but a v_pk_cons might.
Not sure what you mean. Are you talking about the 'returning *' bit? If so, look at this (abridged) final query:
WITH
"del_0$ROOT" AS (
DELETE FROM users WHERE (id) IN (
WITH RECURSIVE
self_ref (id) AS (
SELECT id FROM users WHERE (id) IN (4402)
UNION
SELECT NULL
)
SELECT id FROM self_ref
) RETURNING *
)
,
"del_1$11087223" AS (
DELETE FROM ad_submissions WHERE (id) IN (
WITH RECURSIVE
self_ref (id) AS (
SELECT id FROM ad_submissions WHERE (user_id) IN (SELECT id FROM "del_0$ROOT")
UNION
SELECT NULL
)
SELECT id FROM self_ref
) RETURNING *
)
,
"del_2$11087213" AS (
DELETE FROM ad_leads WHERE (id) IN (
WITH RECURSIVE
self_ref (id) AS (
SELECT id FROM ad_leads WHERE (ad_submission_id) IN (SELECT id FROM "del_1$11087223")
UNION
SELECT NULL
)
SELECT id FROM self_ref
) RETURNING *
)
,
-- ...
SELECT '0' AS queue_i, count(*) AS n_del FROM "del_0$ROOT"
UNION
SELECT '1' AS queue_i, count(*) AS n_del FROM "del_1$11087223"
UNION
SELECT '2' AS queue_i, count(*) AS n_del FROM "del_2$11087213"
UNION
-- ...
Each individual CTE aux statement is performing a delete; 'RETURNING *' allows for the unioned select at the end that gathers up deletion counts for the preview (and, yep, we don't need those for for_realz calls). A benefit of doing it this way is that an individual deletion is only tabulated once, by the first FK constraint to touch the row in the graph; if the row happens also to be affected by another FK constraint later in the graph, it won't be counted again.
I'm aware of (but no expert on) the benefits of EXISTS vs IN -- something I learned in my Oracle days, may they never return. But I thought the main benefit was with NOT EXISTS vs NOT IN? Anyway, I don't recall why I went with IN in those aux statements, but I do recall really fighting to find syntax that worked. If there's a performance win to be had, we should definitely look into EXISTS there.
And the UNION ALL -- yep, good suggestion. Totally missed that one.
Simply replacing the 'RETURNING ' with 'RETURNING \
There are bits of the code whose purpose is...not immediately obvious upon inspection, which makes debugging them, writing tests of their functionality, etc., challenging.
Also, it'd be nice to have
COMMENT ON
annotations for each column in the view.Speaking of the view, Hungarian notation kinda when out with the '90s, so I'd like to send a PR fixing that, if you're OK with it.