open-contracting / kingfisher-summarize

Creates SQL tables that summarize the OCDS data in collections from Kingfisher Process
https://kingfisher-summarize.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
3 stars 8 forks source link

Researching improvements to field-counts #159

Closed jpmckinney closed 3 years ago

jpmckinney commented 3 years ago

Some similar implementations:

https://github.com/fhirbase/master-class/blob/93692db5094e1ca56aba7c0adb671fa6104f4e3e/sql-result https://github.com/cajnoj/sql/blob/4d3b44ddb8009212abeba5e07da4a15f22348d03/PostgreSQL/JSON/traverse_jsonb.sql https://github.com/Anonim147/GoDAPP/blob/ce79df128850dc211681936ada3e1bcc0565f7e3/sql/recursive_with_array_solved.sql https://github.com/MaayanLab/signature-commons-metadata-api/blob/2f83022cb8d41f06caf8e14ab38d94393c8a37ec/src/migration/1568055725225-jsonb-deep-key-value.ts

jpmckinney commented 3 years ago

The style with INNER JOIN LITERAL from https://github.com/MaayanLab/signature-commons-metadata-api/blob/master/src/migration/1568055725225-jsonb-deep-key-value.ts is about 10% faster:

CREATE OR REPLACE FUNCTION flatten (jsonb)
    RETURNS TABLE (
        path text,
        object_property integer,
        array_item integer)
    LANGUAGE 'sql'
    PARALLEL SAFE
    AS $$

    -- https://www.postgresql.org/docs/current/queries-with.html
    WITH RECURSIVE t (
        KEY,
        value,
        object_property,
        array_item
    ) AS (
        SELECT
            j.key,
            j.value,
            1,
            0
        FROM
            jsonb_each($1) AS j
        UNION ALL (
            SELECT
                CASE tt.path_count
                WHEN 1 THEN
                    concat(t.key, '/', tt.key)
                ELSE
                    t.key
                END,
                tt.value,
                tt.path_count,
                tt.array_count
            FROM
                t
            INNER JOIN LATERAL (
                SELECT
                    ttt.key,
                    ttt.value,
                    1 path_count,
                    0 array_count
                FROM
                    jsonb_each(t.value) AS ttt
                WHERE
                    jsonb_typeof(t.value) = 'object'
                UNION ALL
                SELECT
                    '',
                    ttt.value,
                    0 path_count,
                    1 array_count
                FROM
                    jsonb_array_elements(t.value) AS ttt
                WHERE
                    jsonb_typeof(t.value) = 'array'
                    AND jsonb_typeof(t.value -> 0) = 'object'
            ) AS tt ON TRUE
        )
    )
SELECT
    KEY AS path,
    object_property,
    array_item
FROM
    t;

$$;
jpmckinney commented 3 years ago

In summary:

The fhirbase/master-class is fastest, which is not that surprising, as it's also the simplest (no subquery, no join).

jpmckinney commented 3 years ago

We can maybe reduce the number of iterations performed by WITH RECURSIVE by calling jsonb_each on values of jsonb_array_elements like in https://github.com/elixir-oslo/trackfind/blob/7559e5f9f517b44b93d76caa762424e8c6811444/src/main/resources/schema.sql#L170-L212, if we had a way to also track the number of array entries.

Update: Tried:

            SELECT
                prev.key || '/' || next.key,
                next.value,
                1,
                1
            FROM
                prev,
                jsonb_array_elements(prev.value) entry,
                jsonb_each(entry) next
            WHERE
                jsonb_typeof(prev.value) = 'array'
                AND jsonb_typeof(entry) = 'object'

But it's not possible to get results like:

collection_id,release_type,path,object_property,array_count,distinct_releases
1,release,awards,100,301,100
1,release,awards/amendment,301,0,100

if we try to accumulate object_property and array_count in the same iteration. The above yields:

1,release,awards,100,0,100
1,release,awards/amendment,301,301,100