turbot / steampipe-postgres-fdw

The Steampipe foreign data wrapper (FDW) is a zero-ETL product that provides Postgres foreign tables which translate queries into API calls to cloud services and APIs. It's bundled with Steampipe and also available as a set of standalone extensions for use in your own Postgres database.
https://steampipe.io/
Apache License 2.0
71 stars 17 forks source link

Failing to apply where condition when joining `aws_ecr_image` #435

Open gabe-gfm opened 8 months ago

gabe-gfm commented 8 months ago

Describe the bug I'm seeing unexpected results using a WHERE clause on the aws_ecr_image table. All rows from a JOIN condition are being returned even though the query specifies most rows should be excluded by a WHERE condition.

This reproduction case involves joining a second table to provide a target label string for evaluating whether the target is in the JSONB list of aws_ecr_image.image_tags.

Insightfully, I found the problem does not occur qhen querying local-table-copy of the aws_ecr_image table. This leads me to suspect it's a foreign-data-wrapper bug. But I'm not confident in which project the bug exists ( steampipe-postgres-fdw, steampipe-plugin-aws, or steampipe?).

Steampipe version v0.21.4

Plugin version (steampipe plugin list) turbot/aws 0.132.0

To reproduce

  1. Have AWS ECR image data in aws_ecr_image
  2. Join a second table on ecr.repository_name` to provide data for specific repository and image_tag sets.
  3. Test for presence of specific image_tag in image_tags using the >@ JSONB inclusion test.

    @>: Checks if a JSON document contains another JSON document.

Example query

with id_tag_repo as -- DEBUG: this is a stand-in for AWS container task definitions
(
  select 1 as id, to_jsonb('ZZZ'::text) as image_tag, 'my-repo-A' as image_repository
  union
  select 2 as id, '"latest"'::jsonb as image_tag, 'my-repo-B' as image_repository
  union
  select 3 as id, '"latest"'::jsonb as image_tag, 'my-repo-A' as image_repository
)
select 
  id_tag_repo.image_tag as target_tag
  , id
  , ecr.image_tags @> image_tag as is_tag_in_tags -- DEBUG: Matches where condition. Expect to see only true rows
  , repository_name
  , image_tags
from aws_ecr_image ecr
-- from ecr_copy ecr -- DEBUG
join id_tag_repo
  on ecr.repository_name = id_tag_repo.image_repository
where
  ecr.image_tags @> image_tag
  1. Note the results of is_tag_in_tags column may show true, false, null, but the where condition should limit it to only true.

Buggy Results: image

Expected behavior The expected behavior can be easily demonstrated by running the same query on a local copy of the a foreign-data-wrapper table.

  1. Copy to a local table
    --drop table ecr_copy
    create table ecr_copy as select * from aws_ecr_image
  2. Re-running the above query, but targeting from ecr_copy rather than from aws_ecr_image
  3. Notice that only results where is_tag_in_tags column shows true.

Expected results: image

Additional context I found it difficult to discover how I might locally rebuild steampipe-ftw or the project's postgres container to investigate if this could be a known postgres bug.

bigdatasourav commented 8 months ago

@kaidaguerre, Could you please look into this issue?

kaidaguerre commented 7 months ago

Sorry for the delay @gabe-gfm , sure we'll look into this 👍

gabe-gfm commented 7 months ago

Perhaps this is related? https://github.com/turbot/steampipe-postgres-fdw/pull/440

gabe-gfm commented 7 months ago

@jreyesr @kaidaguerre

Unfortunately, this was not resolved using the most aws plugin + postgres foreign data wrapper (us-docker.pkg.dev/steampipe/steampipe/fdw:1.11.0-rc.6). It still returns all rows when attempting to filter for a list of values inside the jsonb image_tags array.

I did notice a particular log surface in my SQL client (not steampipe query):

2024-04-29 17:22:28 fdwGetForeignPlan - best_path->fdw_private is NULL. Defaulting to setting canPushdownAllSortFields to true

From https://github.com/turbot/steampipe-postgres-fdw/blob/41cf90e604f499f43723a01fc64e640b76d32bab/fdw/fdw.c#L328

gabe-gfm commented 7 months ago

I identified a better workaround for anyone else encountering this.

You may specify MATERIALIZED (docs) in a Postgres query to prevent the demonstrated incorrect query filter results.

MATERIALIZED is an optimization hint resulting in explicit fetching of a common table express (CTE) result without trying to push down potential optimizations into the query plan.

This example query gave me correctly filtered results.

-- id_tag_repo is a mock for identifying interesting AWS container task definitions linked to ECR images
with id_tag_repo as  
(
  select 1 as id, to_jsonb('ZZZ'::text) as image_tag, 'my-repo-A' as image_repository
  union
  select 2 as id, '"latest"'::jsonb as image_tag, 'my-repo-B' as image_repository
  union
  select 3 as id, '"latest"'::jsonb as image_tag, 'my-repo-A' as image_repository
), 
ecr_info as MATERIALIZED -- MATERIALIZED used to work around steampipe-fdw bug 
(
  select repository_name, image_tags
  from  aws_ecr_image
)
select 
  id_tag_repo.image_tag as target_tag
  , id
  , ecr.image_tags @> image_tag as is_tag_in_tags -- DEBUG: Matches where condition. Expect to see only true rows
  , repository_name
  , image_tags
from ecr_info ecr
join id_tag_repo
  on ecr.repository_name = id_tag_repo.image_repository
where
  ecr.image_tags @> image_tag
kaidaguerre commented 6 months ago

@gabe-gfm good that you've found a workaround. I need to dig into why the JSON join condition is causing this issue