turbot / steampipe

Zero-ETL, infinite possibilities. Live query APIs, code & more with SQL. No DB required.
https://steampipe.io
GNU Affero General Public License v3.0
6.67k stars 262 forks source link

Rows fetched and Hydrate calls counts are incorrect when using joins #4237

Closed cbruno10 closed 2 months ago

cbruno10 commented 3 months ago

For queries that use joins, the counts incorrect and inconsistent, even if the results are correct:

cbruno@M1P aws % steampipe query
Welcome to Steampipe v0.22.1
For more information, type .help
> select
  i.instance_id,
  i.vpc_id,
  i.subnet_id,
  s.tags ->> 'Name' as subnet_name
from
  aws_ec2_instance as i,
  aws_vpc_subnet as s
where
  i.subnet_id = s.subnet_id;
+---------------------+--------------+-----------------+-------------+
| instance_id         | vpc_id       | subnet_id       | subnet_name |
+---------------------+--------------+-----------------+-------------+
| i-0a27d06a62e979b7a | vpc-d536adaf | subnet-8bc6d8c1 | <null>      |
| i-0bd0fba70e1c158c9 | vpc-d536adaf | subnet-049cd758 | <null>      |
| i-076c621c9eafe3bd8 | vpc-d536adaf | subnet-049cd758 | <null>      |
| i-06e0de1b8436a1f50 | vpc-d536adaf | subnet-049cd758 | <null>      |
| i-0ca989faeabbb47c9 | vpc-d536adaf | subnet-ef6237c1 | <null>      |
| i-0e31039a58e5b9623 | vpc-d536adaf | subnet-ef6237c1 | <null>      |
| i-00a8f549d22319d8e | vpc-d536adaf | subnet-ef6237c1 | <null>      |
+---------------------+--------------+-----------------+-------------+

Time: 4.0s. Rows fetched: 1. Hydrate calls: 0.
> select
  i.instance_id,
  i.vpc_id,
  i.subnet_id,
  s.tags ->> 'Name' as subnet_name
from
  aws_ec2_instance as i,
  aws_vpc_subnet as s
where
  i.subnet_id = s.subnet_id;
+---------------------+--------------+-----------------+-------------+
| instance_id         | vpc_id       | subnet_id       | subnet_name |
+---------------------+--------------+-----------------+-------------+
| i-0a27d06a62e979b7a | vpc-d536adaf | subnet-8bc6d8c1 | <null>      |
| i-0bd0fba70e1c158c9 | vpc-d536adaf | subnet-049cd758 | <null>      |
| i-076c621c9eafe3bd8 | vpc-d536adaf | subnet-049cd758 | <null>      |
| i-06e0de1b8436a1f50 | vpc-d536adaf | subnet-049cd758 | <null>      |
| i-0ca989faeabbb47c9 | vpc-d536adaf | subnet-ef6237c1 | <null>      |
| i-0e31039a58e5b9623 | vpc-d536adaf | subnet-ef6237c1 | <null>      |
| i-00a8f549d22319d8e | vpc-d536adaf | subnet-ef6237c1 | <null>      |
+---------------------+--------------+-----------------+-------------+

Time: 1.4s. Rows fetched: 3. Hydrate calls: 0.
> select
  i.instance_id,
  i.vpc_id,
  i.subnet_id,
  s.tags ->> 'Name' as subnet_name
from
  aws_ec2_instance as i,
  aws_vpc_subnet as s
where
  i.subnet_id = s.subnet_id;
+---------------------+--------------+-----------------+-------------+
| instance_id         | vpc_id       | subnet_id       | subnet_name |
+---------------------+--------------+-----------------+-------------+
| i-0a27d06a62e979b7a | vpc-d536adaf | subnet-8bc6d8c1 | <null>      |
| i-0bd0fba70e1c158c9 | vpc-d536adaf | subnet-049cd758 | <null>      |
| i-076c621c9eafe3bd8 | vpc-d536adaf | subnet-049cd758 | <null>      |
| i-06e0de1b8436a1f50 | vpc-d536adaf | subnet-049cd758 | <null>      |
| i-0ca989faeabbb47c9 | vpc-d536adaf | subnet-ef6237c1 | <null>      |
| i-0e31039a58e5b9623 | vpc-d536adaf | subnet-ef6237c1 | <null>      |
| i-00a8f549d22319d8e | vpc-d536adaf | subnet-ef6237c1 | <null>      |
+---------------------+--------------+-----------------+-------------+

Time: 1.4s. Rows fetched: 0. Hydrate calls: 0.
> select
  i.instance_id,
  i.vpc_id,
  i.subnet_id,
  s.tags ->> 'Name' as subnet_name
from
  aws_ec2_instance as i,
  aws_vpc_subnet as s
where
  i.subnet_id = s.subnet_id;
+---------------------+--------------+-----------------+-------------+
| instance_id         | vpc_id       | subnet_id       | subnet_name |
+---------------------+--------------+-----------------+-------------+
| i-0a27d06a62e979b7a | vpc-d536adaf | subnet-8bc6d8c1 | <null>      |
| i-0bd0fba70e1c158c9 | vpc-d536adaf | subnet-049cd758 | <null>      |
| i-076c621c9eafe3bd8 | vpc-d536adaf | subnet-049cd758 | <null>      |
| i-06e0de1b8436a1f50 | vpc-d536adaf | subnet-049cd758 | <null>      |
| i-0ca989faeabbb47c9 | vpc-d536adaf | subnet-ef6237c1 | <null>      |
| i-0e31039a58e5b9623 | vpc-d536adaf | subnet-ef6237c1 | <null>      |
| i-00a8f549d22319d8e | vpc-d536adaf | subnet-ef6237c1 | <null>      |
+---------------------+--------------+-----------------+-------------+

Time: 1.5s. Rows fetched: 3. Hydrate calls: 0.

If I run queries after these queries that don't have joins, their results are also incorrect then (but not if I run them beforehand):

> select name from azure_cody_aaa.azure_resource_group
+------------------+
| name             |
+------------------+
| dev              |
| NetworkWatcherRG |
| turbot_rg        |
+------------------+

Time: 12ms. Rows fetched: 0. Hydrate calls: 0.
kaidaguerre commented 3 months ago

@cbruno10 @e-gineer I've found the issue - the CLI code to read the timing data does not take into account multiple scans.

tl;dr - I've found the bug but need some guidance on the correct way to display timing info for multi-scan queries. Please see next comment.

Details

For each scan that is executed, the FDW stores ScanMetadata in the Hub.

type ScanMetadata struct {
    Id           int
    Table        string
    CacheHit     bool
    RowsFetched  int64
    HydrateCalls int64
    Columns      []string
    Quals        map[string]*proto.Quals
    Limit        int64
    StartTime    time.Time
    Duration     time.Duration
}

This is then used to populate steampipe_internal.steampipe_scan_metadata foreign table.:

> select * from steampipe_internal.steampipe_scan_metadata 
+-----+------------------+-----------+--------------+---------------+---------------------------+----------+--------------------------------------+-------+---------------------------------------+
| id  | table            | cache_hit | rows_fetched | hydrate_calls | start_time                | duration | columns                              | limit | quals                                 |
+-----+------------------+-----------+--------------+---------------+---------------------------+----------+--------------------------------------+-------+---------------------------------------+
| 191 | aws_ec2_instance | false     | 1            | 0             | 2024-04-04T09:29:52+01:00 | 439      | ["instance_id","vpc_id","subnet_id"] | 0     | [                                     |
|     |                  |           |              |               |                           |          |                                      |       |  {                                    |
|     |                  |           |              |               |                           |          |                                      |       |   "column": "subnet_id",              |
|     |                  |           |              |               |                           |          |                                      |       |   "operator": "=",                    |
|     |                  |           |              |               |                           |          |                                      |       |   "value": "subnet-0a2c499fc37a6c1fe" |
|     |                  |           |              |               |                           |          |                                      |       |  }                                    |
|     |                  |           |              |               |                           |          |                                      |       | ]                                     |
|     |                  |           |              |               |                           |          |                                      |       |                                       |
| 192 | aws_ec2_instance | false     | 0            | 0             | 2024-04-04T09:29:53+01:00 | 433      | ["instance_id","vpc_id","subnet_id"] | 0     | [                                     |
|     |                  |           |              |               |                           |          |                                      |       |  {                                    |

The CLI keeps track of the last row that was read from this table and reads all rows since the previously read row. (i.e. it reads the scan metadata for all scans in the current query). However at present it only takes a single row from the result and uses that to display the timing.

So, in the case of a join or other multi-scan the timing being displayed is the result of just one of the scans, and it therefore incorrect.

It is simple to update the code to include all scan metadata, but not so simple knowing how to combine this data to give useful output. (see next comment)

kaidaguerre commented 3 months ago

The currently logic for building the timing output from the scan metadata is as follows:

The displayed timing result is stored in a TimingMetadata struct:

type TimingMetadata struct {
    RowsFetched       int64
    CachedRowsFetched int64
    HydrateCalls      int64
}

This is populated from the steampipe_scan_metadata data as follows:

    timingResult.Metadata = &queryresult.TimingMetadata{}
    timingResult.Metadata.HydrateCalls += scanRow.HydrateCalls
    if scanRow.CacheHit {
        timingResult.Metadata.CachedRowsFetched += scanRow.RowsFetched
    } else {
        timingResult.Metadata.RowsFetched += scanRow.RowsFetched
    }

IOW, either CachedRowsFetched or RowsFetched is set, depending on whether the scan had a cache hit.

With multiple scans there are a couple of problems with this:

1) Incorrect row count

If we sum the RowsFetched for each scan, it is likely we will have a number greater than the rows returned by the query (as there may be joining/filtering)

We could avoid this by setting RowsFetched to simply be the actual rows returned by the query.

This causes issues with the following point:

2) How to handle mix of cache hits and cache misses

If some scans have a cache hit and some do not, do we count the result rows as cached or not?

kaidaguerre commented 3 months ago

I can see 3 options here:

Option 1

Simply sum the rows returned by each scan. This may result in a higher row count that the actual result rows, but would allow granular reporting on cached and uncached rows

Option 2

Report the muber of rows actually returned, and only mark the rows as cached if all scans had a cache hit

Option 3

Expand the timing display to include more detailed information about each scan - perhaps have a verbose mode for the . timing option

Time: 158ms. Rows fetched: 36. Hydrate calls: 10.

Scans:
  1) Table: aws_ec2_instance. Time: 78ms. Rows fetched: 36. Hydrate calls: 10.  Quals: subnet_id = "subnet-a2c499fc37a6c1fe"

  2) Table: aws_vpc_subnet. Time: 90ms. Rows fetched: 56. Hydrate calls: 0.  

(numbers made up)

I quite like option 3