Closed cbruno10 closed 2 months ago
@cbruno10 @e-gineer I've found the issue - the CLI code to read the timing data does not take into account multiple scans.
tl;dr - I've found the bug but need some guidance on the correct way to display timing info for multi-scan queries. Please see next comment.
For each scan that is executed, the FDW stores ScanMetadata
in the Hub
.
type ScanMetadata struct {
Id int
Table string
CacheHit bool
RowsFetched int64
HydrateCalls int64
Columns []string
Quals map[string]*proto.Quals
Limit int64
StartTime time.Time
Duration time.Duration
}
This is then used to populate steampipe_internal.steampipe_scan_metadata
foreign table.:
> select * from steampipe_internal.steampipe_scan_metadata
+-----+------------------+-----------+--------------+---------------+---------------------------+----------+--------------------------------------+-------+---------------------------------------+
| id | table | cache_hit | rows_fetched | hydrate_calls | start_time | duration | columns | limit | quals |
+-----+------------------+-----------+--------------+---------------+---------------------------+----------+--------------------------------------+-------+---------------------------------------+
| 191 | aws_ec2_instance | false | 1 | 0 | 2024-04-04T09:29:52+01:00 | 439 | ["instance_id","vpc_id","subnet_id"] | 0 | [ |
| | | | | | | | | | { |
| | | | | | | | | | "column": "subnet_id", |
| | | | | | | | | | "operator": "=", |
| | | | | | | | | | "value": "subnet-0a2c499fc37a6c1fe" |
| | | | | | | | | | } |
| | | | | | | | | | ] |
| | | | | | | | | | |
| 192 | aws_ec2_instance | false | 0 | 0 | 2024-04-04T09:29:53+01:00 | 433 | ["instance_id","vpc_id","subnet_id"] | 0 | [ |
| | | | | | | | | | { |
The CLI keeps track of the last row that was read from this table and reads all rows since the previously read row. (i.e. it reads the scan metadata for all scans in the current query). However at present it only takes a single row from the result and uses that to display the timing.
So, in the case of a join or other multi-scan the timing being displayed is the result of just one of the scans, and it therefore incorrect.
It is simple to update the code to include all scan metadata, but not so simple knowing how to combine this data to give useful output. (see next comment)
The currently logic for building the timing output from the scan metadata is as follows:
The displayed timing result is stored in a TimingMetadata
struct:
type TimingMetadata struct {
RowsFetched int64
CachedRowsFetched int64
HydrateCalls int64
}
This is populated from the steampipe_scan_metadata
data as follows:
timingResult.Metadata = &queryresult.TimingMetadata{}
timingResult.Metadata.HydrateCalls += scanRow.HydrateCalls
if scanRow.CacheHit {
timingResult.Metadata.CachedRowsFetched += scanRow.RowsFetched
} else {
timingResult.Metadata.RowsFetched += scanRow.RowsFetched
}
IOW, either CachedRowsFetched
or RowsFetched
is set, depending on whether the scan had a cache hit.
With multiple scans there are a couple of problems with this:
If we sum the RowsFetched
for each scan, it is likely we will have a number greater than the rows returned by the query (as there may be joining/filtering)
We could avoid this by setting RowsFetched to simply be the actual rows returned by the query.
This causes issues with the following point:
If some scans have a cache hit and some do not, do we count the result rows as cached or not?
I can see 3 options here:
Simply sum the rows returned by each scan. This may result in a higher row count that the actual result rows, but would allow granular reporting on cached and uncached rows
Report the muber of rows actually returned, and only mark the rows as cached if all scans had a cache hit
Expand the timing display to include more detailed information about each scan - perhaps have a verbose
mode for the . timing
option
Time: 158ms. Rows fetched: 36. Hydrate calls: 10.
Scans:
1) Table: aws_ec2_instance. Time: 78ms. Rows fetched: 36. Hydrate calls: 10. Quals: subnet_id = "subnet-a2c499fc37a6c1fe"
2) Table: aws_vpc_subnet. Time: 90ms. Rows fetched: 56. Hydrate calls: 0.
(numbers made up)
I quite like option 3
For queries that use joins, the counts incorrect and inconsistent, even if the results are correct:
If I run queries after these queries that don't have joins, their results are also incorrect then (but not if I run them beforehand):