Open hiltontj opened 2 months ago
Partitions doesn't make sense to bring over as OSS doesn't use the same scheme. We'll want to have a system table for listing parquet files I think.
I was wondering about that since partitions are really a distributed thing. It looks like we do fabricate a partition ID, but I guess that is mostly to satisfy the QueryChunk
trait?
Either way, as long as there is no compatibility issues, I suppose we could have a system.parquet_files
provider in lieu of system.partitions
and system.tables
(which seems useless in absence of system.partitions
)
Yeah, the partition ID is just a stub to satisfy the trait.
For the system.parquet_files
table, I think it would be good to limit the scope of what the user can query. Ideally, it would require a database name and table name so that only the parquet files for a single table would be returned. Otherwise, there's too much potential to return a massive amount of stuff. I'm not sure if there's a precedent for that anywhere but @alamb might have some thoughts on what would make sense for this.
For the system.parquet_files table, I think it would be good to limit the scope of what the user can query. Ideally, it would require a database name and table name so that only the parquet files for a single table would be returned. Otherwise, there's too much potential to return a massive amount of stuff. I'm not sure if there's a precedent for that anywhere but @alamb might have some thoughts on what would make sense for this.
Distributed doesn't enable system tables unless you set a special debug flag to avoid the potential issue of returning a massive amount of stuff.
One thing you could do is basically require a table_name = 'foo'
and a databse_name = 'foo'
predicate to be present
That would be implemented by
_filter
Expr
s passed to the TableProvider::scan
https://github.com/influxdata/influxdb_iox/blob/ca887899c73b57b710b371da1ca79fafced1831b/querier/src/system_tables/mod.rs#L145-L150BinaryExpr
)@alamb I'm looking at the system tables as an API for users to find out information like schema, the parquet files that exist, and other information about the data itself. So it's ultimately something I'd like in both Monolith and Distributed for consistency. Are system tables the right place to put this? We could have a REST API to get at this information instead, but I thought that just having it in the query language might be easier for users.
@alamb I'm looking at the system tables as an API for users to find out information like schema, the parquet files that exist, and other information about the data itself. So it's ultimately something I'd like in both Monolith and Distributed for consistency.
Makes sense to me
Are system tables the right place to put this? We could have a REST API to get at this information instead, but I thought that just having it in the query language might be easier for users.
I think there are different opinions on this topic
I personally like system tables as then you have immediate filtering/joining/aggregation via SQL without any additional tools
However, I believe others prefer REST APIs so that it is easier to use other tools for analysis.
Having it in the query language would expose it via the /query API, which may not be as convenient as dedicated endpoints, but is still workable. We could then hook up dedicated REST endpoints if people desire them.
I guess one issue would be if the info we want returned is not tabular, then representing it in a SQL response may be problematic. In which case, REST would be better to begin with. That is not the case with the current system tables in iox though, and I don't think would be for listing out parquet file info.
I guess one issue would be if the info we want returned is not tabular, then representing it in a SQL response may be problematic. In which case, REST would be better to begin with.
That is a good point -- DataFusion / Rust can handle structured types (e.g StructArray, MapArray and ListArray) but the support is definitely not a full featured as it could be
To add some detail, the SHOW TABLES
and SHOW COLUMNS
queries, which are already supported, provide the following:
influxdb3 query -d foo 'SHOW TABLES'
+---------------+--------------------+-------------+------------+
| table_catalog | table_schema | table_name | table_type |
+---------------+--------------------+-------------+------------+
| public | iox | cpu | BASE TABLE |
| public | iox | mem | BASE TABLE |
| public | information_schema | tables | VIEW |
| public | information_schema | views | VIEW |
| public | information_schema | columns | VIEW |
| public | information_schema | df_settings | VIEW |
| public | information_schema | schemata | VIEW |
+---------------+--------------------+-------------+------------+
influxdb3 query -d foo 'SHOW COLUMNS FROM cpu'
+---------------+--------------+------------+-------------+-----------------------------+-------------+
| table_catalog | table_schema | table_name | column_name | data_type | is_nullable |
+---------------+--------------+------------+-------------+-----------------------------+-------------+
| public | iox | cpu | host | Dictionary(Int32, Utf8) | YES |
| public | iox | cpu | time | Timestamp(Nanosecond, None) | NO |
| public | iox | cpu | usage | Float64 | YES |
+---------------+--------------+------------+-------------+-----------------------------+-------------+
DataFusion / Rust can handle structured types
Right, well in that case, I suppose we are covered there - structured type support will be improved in DF if we need it 😃
I would prefer to make it available via the query language first, because as @alamb mentioned it gives users a fair bit of capability out of the box. I think it is also easy for us to extend, and easy for users to adopt extensions, as they just write different SQL queries.
Right, well in that case, I suppose we are covered there - structured type support will be improved in DF if we need it 😃
❤️
There are PRs up to resolve all sub-issues above, I'll summarize them here, as it would be easier to review them in the following sequence:
system._
tables to the query executor, and elevates the QueryLog
system.queries
table, which provides access via SQL queries to the QueryLog
on the query executorsystem.*
tables are available without the need for a debug
header/parameter in gRPC/HTTP requestssystem.parquet_files
table, which provides access via SQL queries to see parquet files associated with a given table/database
This is a parent issue for adding support for system tables to
influxdb3
.This is to provide system level debug information via queries such as those available in distributed versions of
influxdb3
:system.compactor
(only on pro): contains info on total number of LN files as well as their total size for each partitionsystem.queries
: contains a log of queries performed, including various pieces of info and state indicators for each querysystem.parquet_files
(new): list parquet files for a given database and tablesystem.tables
andsystem.partitions
, which are included in distributed version ofinfluxdb3
, are not relevant in monolith mainly because there is no concept of partitions here.It is worth noting that the SQL standard
information_schema
tables are already supported via DataFusion and can be accessed with queries likeSHOW TABLES
andSHOW COLUMNS FROM <table_name>
ininfluxdb3
today.