influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0
28.28k stars 3.52k forks source link

Support system tables #24972

Open hiltontj opened 2 months ago

hiltontj commented 2 months ago

This is a parent issue for adding support for system tables to influxdb3.

This is to provide system level debug information via queries such as those available in distributed versions of influxdb3:

system.tables and system.partitions, which are included in distributed version of influxdb3, are not relevant in monolith mainly because there is no concept of partitions here.

It is worth noting that the SQL standard information_schema tables are already supported via DataFusion and can be accessed with queries like SHOW TABLES and SHOW COLUMNS FROM <table_name> in influxdb3 today.

### Tasks
- [x] Move traits such as `IoxSystemTable` and `SystemTableProvider` and any other shareable code from `iox` into `influxdb3_core` (https://github.com/influxdata/influxdb3_core/pull/18)
- [ ] https://github.com/influxdata/influxdb/issues/24986
- [ ] https://github.com/influxdata/influxdb/issues/24987
- [ ] https://github.com/influxdata/influxdb/issues/24989
- [ ] https://github.com/influxdata/influxdb/issues/24988
pauldix commented 2 months ago

Partitions doesn't make sense to bring over as OSS doesn't use the same scheme. We'll want to have a system table for listing parquet files I think.

hiltontj commented 2 months ago

I was wondering about that since partitions are really a distributed thing. It looks like we do fabricate a partition ID, but I guess that is mostly to satisfy the QueryChunk trait?

Either way, as long as there is no compatibility issues, I suppose we could have a system.parquet_files provider in lieu of system.partitions and system.tables (which seems useless in absence of system.partitions)

pauldix commented 2 months ago

Yeah, the partition ID is just a stub to satisfy the trait.

For the system.parquet_files table, I think it would be good to limit the scope of what the user can query. Ideally, it would require a database name and table name so that only the parquet files for a single table would be returned. Otherwise, there's too much potential to return a massive amount of stuff. I'm not sure if there's a precedent for that anywhere but @alamb might have some thoughts on what would make sense for this.

alamb commented 2 months ago

For the system.parquet_files table, I think it would be good to limit the scope of what the user can query. Ideally, it would require a database name and table name so that only the parquet files for a single table would be returned. Otherwise, there's too much potential to return a massive amount of stuff. I'm not sure if there's a precedent for that anywhere but @alamb might have some thoughts on what would make sense for this.

Distributed doesn't enable system tables unless you set a special debug flag to avoid the potential issue of returning a massive amount of stuff.

One thing you could do is basically require a table_name = 'foo' and a databse_name = 'foo' predicate to be present

That would be implemented by

  1. looking at the _filter Exprs passed to the TableProvider::scan https://github.com/influxdata/influxdb_iox/blob/ca887899c73b57b710b371da1ca79fafced1831b/querier/src/system_tables/mod.rs#L145-L150
  2. Looking for those special predicates (special casing BinaryExpr)
  3. Erroring if they can't be cound with a message like "you must have a table_name='foo' predicate to query parquet_files" or something
pauldix commented 2 months ago

@alamb I'm looking at the system tables as an API for users to find out information like schema, the parquet files that exist, and other information about the data itself. So it's ultimately something I'd like in both Monolith and Distributed for consistency. Are system tables the right place to put this? We could have a REST API to get at this information instead, but I thought that just having it in the query language might be easier for users.

alamb commented 2 months ago

@alamb I'm looking at the system tables as an API for users to find out information like schema, the parquet files that exist, and other information about the data itself. So it's ultimately something I'd like in both Monolith and Distributed for consistency.

Makes sense to me

Are system tables the right place to put this? We could have a REST API to get at this information instead, but I thought that just having it in the query language might be easier for users.

I think there are different opinions on this topic

I personally like system tables as then you have immediate filtering/joining/aggregation via SQL without any additional tools

However, I believe others prefer REST APIs so that it is easier to use other tools for analysis.

hiltontj commented 2 months ago

Having it in the query language would expose it via the /query API, which may not be as convenient as dedicated endpoints, but is still workable. We could then hook up dedicated REST endpoints if people desire them.

I guess one issue would be if the info we want returned is not tabular, then representing it in a SQL response may be problematic. In which case, REST would be better to begin with. That is not the case with the current system tables in iox though, and I don't think would be for listing out parquet file info.

alamb commented 2 months ago

I guess one issue would be if the info we want returned is not tabular, then representing it in a SQL response may be problematic. In which case, REST would be better to begin with.

That is a good point -- DataFusion / Rust can handle structured types (e.g StructArray, MapArray and ListArray) but the support is definitely not a full featured as it could be

hiltontj commented 2 months ago

To add some detail, the SHOW TABLES and SHOW COLUMNS queries, which are already supported, provide the following:

influxdb3 query -d foo 'SHOW TABLES'
+---------------+--------------------+-------------+------------+
| table_catalog | table_schema       | table_name  | table_type |
+---------------+--------------------+-------------+------------+
| public        | iox                | cpu         | BASE TABLE |
| public        | iox                | mem         | BASE TABLE |
| public        | information_schema | tables      | VIEW       |
| public        | information_schema | views       | VIEW       |
| public        | information_schema | columns     | VIEW       |
| public        | information_schema | df_settings | VIEW       |
| public        | information_schema | schemata    | VIEW       |
+---------------+--------------------+-------------+------------+
influxdb3 query -d foo 'SHOW COLUMNS FROM cpu'
+---------------+--------------+------------+-------------+-----------------------------+-------------+
| table_catalog | table_schema | table_name | column_name | data_type                   | is_nullable |
+---------------+--------------+------------+-------------+-----------------------------+-------------+
| public        | iox          | cpu        | host        | Dictionary(Int32, Utf8)     | YES         |
| public        | iox          | cpu        | time        | Timestamp(Nanosecond, None) | NO          |
| public        | iox          | cpu        | usage       | Float64                     | YES         |
+---------------+--------------+------------+-------------+-----------------------------+-------------+
hiltontj commented 2 months ago

DataFusion / Rust can handle structured types

Right, well in that case, I suppose we are covered there - structured type support will be improved in DF if we need it 😃

I would prefer to make it available via the query language first, because as @alamb mentioned it gives users a fair bit of capability out of the box. I think it is also easy for us to extend, and easy for users to adopt extensions, as they just write different SQL queries.

alamb commented 2 months ago

Right, well in that case, I suppose we are covered there - structured type support will be improved in DF if we need it 😃

❤️

hiltontj commented 2 months ago

There are PRs up to resolve all sub-issues above, I'll summarize them here, as it would be easier to review them in the following sequence: