Closed hiltontj closed 4 months ago
We can return it as a string like you have it stored right now. So store as bytes, but return as string?
So store as bytes, but return as string?
That would keep it simple.
Datafusion has a scalar function for encoding binary data. So, I guess one could do
SELECT encode(_series_id, hex) FROM measurement;
If we didn't want to do the encoding for them.
Is the idea, though, to only return _series_id
if it is explicitly queried?
i.e., in a SELECT *
, it wouldn't be returned...
Yeah, I think only returning the _series_id
column if it is explicitly queried is maybe the better user experience. Generally they don't care what that is or that it exists, it's an optimization to make sorting/dedupe faster and to enable us to have indexes later on. So it's really for internal use.
I think we would need to augment the SQL/InfluxQL planner code to omit the _series_id
column in the projection, unless specified in the SELECT clause. @mgattozzi - you may gain some insight on how to do this while working on #24822
Cross posting slack discussion: https://influxdata.slack.com/archives/C04PASKL88P/p1712322573281069
Cross posting slack discussion: https://influxdata.slack.com/archives/C04PASKL88P/p1712322573281069
Since that discussion is private, I will summarize here:
We considered the alternative of using a u64
for the _series_id
column, instead of storing a byte array, i.e., [u8; 32]
, with a SHA256 hash.
A u64
could provide performance gains during ingest/query, because it takes up less space, is faster to compute, and would likely be faster for lookups/indexing. However, it would not be advantageous during persistence, as we would not be able to rely on it for de-duplication due to the possibility of collisions.
We will proceed with the [u8; 32]
, and see if we can incorporate the use of FixedSizeBinary
as a column type to store the _series_id
.
TableBuffer
with the capability to support FixedSizeBinary
column dataIt is trivial to add a new data type to the TableBuffer
, by extending its Builder
enum with a FixedSizeBinary
builder, and refactoring the necessary code. This code will compile, and will allow us to write; however, this will lead to run-time panics when the TableBuffer
attempts to produce RecordBatch
es when queried.
Here is the call stack for the culprit of the panic, TableBuffer::record_batches
, which going all the way up, is called from the DataFusion TableProvider
implementation of QueryTable
during physical query execution:
The issue we run into is with the Schema
type from influxdb3_core::schema
crate. TableBuffer::record_batches
pulls out the inner arrow schema, takes the column data written, and produces arrow RecordBatch
es using built-in arrow functionality. The Schema
definition, which represents the InfluxDB v2 data model, does not have the _series_id
column supported, and therefore, the creation of RecordBatch
es panics, due to a mismatch in the new column type being written and what is in the schema.
Edit: see below.
The options I am considering at this point:
InfluxColumnType
) in the schema
crate with a SeriesId
schema
TableBuffer
can inject the _series_id
column to the schema when producing RecordBatch
esI did spend some time hacking on (1.), and I don't think this is a good direction. The issue with (1.) is that it has many downstream implications in iox
when we extend the data model; some of the required code changes are non-trivial and controversial:
primary_key
functionality of Schema
, which would create big problems in iox
.mutable_batch
and arrow_util
crates for supporting a new data type would be challenging to implement while not being relevant to us in influxdb3
, and would therefore be introducing changes to iox
for no immediate reason.It may be possible that we extend the data model with a new field type to support the fixed size binary array. It would just be a field, though, so I don't know that this gives us much improvement over what we are doing now by storing it as a String
field.
I need to better understand how deeply embedded the schema::Schema
is in the querying side of things to decide if (2.) or (3.) are worth considering. Specifically:
Schema
provide other than the ability to produce record batches as it is doing for the TableBuffer
?schema::Schema
with something else hinder our ability to rely on iox_query
for the SQL/InfluxQL query planning code that we currently rely on?I think option #3 should be easy enough to implement. Have a look at my test indexing branch, I produce a record batch with a projected schema here: https://github.com/influxdata/influxdb/blob/pd/buffer-indexed/influxdb3_write/src/write_buffer/table_buffer.rs#L223-L224
You could inject whatever schema you want when you're putting together the record batch.
The only problem with _series_id being a field is that we want any deduplication to use that and time, not the tags. But that change would absolutely require big changes to iox_query to work that way. So in advance of that maybe using a field for _series_id would work. It's possible to get the query engine to not dedupe by having the buffer chunk return false here: https://github.com/influxdata/influxdb/blob/8f59f935c5961297d9cffc3a18f2bd8d78b36f44/influxdb3_write/src/chunk.rs#L44-L46
@pauldix - there is something I overlooked: although it is not too hard to add support to our TableBuffer
for the FixedSizeBinary
data type to allow writes, the reason for the panic when producing record batches is because I am still initializing the column for _series_id
as a String
. That happens here: https://github.com/influxdata/influxdb/blob/8f59f935c5961297d9cffc3a18f2bd8d78b36f44/influxdb3_write/src/write_buffer/mod.rs#L547
However, because there is no ColumnType
, i.e., InfluxColumnType
, for the byte array, we don't have a way to initialize it correctly. Adding it as a field might be a shortcut solution then.
Either way, I will likely spend a bit more time understanding the current Schema
, but will then try out the schema projection approach you linked there, and see where I can get with that before trying to extend the InfluxColumnType
again.
Yeah, you wouldn't do a projection, but just construct a new ArrowSchema struct and insert the series id column into it.
Extend the data model (InfluxColumnType) in the schema crate with a SeriesId
I agree this is messy and will have many non trivial implications, but I think that is inevitable if you change how the data model works.
In other words, I don't think it is just how the code is structured -- there are deep assumptions about the data model (in InfluxQL for example, and the deduplication)
if the goal is to hack something working I think option 3 ("See if the TableBuffer can inject the _series_id column to the schema when producing RecordBatches") is a reasonable idea
I have been doing some experimentation on (3.). Here is a commit that: 1) adds support for the SeriesId
/FixedSizeBinary
into the TableBuffer
, and 2) injects the _series_id
field into the schema that is used to produce RecordBatch
es in TableBuffer::record_batches
(here).
The server compiles and runs with this change, and accepts writes. Unfortunately, queries are broken after this change. Any query that selects the _series_id
column, including SELECT *
, breaks during execution, with an internal Datafusion error:
Arrow error: Invalid argument error: column types must match schema types, expected Utf8 but found FixedSizeBinary(32) at column index 1
Queries that do not select _series_id
work just fine, e.g., SELECT time, host, usage FROM cpu
.
I think the reason is that the BufferChunk
and ParquetChunk
produced by the SegmentState
are still using the original schema::Schema
(the iox one, not the arrow one), which cannot have the correct column type for the _series_id
. This causes the error sometime during execution when RecordBatch
es are bing produced (have been having some difficulty pinpointing where specifically the batches are being produced, but the error looks to be from here).
Regardless, I don't know that we can change this, because the Buffer/ParquetChunk
types need to implement QueryChunk
from iox_query
, which needs the schema::Schema
.
My intention today is to go down the path of adding the Fixed Size Binary as a field type in the schema
crate, see whether or not it can work, and see what implications that has on its dependencies. That is, it will be a new variant on InfluxFieldType
, vs. on InfluxColumnType
.
My reason for doing this is that it would give us the ability to store _series_id
in the desired format in Parquet, vs. a String. This should resolve the issues above that are due to us not having a InfluxColumnType
that can represent the data type we are writing, while (hopefully) not affecting any of the intricacies of the existing data model, and its use of tags/time for deduplication, etc. (to your point re: "the deep assumptions about the data model" @alamb)
My feeling is that we cannot fully leverage the _series_id
(as described in the parent issue) without building a new data model around it, but this at least gives us the ability to store _series_id
in Parquet in such a way that is forward compatible. We can hopefully implement the desired deduplication capabilities on top of this without having to modify iox_query
. We may have to implement our own create_chunk_statistics
, for example.
I have been able to make some progress on this, which I have captured in a couple of PRs, and will summarize here.
core
(https://github.com/influxdata/influxdb3_core/pull/12)I took another stab at adding a new InfluxColumnType
variant to support the SeriesId
. I opened a PR to show these changes in core
here: https://github.com/influxdata/influxdb3_core/pull/12. As mentioned in the PR, I took the approach of making the changes in iox
first, ensuring that cargo test --workspace
still passed, and then sync'ing them to core
.
The changes there blatantly ignore some of the heavy lifting. For example, the changes that might be required in the mutable_batch
crate to support the new column type were bypassed via using unimplemented!
. This may be a risk, because we obviously don't want to introduce panics, however, since cargo test --workspace
is fully green, I am less worried (this link is private, but the test results can be seen in the iox repo here).
influxdb
(https://github.com/influxdata/influxdb/pull/24902)The above changes made supporting the binary column type in influxdb
fairly simple. You can see that in this PR: https://github.com/influxdata/influxdb/pull/24902
There are test failures in that PR, but they are no longer for internal Datafusion issues during query execution (see the PR description for more detail). This also eliminated the need to fabricate the schema when producing RecordBatch
s from the table buffer.
_series_id
column on parquet file size.arrow-rs
regarding the formatting of FixedSizeBinary
columns in JSON writer (https://github.com/apache/arrow-rs/issues/5620)SeriesId
is ignored as a primary key / for deduplication, experiment with changing that in core
(I am following along -- I think it sounds like you are on the right track for evaluating how much of a change adding a new column would be)
I have some preliminary results comparing the parquet file size with and without the _series_id
column.
TL;DR: what I found was that the _series_id
, when stored as FixedSizeBinary(32)
, increases the size of a parquet file by 10.93 bytes / row of data in the file.
one_mil
I ran two scenarios:
no_series_id
- using a release version of the influxdb3
binary with the _series_id
column completely removed. That is on this branch: https://github.com/influxdata/influxdb/tree/hiltontj/no-series-id-col-experimentwith_series_id
- using a release version of the influxdb3
binary with the _series_id
, from the branch on the above PR, i.e., https://github.com/influxdata/influxdb/tree/hiltontj/seriesid-col-experimentFor each test I did the following:
influxdb3
binary with a file-based object store, and a completely clean data directorywrite
mode with the following specifications:
one_mil
- generates high-cardinality measurement data using a single tag1m
5m
5
With a segment duration of 1m
and test duration of 5m
, this allowed for a few segments to be persisted while the load generator was running at full steam, and the resulting parquet files to be consistent in row count between the two tests.
No series ID | With Series ID | Delta | |
---|---|---|---|
Avg. File Size (B) | 203,091,950.75 | 235,875,544.00 | |
Std. Dev. | 4,664.87 | 9,019.07 | |
Rows / File | 3,000,000 | 3,000,000 | |
Avg. Size / Row (B) | 67.70 | 78.63 | +10.93 |
one_mil_modified
I ran the same scenarios through the same set of steps as above, but this time, with a modified version of the one_mil
spec:
{
"name": "one_mil_modified",
"measurements": [
{
"name": "data",
"tags": [
{
"key": "series_number",
"value": "sn-",
"cardinality": 1000000
}
],
"fields": [
{
"key": "bool_val",
"bool": true
}
],
"lines_per_sample": 10000
}
]
}
Like one_mil
, it generates a high-cardinality measurement using a single tag, but I trimmed down the field data to a single boolean field to reduce the extraneous data in the parquet files being generated.
No series ID | With Series ID | Delta | |
---|---|---|---|
Avg. File Size (B) | 3,132,351.25 | 35,922,848.25 | |
Std. Dev. | 116.45 | 140.73 | |
Rows / File | 3,000,000 | 3,000,000 | |
Avg. Size / Row (B) | 1.04 | 11.97 | +10.93 |
one_mil_modified
test (~1,000%), but lesser so in the one_mil
test (~16%)._series_id
hashes, I suppose there is some compression taking place.In addition to the results in my previous comment, here is an analysis of the metadata of one of the parquet files with a _series_id
.
Most notably,
_series_id
is consistent with the above (10.93 B)Row group 0
, it is 11.963 MB compressed compared to 44.386 MB uncompressed_series_id
as bytesFile path: dbs/load_test/data/2024-04-10T21-06/4294967291.parquet
Created by: parquet-rs version 50.0.0
Properties:
<omitted>
Schema:
message arrow_schema {
required fixed_len_byte_array(32) _series_id;
optional boolean bool_val;
optional binary series_number (STRING);
required int64 time (TIMESTAMP(NANOS,false));
}
Row group 0: count: 1048576 11.96 B records start: 4 total(compressed): 11.963 MB total(uncompressed):44.386 MB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
_series_id FIXED[32] Z BB_ 1048576 10.93 B "0x0000572D0275C43641EB6E3..." / "0xFFFFCB8073353DD1DC049D5..."
bool_val BOOLEAN Z RB_ 1048576 0.13 B "false" / "true"
series_number BINARY Z RB_ 1048576 0.89 B "sn-1" / "sn-41457"
time INT64 Z BB_ 1048576 0.02 B "2024-04-10T21:06:00.43700..." / "2024-04-10T21:06:59.53900..."
Row group 1: count: 1048576 11.95 B records start: 12544291 total(compressed): 11.945 MB total(uncompressed):44.386 MB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
_series_id FIXED[32] Z BB_ 1048576 10.93 B "0x00000E57F40EDC157F80CE5..." / "0xFFFFF80F8A7337A42F29FEE..."
bool_val BOOLEAN Z RB_ 1048576 0.13 B "false" / "true"
series_number BINARY Z RB_ 1048576 0.87 B "sn-41457" / "sn-729142"
time INT64 Z BB_ 1048576 0.02 B "2024-04-10T21:06:00.54000..." / "2024-04-10T21:06:59.53900..."
Row group 2: count: 902848 12.01 B records start: 25070056 total(compressed): 10.339 MB total(uncompressed):38.008 MB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
_series_id FIXED[32] Z BB_ 902848 10.93 B "0x0000892CDD00C27F7F8F966..." / "0xFFFF2E54CFEEBD2140004A1..."
bool_val BOOLEAN Z RB_ 902848 0.13 B "false" / "true"
series_number BINARY Z RB_ 902848 0.93 B "sn-729142" / "sn-999999"
time INT64 Z BB_ 902848 0.02 B "2024-04-10T21:06:00.54000..." / "2024-04-10T21:06:59.53900..."
I plan to run a couple more tests:
I ran the test against a release build of influxdb3
that stores the _series_id
in a string column. Below is the parquet file meta analysis.
Most notably:
_series_id
is 0.5 B larger than the bytes version (11.44 B vs. 10.93 B)_series_id
as String
File path: dbs/load_test/data/2024-04-11T14-13/4294967292.parquet
Created by: parquet-rs version 50.0.0
Properties:
<omitted>
Schema:
message arrow_schema {
optional binary _series_id (STRING);
optional boolean bool_val;
optional binary series_number (STRING);
required int64 time (TIMESTAMP(NANOS,false));
}
Row group 0: count: 1048576 12.47 B records start: 4 total(compressed): 12.471 MB total(uncompressed):78.436 MB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
_series_id BINARY Z RB_ 1048576 11.44 B "0000572d0275c43641eb6e30c..." / "ffffcb8073353dd1dc049d592..."
bool_val BOOLEAN Z RB_ 1048576 0.13 B "false" / "true"
series_number BINARY Z RB_ 1048576 0.89 B "sn-1" / "sn-41457"
time INT64 Z BB_ 1048576 0.02 B "2024-04-11T14:13:00.94800..." / "2024-04-11T14:13:59.94600..."
Row group 1: count: 1048576 12.46 B records start: 13076946 total(compressed): 12.458 MB total(uncompressed):78.436 MB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
_series_id BINARY Z RB_ 1048576 11.44 B "00000e57f40edc157f80ce5fd..." / "fffff80f8a7337a42f29feeda..."
bool_val BOOLEAN Z RB_ 1048576 0.13 B "false" / "true"
series_number BINARY Z RB_ 1048576 0.87 B "sn-41457" / "sn-729142"
time INT64 Z BB_ 1048576 0.02 B "2024-04-11T14:13:00.94500..." / "2024-04-11T14:13:59.94600..."
Row group 2: count: 902848 12.53 B records start: 26140643 total(compressed): 10.790 MB total(uncompressed):67.053 MB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
_series_id BINARY Z RB_ 902848 11.45 B "0000892cdd00c27f7f8f96614..." / "ffff2e54cfeebd2140004a143..."
bool_val BOOLEAN Z RB_ 902848 0.13 B "false" / "true"
series_number BINARY Z RB_ 902848 0.93 B "sn-729142" / "sn-999999"
time INT64 Z BB_ 902848 0.02 B "2024-04-11T14:13:00.84400..." / "2024-04-11T14:13:59.94600..."
The one_mil
data set with that configuration represents the worst case for compression of _series_id
as you have only 3M rows and 1M unique values. It would very interesting to see what this looks like with a much more modest cardinality (3M rows and 100k unique values for instance. Also, to have the parquet data sorted by _series_id, not by tags, which should have a big impact on the size, particularly if you have the writer use RLE for compressing the column. You may need to do quite a bit of fiddling around with things to test this last case. However, that would be the most representative as we intend to actually sort the Parquet files by series id.
I ran a test where the load data generated a measurement with a cardinality of one. This is the spec that was used:
{
"name": "cardinality_one",
"measurements": [
{
"name": "data",
"tags": [
{
"key": "cardinality",
"value": "one"
}
],
"fields": [
{
"key": "bool_val",
"bool": true
}
],
"lines_per_sample": 10000
}
]
}
Below is the parquet meta analysis, but notably,
_series_id
is ~1 BFile path: dbs/load_test/data/2024-04-11T14-46/4294967291.parquet
Created by: parquet-rs version 50.0.0
Properties:
<omitted>
Schema:
message arrow_schema {
required fixed_len_byte_array(32) _series_id;
optional boolean bool_val;
optional binary cardinality (STRING);
required int64 time (TIMESTAMP(NANOS,false));
}
Row group 0: count: 137 10.12 B records start: 4 total(compressed): 1.354 kB total(uncompressed):5.756 kB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
_series_id FIXED[32] Z BB_ 137 1.03 B "0xFDED0EE165FEB5DD31AFB61..." / "0xFDED0EE165FEB5DD31AFB61..."
bool_val BOOLEAN Z RB_ 137 0.45 B "false" / "true"
cardinality BINARY Z RB_ 137 0.59 B "one" / "one"
time INT64 Z BB_ 137 8.04 B "2024-04-11T14:46:00.17700..." / "2024-04-11T14:46:59.27900..."
It would very interesting to see what this looks like with a much more modest cardinality
@pauldix - in my recent comment, dialing down the cardinality does drastically decrease the average size of the _series_id
(albeit, in this case to a cardinality of one is obviously too modest).
I will run some more modest versions of the one_mil
spec as you suggest, e.g., 1k cardinality, 10k, 100k, etc., and post the results.
After that, I will start to think about implementing the sort by _series_id
and RLE compression.
I ran the same meta analysis for a single tag data set producing the following cardinalities: 1k, 10k, 100k, and 1M.
The sampling interval and writer count (Nwriters) were varied in each test to ensure that a resulting parquet file would have a full row group, i.e., containing 1,048,576 rows, from which I would analyze the stats.
Cardinality | 1k | 10k | 100k | 1M |
---|---|---|---|---|
Nwriters | 10 | 5 | 5 | 5 |
Sampling Interval (ms) | 50 | 100 | 100 | 1,000 |
Lines per sample | 1,000 | 1,000 | 1,000 | 10,000 |
Nrows in Row Group | 1,048,576 | 1,048,576 | 1,048,576 | 1,048,576 |
Avg. Size of _series_id (B) |
0.04 | 0.12 | 1.13 | 10.93 |
Overhead of _series_id (MB) |
0.04 | 0.13 | 1.18 | 11.46 |
Compressed Row Group Size (MB) | 0.23 | 0.282 | 1.372 | 11.963 |
Uncompressed Row Group Size (MB) | 33.663 | 33.564 | 33.919 | 44.386 |
_EDIT: I added the Overhead of _series_id
row, which is calculated as *Nrows Avg. Size of _series_id
**_
_series_id
looks directly proportional to the cardinality in the file, within this rangeNot captured by this table, but can be seen, for example, in the full meta analysis above, is that individual row-groups cover different ranges of the available cardinality. The coverage of each row-group varies in extent, but the average size of the _series_id
is consistent across all row groups in the file. This is the case for all test runs.
I think the above analysis is promising. The fact that the impact on file size is proportional to the cardinality should work in our favour; namely, sorting by _series_id
should reduce the cardinality of each file, thereby reducing the impact of the _series_id
on file size.
I am going to move onto #24909 to investigate how we go about doing the sorting, as well as introducing RLE for the _series_id
column.
If we like what we are seeing here, then we can tackle the issue of how we enable the FixedSizeBinary
column type, for real. I think the PRs I have opened on this issue serve as a solid starting point, but we need to make sure we are introducing it safely to IOx, so will definitely need a bit more work.
One idea I had, was to introduce the changes in IOx/core
with a feature flag, that we enable explicitly in influxdb3
, but is disabled by default in IOx. That way, any risk associated with the unimplemented!
s I have added will be absent in IOx.
@pauldix - cross-posting some results from using non-standard encoding on the _series_id
column here: https://github.com/influxdata/influxdb/issues/24909#issuecomment-2050592512. Looks promising.
I ran the same meta analysis for a single tag data set producing the following cardinalities: 1k, 10k, 100k, and 1M.
@hiltontj , do you have numbers for the overhead of the size of series_id
for the table in https://github.com/influxdata/influxdb/issues/24845#issuecomment-2050158203?
Specifically, how much larger is a the file with the (extra) series_id
column that without / what percentage of the storage space is taken up by series_id
?
@alamb - see below.
do you have numbers for the overhead of the size of series_id for the table in https://github.com/influxdata/influxdb/issues/24845#issuecomment-2050158203?
I updated https://github.com/influxdata/influxdb/issues/24845#issuecomment-2050158203 to include a row that shows the calculated overhead. It looks like this corresponds to the compressed overhead, though the parquet-cli
tool is not explicit about that.
I think I am correct in saying that the overhead would be the average size of the _series_id
multiplied by the number of rows. Since the average size looks to be a function of cardinality, then the overhead would also be a function of cardinality.
Specifically, how much larger is a the file with the (extra) series_id column that without / what percentage of the storage space is taken up by series_id?
https://github.com/influxdata/influxdb/issues/24909#issuecomment-2050592512 shows the with/without comparison side-by-side.
In that case, even the encoded version, which is substantially better than the non-encoded version, is 2-3x larger than the file without the _series_id
. However, keep in mind that is relative; the data in those tests has the schema:
message arrow_schema {
required fixed_len_byte_array(32) _series_id;
optional boolean bool_val;
optional binary series_number (STRING);
required int64 time (TIMESTAMP(NANOS,false));
}
There is only one boolean
field in addition to the single tag (series_number
), time
, and the _series_id
.
This is how the individual rows stack up, for the row group that I pulled numbers from:
Row group 1: count: 1048576 1.43 B records start: 1485652 total(compressed): 1.427 MB total(uncompressed):4.775 MB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
_series_id FIXED[32] Z BB 1048576 1.19 B "0x0000E38A0B081FF3E80EDDF..." / "0xFFFEF19DD886C08A83A523F..."
bool_val BOOLEAN Z RB_ 1048576 0.13 B "false" / "true"
series_number BINARY Z RB_ 1048576 0.09 B "sn-41454" / "sn-72912"
time INT64 Z BB_ 1048576 0.02 B "2024-04-11T21:06:00.06400..." / "2024-04-11T21:06:59.96400..."
I have avoided presenting the overhead as a relative difference in total file size, because the relative impact of the _series_id
would be less to a data set containing many fields than it would be to a data set containing few.
I wanted to provide one more summary here, that compares the difference in file size with an encoded _series_id
column and without the _series_id
column, for two cardinalities: 10k and 100k.
This re-uses the data from https://github.com/influxdata/influxdb/issues/24909#issuecomment-2050592512.
Below are the results, as well as the full row group stats for each test/cardinality pair.
Test | No SID | SID + Sort + Enc | No SID | SID + Sort + Enc |
---|---|---|---|---|
Cardinality | 10k | 10k | 100k | 100k |
Avg. Size of _series_id (B) |
- | 0.11 | - | 1.19 |
Overhead of _series_id (MB)1 |
- | 0.12 | - | 1.25 |
Compressed Row Group Size (MB) | 0.156 | 0.269 | 0.25 | 1.427 |
Uncompressed Row Group Size (MB) | 1.561 | 2.417 | 1.916 | 4.775 |
Parameter | Value |
---|---|
Nwriters | 5 |
Sampling Interval (ms) | 100 |
Lines per sample | 1,000 |
Nrows in Row Group | 1,048,576 |
Row group 0: count: 1048576 0.15 B records start: 4 total(compressed): 155.515 kB total(uncompressed):1.561 MB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
bool_val BOOLEAN Z RB_ 1048576 0.13 B "false" / "true"
series_number BINARY Z RB_ 1048576 0.01 B "sn-1" / "sn-4143"
time INT64 Z BB_ 1048576 0.02 B "2024-04-12T14:54:00.04700..." / "2024-04-12T14:54:59.94900..."
Row group 0: count: 1048576 0.26 B records start: 4 total(compressed): 268.692 kB total(uncompressed):2.417 MB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
_series_id FIXED[32] Z BB 1048576 0.11 B "0x0011164692CD3CDDE71F0FC..." / "0xFFEB1A524B9350EA888990E..."
bool_val BOOLEAN Z RB_ 1048576 0.13 B "false" / "true"
series_number BINARY Z RB_ 1048576 0.01 B "sn-1" / "sn-4143"
time INT64 Z BB_ 1048576 0.02 B "2024-04-12T14:46:00.07500..." / "2024-04-12T14:46:59.97500..."
Row group 0: count: 1048576 0.24 B records start: 4 total(compressed): 250.379 kB total(uncompressed):1.916 MB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
bool_val BOOLEAN Z RB_ 1048576 0.13 B "false" / "true"
series_number BINARY Z RB_ 1048576 0.09 B "sn-1" / "sn-41454"
time INT64 Z BB_ 1048576 0.03 B "2024-04-12T00:21:00.04300..." / "2024-04-12T00:21:59.94500..."
Row group 1: count: 1048576 1.43 B records start: 1485652 total(compressed): 1.427 MB total(uncompressed):4.775 MB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
_series_id FIXED[32] Z BB 1048576 1.19 B "0x0000E38A0B081FF3E80EDDF..." / "0xFFFEF19DD886C08A83A523F..."
bool_val BOOLEAN Z RB_ 1048576 0.13 B "false" / "true"
series_number BINARY Z RB_ 1048576 0.09 B "sn-41454" / "sn-72912"
time INT64 Z BB_ 1048576 0.02 B "2024-04-11T21:06:00.06400..." / "2024-04-11T21:06:59.96400..."
1 - Overhead of _series_id
is calculated as Average Size of _series_id
multiplied by Nrows
Test: SID + Sort + Enc, Cardinality: 100k
So if I am reading these values correctly, in the high cardinality case, of storing the series id requires around 5x the space compared to the rest of the data in the benchmark (1.19B/row vs 0.24B per row for everything else)
I think a more realistic dataset is to have a float_val
column which I expect would take up more than the bool_val column.
In any event, my conclusion is that in may cases the overhead of storing a _series_id
column will be substantial
@alamb - here is an example using a float instead of a boolean:
Row group 0: count: 1048576 3.84 B records start: 4 total(compressed): 3.839 MB total(uncompressed):7.159 MB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
_series_id FIXED[32] Z BB 1048576 1.17 B "0x000382C6FC1F91A82EC716A..." / "0xFFFE7C8DACBE5533A1BF59C..."
float_val DOUBLE Z RB_ 1048576 2.56 B "-0.0" / "100.0"
series_number BINARY Z RB_ 1048576 0.09 B "sn-1" / "sn-41454"
time INT64 Z BB_ 1048576 0.01 B "2024-04-12T18:21:00.48800..." / "2024-04-12T18:21:59.58900..."
However, this is a random float between 0 and 100, which will compress terribly, and therefore, in practise, a float column might be considerably smaller.
The load generator can only do random or fixed floats right now, neither of which is realistic, so, it may be worth having a mode that produces more deterministic output.
Thanks @hiltontj -- I still think it is a good data point -- specifically the series_is will likely take approximately as much space as a float field
The
_series_id
column added as part of #24815 is currently stored as aString
column. It would be more ideal if we could store this as bytes directly, however, doing so is not currently supported in the InfluxDB 2.0 data model.There are a few dependencies from
influxdb3_core
where this may need to be addressed:data_types::ColumnType
schema::InfluxFieldType
Questions/Considerations
_series_id
column represented (if at all) in queries when stored as bytes?