Support column type for storing bytes

influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics

https://influxdata.com

Apache License 2.0

28.65k stars 3.54k forks source link

Support column type for storing bytes #24845

Closed hiltontj closed 4 months ago

hiltontj commented 5 months ago

The _series_id column added as part of #24815 is currently stored as a String column. It would be more ideal if we could store this as bytes directly, however, doing so is not currently supported in the InfluxDB 2.0 data model.

There are a few dependencies from influxdb3_core where this may need to be addressed:

Questions/Considerations

How is the _series_id column represented (if at all) in queries when stored as bytes?

pauldix commented 5 months ago

We can return it as a string like you have it stored right now. So store as bytes, but return as string?

hiltontj commented 5 months ago

So store as bytes, but return as string?

That would keep it simple.

Datafusion has a scalar function for encoding binary data. So, I guess one could do

SELECT encode(_series_id, hex) FROM measurement;

If we didn't want to do the encoding for them.

Is the idea, though, to only return _series_id if it is explicitly queried?

i.e., in a SELECT *, it wouldn't be returned...

pauldix commented 5 months ago

Yeah, I think only returning the _series_id column if it is explicitly queried is maybe the better user experience. Generally they don't care what that is or that it exists, it's an optimization to make sorting/dedupe faster and to enable us to have indexes later on. So it's really for internal use.

hiltontj commented 5 months ago

I think we would need to augment the SQL/InfluxQL planner code to omit the _series_id column in the projection, unless specified in the SELECT clause. @mgattozzi - you may gain some insight on how to do this while working on #24822

alamb commented 5 months ago

Cross posting slack discussion: https://influxdata.slack.com/archives/C04PASKL88P/p1712322573281069

hiltontj commented 5 months ago

Cross posting slack discussion: https://influxdata.slack.com/archives/C04PASKL88P/p1712322573281069

Since that discussion is private, I will summarize here:

We considered the alternative of using a u64 for the _series_id column, instead of storing a byte array, i.e., [u8; 32], with a SHA256 hash.

A u64 could provide performance gains during ingest/query, because it takes up less space, is faster to compute, and would likely be faster for lookups/indexing. However, it would not be advantageous during persistence, as we would not be able to rely on it for de-duplication due to the possibility of collisions.

We will proceed with the [u8; 32], and see if we can incorporate the use of FixedSizeBinary as a column type to store the _series_id.

hiltontj commented 5 months ago

Investigation: Extend the `TableBuffer` with the capability to support `FixedSizeBinary` column data

It is trivial to add a new data type to the TableBuffer, by extending its Builder enum with a FixedSizeBinary builder, and refactoring the necessary code. This code will compile, and will allow us to write; however, this will lead to run-time panics when the TableBuffer attempts to produce RecordBatches when queried.

Here is the call stack for the culprit of the panic, TableBuffer::record_batches, which going all the way up, is called from the DataFusion TableProvider implementation of QueryTable during physical query execution:

The issue we run into is with the Schema type from influxdb3_core::schema crate. TableBuffer::record_batches pulls out the inner arrow schema, takes the column data written, and produces arrow RecordBatches using built-in arrow functionality. The Schema definition, which represents the InfluxDB v2 data model, does not have the _series_id column supported, and therefore, the creation of RecordBatches panics, due to a mismatch in the new column type being written and what is in the schema.

Edit: see below.

The options I am considering at this point:

Extend the data model (InfluxColumnType) in the schema crate with a SeriesId
Implement our own, or create a fork of, schema
See if the TableBuffer can inject the _series_id column to the schema when producing RecordBatches
Open to suggestions

I did spend some time hacking on (1.), and I don't think this is a good direction. The issue with (1.) is that it has many downstream implications in iox when we extend the data model; some of the required code changes are non-trivial and controversial:

Arguably, we would need to augment the primary_key functionality of Schema, which would create big problems in iox.
Changes required extend to the mutable_batch and arrow_util crates for supporting a new data type would be challenging to implement while not being relevant to us in influxdb3, and would therefore be introducing changes to iox for no immediate reason.
I believe the changes need to account for nullability, which is a property that we really don’t want this field to have, and also makes things tricky w.r.t. the first bullet point.

It may be possible that we extend the data model with a new field type to support the fixed size binary array. It would just be a field, though, so I don't know that this gives us much improvement over what we are doing now by storing it as a String field.

I need to better understand how deeply embedded the schema::Schema is in the querying side of things to decide if (2.) or (3.) are worth considering. Specifically:

[ ] What pieces of functionality does the Schema provide other than the ability to produce record batches as it is doing for the TableBuffer?
[ ] Would supplanting the use of schema::Schema with something else hinder our ability to rely on iox_query for the SQL/InfluxQL query planning code that we currently rely on?
[ ] How much work would (2.) entail?

pauldix commented 5 months ago

I think option #3 should be easy enough to implement. Have a look at my test indexing branch, I produce a record batch with a projected schema here: https://github.com/influxdata/influxdb/blob/pd/buffer-indexed/influxdb3_write/src/write_buffer/table_buffer.rs#L223-L224

You could inject whatever schema you want when you're putting together the record batch.

The only problem with _series_id being a field is that we want any deduplication to use that and time, not the tags. But that change would absolutely require big changes to iox_query to work that way. So in advance of that maybe using a field for _series_id would work. It's possible to get the query engine to not dedupe by having the buffer chunk return false here: https://github.com/influxdata/influxdb/blob/8f59f935c5961297d9cffc3a18f2bd8d78b36f44/influxdb3_write/src/chunk.rs#L44-L46

hiltontj commented 5 months ago

@pauldix - there is something I overlooked: although it is not too hard to add support to our TableBuffer for the FixedSizeBinary data type to allow writes, the reason for the panic when producing record batches is because I am still initializing the column for _series_id as a String. That happens here: https://github.com/influxdata/influxdb/blob/8f59f935c5961297d9cffc3a18f2bd8d78b36f44/influxdb3_write/src/write_buffer/mod.rs#L547

However, because there is no ColumnType, i.e., InfluxColumnType, for the byte array, we don't have a way to initialize it correctly. Adding it as a field might be a shortcut solution then.

Either way, I will likely spend a bit more time understanding the current Schema, but will then try out the schema projection approach you linked there, and see where I can get with that before trying to extend the InfluxColumnType again.

pauldix commented 5 months ago

Yeah, you wouldn't do a projection, but just construct a new ArrowSchema struct and insert the series id column into it.

alamb commented 5 months ago

Extend the data model (InfluxColumnType) in the schema crate with a SeriesId

I agree this is messy and will have many non trivial implications, but I think that is inevitable if you change how the data model works.

In other words, I don't think it is just how the code is structured -- there are deep assumptions about the data model (in InfluxQL for example, and the deduplication)

if the goal is to hack something working I think option 3 ("See if the TableBuffer can inject the _series_id column to the schema when producing RecordBatches") is a reasonable idea

hiltontj commented 5 months ago

I have been doing some experimentation on (3.). Here is a commit that: 1) adds support for the SeriesId/FixedSizeBinary into the TableBuffer, and 2) injects the _series_id field into the schema that is used to produce RecordBatches in TableBuffer::record_batches (here).

The server compiles and runs with this change, and accepts writes. Unfortunately, queries are broken after this change. Any query that selects the _series_id column, including SELECT *, breaks during execution, with an internal Datafusion error:

Arrow error: Invalid argument error: column types must match schema types, expected Utf8 but found FixedSizeBinary(32) at column index 1

Queries that do not select _series_id work just fine, e.g., SELECT time, host, usage FROM cpu.

I think the reason is that the BufferChunk and ParquetChunk produced by the SegmentState are still using the original schema::Schema (the iox one, not the arrow one), which cannot have the correct column type for the _series_id. This causes the error sometime during execution when RecordBatches are bing produced (have been having some difficulty pinpointing where specifically the batches are being produced, but the error looks to be from here).

Regardless, I don't know that we can change this, because the Buffer/ParquetChunk types need to implement QueryChunk from iox_query, which needs the schema::Schema.

hiltontj commented 5 months ago

My intention today is to go down the path of adding the Fixed Size Binary as a field type in the schema crate, see whether or not it can work, and see what implications that has on its dependencies. That is, it will be a new variant on InfluxFieldType, vs. on InfluxColumnType.

My reason for doing this is that it would give us the ability to store _series_id in the desired format in Parquet, vs. a String. This should resolve the issues above that are due to us not having a InfluxColumnType that can represent the data type we are writing, while (hopefully) not affecting any of the intricacies of the existing data model, and its use of tags/time for deduplication, etc. (to your point re: "the deep assumptions about the data model" @alamb)

My feeling is that we cannot fully leverage the _series_id (as described in the parent issue) without building a new data model around it, but this at least gives us the ability to store _series_id in Parquet in such a way that is forward compatible. We can hopefully implement the desired deduplication capabilities on top of this without having to modify iox_query. We may have to implement our own create_chunk_statistics, for example.

hiltontj commented 5 months ago

I have been able to make some progress on this, which I have captured in a couple of PRs, and will summarize here.

Changes to IOx/`core` (https://github.com/influxdata/influxdb3_core/pull/12)

I took another stab at adding a new InfluxColumnType variant to support the SeriesId. I opened a PR to show these changes in core here: https://github.com/influxdata/influxdb3_core/pull/12. As mentioned in the PR, I took the approach of making the changes in iox first, ensuring that cargo test --workspace still passed, and then sync'ing them to core.

The changes there blatantly ignore some of the heavy lifting. For example, the changes that might be required in the mutable_batch crate to support the new column type were bypassed via using unimplemented!. This may be a risk, because we obviously don't want to introduce panics, however, since cargo test --workspace is fully green, I am less worried (this link is private, but the test results can be seen in the iox repo here).

Changes to `influxdb` (https://github.com/influxdata/influxdb/pull/24902)

The above changes made supporting the binary column type in influxdb fairly simple. You can see that in this PR: https://github.com/influxdata/influxdb/pull/24902

There are test failures in that PR, but they are no longer for internal Datafusion issues during query execution (see the PR description for more detail). This also eliminated the need to fabricate the schema when producing RecordBatchs from the table buffer.

Next Steps

I will keep these changes on their respective experimental branches and use them to conduct some tests, e.g., the impact of the _series_id column on parquet file size.
Open an issue in arrow-rs regarding the formatting of FixedSizeBinary columns in JSON writer (https://github.com/apache/arrow-rs/issues/5620)
The SeriesId is ignored as a primary key / for deduplication, experiment with changing that in core

alamb commented 5 months ago

popcorn-eating (I am following along -- I think it sounds like you are on the right track for evaluating how much of a change adding a new column would be)

hiltontj commented 5 months ago

I have some preliminary results comparing the parquet file size with and without the _series_id column.

TL;DR: what I found was that the _series_id, when stored as FixedSizeBinary(32), increases the size of a parquet file by 10.93 bytes / row of data in the file.

Test: `one_mil`

Setup

I ran two scenarios:

no_series_id - using a release version of the influxdb3 binary with the _series_id column completely removed. That is on this branch: https://github.com/influxdata/influxdb/tree/hiltontj/no-series-id-col-experiment
with_series_id - using a release version of the influxdb3 binary with the _series_id, from the branch on the above PR, i.e., https://github.com/influxdata/influxdb/tree/hiltontj/seriesid-col-experiment

For each test I did the following:

start the influxdb3 binary with a file-based object store, and a completely clean data directory
start the load generator tool in write mode with the following specifications:
- spec: the built-in one_mil - generates high-cardinality measurement data using a single tag
- segment duration: 1m
- test duration: 5m
- writer count: 5
once the load test completes, save the entire data directory somewhere for analysis, then wipe it for the next test scenario

With a segment duration of 1m and test duration of 5m, this allowed for a few segments to be persisted while the load generator was running at full steam, and the resulting parquet files to be consistent in row count between the two tests.

Results

	No series ID	With Series ID	Delta
Avg. File Size (B)	203,091,950.75	235,875,544.00
Std. Dev.	4,664.87	9,019.07
Rows / File	3,000,000	3,000,000
Avg. Size / Row (B)	67.70	78.63	+10.93

Test: `one_mil_modified`

Setup

I ran the same scenarios through the same set of steps as above, but this time, with a modified version of the one_mil spec:

{
  "name": "one_mil_modified",
  "measurements": [
    {
      "name": "data",
      "tags": [
        {
          "key": "series_number",
          "value": "sn-",
          "cardinality": 1000000
        }
      ],
      "fields": [
        {
          "key": "bool_val",
          "bool": true
        }
      ],
      "lines_per_sample": 10000
    }
  ]
}

Like one_mil, it generates a high-cardinality measurement using a single tag, but I trimmed down the field data to a single boolean field to reduce the extraneous data in the parquet files being generated.

Results

	No series ID	With Series ID	Delta
Avg. File Size (B)	3,132,351.25	35,922,848.25
Std. Dev.	116.45	140.73
Rows / File	3,000,000	3,000,000
Avg. Size / Row (B)	1.04	11.97	+10.93

Discussion

The increase in average size / row in bytes was consistent between both tests, which is somewhat re-assuring.
I don't know what we would deem to be acceptable as an increase. 11 bytes per row constitutes a large relative increase in file size in the one_mil_modified test (~1,000%), but lesser so in the one_mil test (~16%).
11 bytes per row is better than 32 bytes per row, which is what it takes to store the _series_id hashes, I suppose there is some compression taking place.

hiltontj commented 5 months ago

In addition to the results in my previous comment, here is an analysis of the metadata of one of the parquet files with a _series_id.

Most notably,

The average size per _series_id is consistent with the above (10.93 B)
This analysis shows the compression factor at the row-group level, e.g., for Row group 0, it is 11.963 MB compressed compared to 44.386 MB uncompressed

Parquet Meta-Analysis when storing the `_series_id` as bytes

File path:  dbs/load_test/data/2024-04-10T21-06/4294967291.parquet
Created by: parquet-rs version 50.0.0
Properties:
  <omitted>
Schema:
message arrow_schema {
  required fixed_len_byte_array(32) _series_id;
  optional boolean bool_val;
  optional binary series_number (STRING);
  required int64 time (TIMESTAMP(NANOS,false));
}

Row group 0:  count: 1048576  11.96 B records  start: 4  total(compressed): 11.963 MB total(uncompressed):44.386 MB 
--------------------------------------------------------------------------------
               type      encodings count     avg size   nulls   min / max
_series_id     FIXED[32] Z BB_     1048576   10.93 B          "0x0000572D0275C43641EB6E3..." / "0xFFFFCB8073353DD1DC049D5..."
bool_val       BOOLEAN   Z RB_     1048576   0.13 B             "false" / "true"
series_number  BINARY    Z RB_     1048576   0.89 B             "sn-1" / "sn-41457"
time           INT64     Z BB_     1048576   0.02 B             "2024-04-10T21:06:00.43700..." / "2024-04-10T21:06:59.53900..."

Row group 1:  count: 1048576  11.95 B records  start: 12544291  total(compressed): 11.945 MB total(uncompressed):44.386 MB 
--------------------------------------------------------------------------------
               type      encodings count     avg size   nulls   min / max
_series_id     FIXED[32] Z BB_     1048576   10.93 B          "0x00000E57F40EDC157F80CE5..." / "0xFFFFF80F8A7337A42F29FEE..."
bool_val       BOOLEAN   Z RB_     1048576   0.13 B             "false" / "true"
series_number  BINARY    Z RB_     1048576   0.87 B             "sn-41457" / "sn-729142"
time           INT64     Z BB_     1048576   0.02 B             "2024-04-10T21:06:00.54000..." / "2024-04-10T21:06:59.53900..."

Row group 2:  count: 902848  12.01 B records  start: 25070056  total(compressed): 10.339 MB total(uncompressed):38.008 MB 
--------------------------------------------------------------------------------
               type      encodings count     avg size   nulls   min / max
_series_id     FIXED[32] Z BB_     902848    10.93 B          "0x0000892CDD00C27F7F8F966..." / "0xFFFF2E54CFEEBD2140004A1..."
bool_val       BOOLEAN   Z RB_     902848    0.13 B             "false" / "true"
series_number  BINARY    Z RB_     902848    0.93 B             "sn-729142" / "sn-999999"
time           INT64     Z BB_     902848    0.02 B             "2024-04-10T21:06:00.54000..." / "2024-04-10T21:06:59.53900..."

hiltontj commented 5 months ago

I plan to run a couple more tests:

[x] run the one_mil_modified test with the _series_id stored as a string, and see if there is a difference in the average size per _series_id (results)
[x] try a spec that produces a measurement with no (or very low) cardinality, to see if the compression factor improves (results)

hiltontj commented 5 months ago

I ran the test against a release build of influxdb3 that stores the _series_id in a string column. Below is the parquet file meta analysis.

Most notably:

The average size of each _series_id is 0.5 B larger than the bytes version (11.44 B vs. 10.93 B)
The uncompressed file size is almost twice as large when using strings as in the previous meta analysis of the bytes version (78 MB vs. 44 MB).

Parquet Meta-Analysis for `_series_id` as `String`


File path:  dbs/load_test/data/2024-04-11T14-13/4294967292.parquet
Created by: parquet-rs version 50.0.0
Properties:
  <omitted>
Schema:
message arrow_schema {
  optional binary _series_id (STRING);
  optional boolean bool_val;
  optional binary series_number (STRING);
  required int64 time (TIMESTAMP(NANOS,false));
}

Row group 0:  count: 1048576  12.47 B records  start: 4  total(compressed): 12.471 MB total(uncompressed):78.436 MB 
--------------------------------------------------------------------------------
               type      encodings count     avg size   nulls   min / max
_series_id     BINARY    Z RB_     1048576   11.44 B            "0000572d0275c43641eb6e30c..." / "ffffcb8073353dd1dc049d592..."
bool_val       BOOLEAN   Z RB_     1048576   0.13 B             "false" / "true"
series_number  BINARY    Z RB_     1048576   0.89 B             "sn-1" / "sn-41457"
time           INT64     Z BB_     1048576   0.02 B             "2024-04-11T14:13:00.94800..." / "2024-04-11T14:13:59.94600..."

Row group 1:  count: 1048576  12.46 B records  start: 13076946  total(compressed): 12.458 MB total(uncompressed):78.436 MB 
--------------------------------------------------------------------------------
               type      encodings count     avg size   nulls   min / max
_series_id     BINARY    Z RB_     1048576   11.44 B            "00000e57f40edc157f80ce5fd..." / "fffff80f8a7337a42f29feeda..."
bool_val       BOOLEAN   Z RB_     1048576   0.13 B             "false" / "true"
series_number  BINARY    Z RB_     1048576   0.87 B             "sn-41457" / "sn-729142"
time           INT64     Z BB_     1048576   0.02 B             "2024-04-11T14:13:00.94500..." / "2024-04-11T14:13:59.94600..."

Row group 2:  count: 902848  12.53 B records  start: 26140643  total(compressed): 10.790 MB total(uncompressed):67.053 MB 
--------------------------------------------------------------------------------
               type      encodings count     avg size   nulls   min / max
_series_id     BINARY    Z RB_     902848    11.45 B            "0000892cdd00c27f7f8f96614..." / "ffff2e54cfeebd2140004a143..."
bool_val       BOOLEAN   Z RB_     902848    0.13 B             "false" / "true"
series_number  BINARY    Z RB_     902848    0.93 B             "sn-729142" / "sn-999999"
time           INT64     Z BB_     902848    0.02 B             "2024-04-11T14:13:00.84400..." / "2024-04-11T14:13:59.94600..."

pauldix commented 5 months ago

The one_mil data set with that configuration represents the worst case for compression of _series_id as you have only 3M rows and 1M unique values. It would very interesting to see what this looks like with a much more modest cardinality (3M rows and 100k unique values for instance. Also, to have the parquet data sorted by _series_id, not by tags, which should have a big impact on the size, particularly if you have the writer use RLE for compressing the column. You may need to do quite a bit of fiddling around with things to test this last case. However, that would be the most representative as we intend to actually sort the Parquet files by series id.

hiltontj commented 5 months ago

I ran a test where the load data generated a measurement with a cardinality of one. This is the spec that was used:

{
  "name": "cardinality_one",
  "measurements": [
    {
      "name": "data",
      "tags": [
        {
          "key": "cardinality",
          "value": "one"
        }
      ],
      "fields": [
        {
          "key": "bool_val",
          "bool": true
        }
      ],
      "lines_per_sample": 10000
    }
  ]
}

Below is the parquet meta analysis, but notably,

the average size of each _series_id is ~1 B
the volume of data was several orders of magnitude less than the previous load tests

Parquet meta-analysis for cardinality of one

File path:  dbs/load_test/data/2024-04-11T14-46/4294967291.parquet
Created by: parquet-rs version 50.0.0
Properties:
  <omitted>
Schema:
message arrow_schema {
  required fixed_len_byte_array(32) _series_id;
  optional boolean bool_val;
  optional binary cardinality (STRING);
  required int64 time (TIMESTAMP(NANOS,false));
}

Row group 0:  count: 137  10.12 B records  start: 4  total(compressed): 1.354 kB total(uncompressed):5.756 kB 
--------------------------------------------------------------------------------
             type      encodings count     avg size   nulls   min / max
_series_id   FIXED[32] Z BB_     137       1.03 B           "0xFDED0EE165FEB5DD31AFB61..." / "0xFDED0EE165FEB5DD31AFB61..."
bool_val     BOOLEAN   Z RB_     137       0.45 B             "false" / "true"
cardinality  BINARY    Z RB_     137       0.59 B             "one" / "one"
time         INT64     Z BB_     137       8.04 B             "2024-04-11T14:46:00.17700..." / "2024-04-11T14:46:59.27900..."

hiltontj commented 5 months ago

It would very interesting to see what this looks like with a much more modest cardinality

@pauldix - in my recent comment, dialing down the cardinality does drastically decrease the average size of the _series_id (albeit, in this case to a cardinality of one is obviously too modest).

I will run some more modest versions of the one_mil spec as you suggest, e.g., 1k cardinality, 10k, 100k, etc., and post the results.

After that, I will start to think about implementing the sort by _series_id and RLE compression.

hiltontj commented 5 months ago

I ran the same meta analysis for a single tag data set producing the following cardinalities: 1k, 10k, 100k, and 1M.

The sampling interval and writer count (N_writers) were varied in each test to ensure that a resulting parquet file would have a full row group, i.e., containing 1,048,576 rows, from which I would analyze the stats.

Results

Cardinality	1k	10k	100k	1M
N_writers	10	5	5	5
Sampling Interval (ms)	50	100	100	1,000
Lines per sample	1,000	1,000	1,000	10,000
N_rows in Row Group	1,048,576	1,048,576	1,048,576	1,048,576
Avg. Size of `_series_id` (B)	0.04	0.12	1.13	10.93
Overhead of `_series_id` (MB)	0.04	0.13	1.18	11.46
Compressed Row Group Size (MB)	0.23	0.282	1.372	11.963
Uncompressed Row Group Size (MB)	33.663	33.564	33.919	44.386

_EDIT: I added the Overhead of _series_id row, which is calculated as *N_rows Avg. Size of _series_id**_

Observations

Average size of _series_id looks directly proportional to the cardinality in the file, within this range
Compressed row group size is about the same for 1k and 10k cardinality, increasing proportionately to cardinality beyond that
Uncompressed row group size is about the same for 1k, 10k, and 100k cardinality, and the increases for 1M

Not captured by this table, but can be seen, for example, in the full meta analysis above, is that individual row-groups cover different ranges of the available cardinality. The coverage of each row-group varies in extent, but the average size of the _series_id is consistent across all row groups in the file. This is the case for all test runs.

hiltontj commented 5 months ago

I think the above analysis is promising. The fact that the impact on file size is proportional to the cardinality should work in our favour; namely, sorting by _series_id should reduce the cardinality of each file, thereby reducing the impact of the _series_id on file size.

I am going to move onto #24909 to investigate how we go about doing the sorting, as well as introducing RLE for the _series_id column.

If we like what we are seeing here, then we can tackle the issue of how we enable the FixedSizeBinary column type, for real. I think the PRs I have opened on this issue serve as a solid starting point, but we need to make sure we are introducing it safely to IOx, so will definitely need a bit more work.

One idea I had, was to introduce the changes in IOx/core with a feature flag, that we enable explicitly in influxdb3, but is disabled by default in IOx. That way, any risk associated with the unimplemented!s I have added will be absent in IOx.

hiltontj commented 5 months ago

@pauldix - cross-posting some results from using non-standard encoding on the _series_id column here: https://github.com/influxdata/influxdb/issues/24909#issuecomment-2050592512. Looks promising.

alamb commented 5 months ago

I ran the same meta analysis for a single tag data set producing the following cardinalities: 1k, 10k, 100k, and 1M.

@hiltontj , do you have numbers for the overhead of the size of series_id for the table in https://github.com/influxdata/influxdb/issues/24845#issuecomment-2050158203?

Specifically, how much larger is a the file with the (extra) series_id column that without / what percentage of the storage space is taken up by series_id?

hiltontj commented 5 months ago

@alamb - see below.

do you have numbers for the overhead of the size of series_id for the table in https://github.com/influxdata/influxdb/issues/24845#issuecomment-2050158203?

I updated https://github.com/influxdata/influxdb/issues/24845#issuecomment-2050158203 to include a row that shows the calculated overhead. It looks like this corresponds to the compressed overhead, though the parquet-cli tool is not explicit about that.

I think I am correct in saying that the overhead would be the average size of the _series_id multiplied by the number of rows. Since the average size looks to be a function of cardinality, then the overhead would also be a function of cardinality.

Specifically, how much larger is a the file with the (extra) series_id column that without / what percentage of the storage space is taken up by series_id?

https://github.com/influxdata/influxdb/issues/24909#issuecomment-2050592512 shows the with/without comparison side-by-side.

In that case, even the encoded version, which is substantially better than the non-encoded version, is 2-3x larger than the file without the _series_id. However, keep in mind that is relative; the data in those tests has the schema:

message arrow_schema {
  required fixed_len_byte_array(32) _series_id;
  optional boolean bool_val;
  optional binary series_number (STRING);
  required int64 time (TIMESTAMP(NANOS,false));
}

There is only one boolean field in addition to the single tag (series_number), time, and the _series_id.

This is how the individual rows stack up, for the row group that I pulled numbers from:

Row group 1:  count: 1048576  1.43 B records  start: 1485652  total(compressed): 1.427 MB total(uncompressed):4.775 MB
--------------------------------------------------------------------------------
               type      encodings count     avg size   nulls   min / max
_series_id     FIXED[32] Z BB      1048576   1.19 B           "0x0000E38A0B081FF3E80EDDF..." / "0xFFFEF19DD886C08A83A523F..."
bool_val       BOOLEAN   Z RB_     1048576   0.13 B             "false" / "true"
series_number  BINARY    Z RB_     1048576   0.09 B             "sn-41454" / "sn-72912"
time           INT64     Z BB_     1048576   0.02 B             "2024-04-11T21:06:00.06400..." / "2024-04-11T21:06:59.96400..."

I have avoided presenting the overhead as a relative difference in total file size, because the relative impact of the _series_id would be less to a data set containing many fields than it would be to a data set containing few.

hiltontj commented 5 months ago

I wanted to provide one more summary here, that compares the difference in file size with an encoded _series_id column and without the _series_id column, for two cardinalities: 10k and 100k.

This re-uses the data from https://github.com/influxdata/influxdb/issues/24909#issuecomment-2050592512.

Below are the results, as well as the full row group stats for each test/cardinality pair.

Results

Test	No SID	SID + Sort + Enc	No SID	SID + Sort + Enc
Cardinality	10k	10k	100k	100k
Avg. Size of `_series_id` (B)	-	0.11	-	1.19
Overhead of `_series_id` (MB)¹	-	0.12	-	1.25
Compressed Row Group Size (MB)	0.156	0.269	0.25	1.427
Uncompressed Row Group Size (MB)	1.561	2.417	1.916	4.775

Constants

Parameter	Value
N_writers	5
Sampling Interval (ms)	100
Lines per sample	1,000
N_rows in Row Group	1,048,576

Row Group Stats

Test: No SID, Cardinality: 10k

Row group 0:  count: 1048576  0.15 B records  start: 4  total(compressed): 155.515 kB total(uncompressed):1.561 MB
--------------------------------------------------------------------------------
               type      encodings count     avg size   nulls   min / max
bool_val       BOOLEAN   Z RB_     1048576   0.13 B             "false" / "true"
series_number  BINARY    Z RB_     1048576   0.01 B             "sn-1" / "sn-4143"
time           INT64     Z BB_     1048576   0.02 B             "2024-04-12T14:54:00.04700..." / "2024-04-12T14:54:59.94900..."

Test: SID + Sort + Enc, Cardinality: 10k

Row group 0:  count: 1048576  0.26 B records  start: 4  total(compressed): 268.692 kB total(uncompressed):2.417 MB
--------------------------------------------------------------------------------
               type      encodings count     avg size   nulls   min / max
_series_id     FIXED[32] Z BB      1048576   0.11 B           "0x0011164692CD3CDDE71F0FC..." / "0xFFEB1A524B9350EA888990E..."
bool_val       BOOLEAN   Z RB_     1048576   0.13 B             "false" / "true"
series_number  BINARY    Z RB_     1048576   0.01 B             "sn-1" / "sn-4143"
time           INT64     Z BB_     1048576   0.02 B             "2024-04-12T14:46:00.07500..." / "2024-04-12T14:46:59.97500..."

Test: No SID, Cardinality: 100k

Row group 0:  count: 1048576  0.24 B records  start: 4  total(compressed): 250.379 kB total(uncompressed):1.916 MB
--------------------------------------------------------------------------------
               type      encodings count     avg size   nulls   min / max
bool_val       BOOLEAN   Z RB_     1048576   0.13 B             "false" / "true"
series_number  BINARY    Z RB_     1048576   0.09 B             "sn-1" / "sn-41454"
time           INT64     Z BB_     1048576   0.03 B             "2024-04-12T00:21:00.04300..." / "2024-04-12T00:21:59.94500..."

Test: SID + Sort + Enc, Cardinality: 100k

Row group 1:  count: 1048576  1.43 B records  start: 1485652  total(compressed): 1.427 MB total(uncompressed):4.775 MB
--------------------------------------------------------------------------------
               type      encodings count     avg size   nulls   min / max
_series_id     FIXED[32] Z BB      1048576   1.19 B           "0x0000E38A0B081FF3E80EDDF..." / "0xFFFEF19DD886C08A83A523F..."
bool_val       BOOLEAN   Z RB_     1048576   0.13 B             "false" / "true"
series_number  BINARY    Z RB_     1048576   0.09 B             "sn-41454" / "sn-72912"
time           INT64     Z BB_     1048576   0.02 B             "2024-04-11T21:06:00.06400..." / "2024-04-11T21:06:59.96400..."

¹ - Overhead of _series_id is calculated as Average Size of _series_id multiplied by N_rows

alamb commented 5 months ago

Test: SID + Sort + Enc, Cardinality: 100k

So if I am reading these values correctly, in the high cardinality case, of storing the series id requires around 5x the space compared to the rest of the data in the benchmark (1.19B/row vs 0.24B per row for everything else)

I think a more realistic dataset is to have a float_val column which I expect would take up more than the bool_val column.

In any event, my conclusion is that in may cases the overhead of storing a _series_id column will be substantial

hiltontj commented 5 months ago

@alamb - here is an example using a float instead of a boolean:

Row group 0:  count: 1048576  3.84 B records  start: 4  total(compressed): 3.839 MB total(uncompressed):7.159 MB
--------------------------------------------------------------------------------
               type      encodings count     avg size   nulls   min / max
_series_id     FIXED[32] Z BB      1048576   1.17 B           "0x000382C6FC1F91A82EC716A..." / "0xFFFE7C8DACBE5533A1BF59C..."
float_val      DOUBLE    Z RB_     1048576   2.56 B             "-0.0" / "100.0"
series_number  BINARY    Z RB_     1048576   0.09 B             "sn-1" / "sn-41454"
time           INT64     Z BB_     1048576   0.01 B             "2024-04-12T18:21:00.48800..." / "2024-04-12T18:21:59.58900..."

However, this is a random float between 0 and 100, which will compress terribly, and therefore, in practise, a float column might be considerably smaller.

The load generator can only do random or fixed floats right now, neither of which is realistic, so, it may be worth having a mode that produces more deterministic output.

alamb commented 5 months ago

Thanks @hiltontj -- I still think it is a good data point -- specifically the series_is will likely take approximately as much space as a float field

influxdata / influxdb

Support column type for storing bytes #24845

Questions/Considerations

Investigation: Extend the TableBuffer with the capability to support FixedSizeBinary column data

Changes to IOx/core (https://github.com/influxdata/influxdb3_core/pull/12)

Changes to influxdb (https://github.com/influxdata/influxdb/pull/24902)

Next Steps

Test: one_mil

Setup

Results

Test: one_mil_modified

Setup

Results

Discussion

Parquet Meta-Analysis when storing the _series_id as bytes

Parquet Meta-Analysis for _series_id as String

Parquet meta-analysis for cardinality of one

Results

Observations

Results

Constants

Row Group Stats

Test: No SID, Cardinality: 10k

Test: SID + Sort + Enc, Cardinality: 10k

Test: No SID, Cardinality: 100k

Test: SID + Sort + Enc, Cardinality: 100k

Investigation: Extend the `TableBuffer` with the capability to support `FixedSizeBinary` column data

Changes to IOx/`core` (https://github.com/influxdata/influxdb3_core/pull/12)

Changes to `influxdb` (https://github.com/influxdata/influxdb/pull/24902)

Test: `one_mil`

Test: `one_mil_modified`

Parquet Meta-Analysis when storing the `_series_id` as bytes

Parquet Meta-Analysis for `_series_id` as `String`