onemoredata / bagger

Massive log storage in PostgreSQL
BSD 2-Clause "Simplified" License

RFC: Storage Partition Layout and Management Proposal #99

Open einhverfr opened 7 months ago

einhverfr commented 7 months ago

Storage Layout of Data Partitions in Bagger

This issue is to track discussion on a design proposal for handling the data partition tables for V1 of Bagger. If accepted, this will be the basis for a new design document.

Servermap Change Management and Per-Generation Schemas

The first draft of this proposal sought to do away with generation-based schemas. Unfortunately this is probably not possible. The problem is that as servermaps change, the locations of copies may also change. Filtering these out based on time is not possible either, since messages may be delayed. For this reason, we will need versioned schemas.

I propose that partition schemas be named from the first host noted in the copies section of the servermap: "partitions_" followed by the hostport, followed by the servermap id, with all special characters replaced by underscores.

Example: partitions_storage_1_5432_1 on both systems where Schaufel is writing.
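As a concrete illustration of the naming rule, here is a minimal Python sketch. The exact substitution rule (every non-alphanumeric character becomes an underscore) is an assumption, and `partition_schema_name` is a hypothetical helper, not Bagger code:

```python
import re

def partition_schema_name(hostport: str, servermap_id: int) -> str:
    """Build a per-generation schema name from the first host in the
    servermap's copies section: "partitions_" + hostport + "_" + servermap id,
    with special characters replaced by underscores (assumed rule)."""
    sanitized = re.sub(r'[^A-Za-z0-9]', '_', hostport)
    return f"partitions_{sanitized}_{servermap_id}"

print(partition_schema_name("storage-1:5432", 1))  # partitions_storage_1_5432_1
```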

Temporal Partitioning

OMD's Bagger software makes certain assumptions about how data structure changes can be made while in operation. For Version 1, we assume hourly partitioning is the only possible timeframe for partitioning. Future versions may revisit that question depending on use cases and customer requirements.

This means that tables are partitioned by hour, one table per hour of data, with other restrictions being applied based on specified dimensions. Indexes are created based on specified index settings. For version 1, the assumption is made that these values are known and are immutable at the time of table creation. Future versions may allow backfilling of partitioning and index operations but this is not expected to be supported in version 1.

The default storage strategy, however, will be assigned when the table is created. Because it affects storage and performance, a change applies to all newly created tables but not to existing ones. It does not affect read queries in any way and therefore does not have to be strictly immutable.

Tables will be created on demand in a schema following the naming convention above, based on an abstract empty table which houses the insert trigger. Indexes will then be detected and created. This happens after partition detection, which also requires some detection at routing time. The easy approach is to check each partition dimension against the full array of possible dimensions, both for temporal validity and for values, and then create the routing based on that. This requires that the number of stale (and preferably live!) dimensions remains small; if that changes, we may need a new strategy. We may also want to provide for purging stale dimension data once it has passed the data-expiration point. Schaufel would write to an insert table in the schema noted, and data would be routed within the same schema.
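The per-dimension routing check described above can be sketched roughly as follows. The catalog shape, dimension names, and values here are invented purely for illustration:

```python
from datetime import datetime, timezone

# Hypothetical dimension catalog: name -> (valid_from, valid_to, allowed values).
# The structure and entries are illustrative, not Bagger's actual catalog.
DIMENSIONS = {
    "env":  (datetime(2023, 1, 1, tzinfo=timezone.utc), None, {"prod", "dev"}),
    "host": (datetime(2023, 1, 1, tzinfo=timezone.utc), None, {"web1", "web2"}),
}

def route(record_time: datetime, values: dict) -> list:
    """Check each partition dimension for temporal validity and for a known
    value, returning the ordered dimension values used for routing."""
    out = []
    for name, (valid_from, valid_to, allowed) in DIMENSIONS.items():
        if record_time < valid_from or (valid_to and record_time >= valid_to):
            raise ValueError(f"dimension {name} not valid at {record_time}")
        if values[name] not in allowed:
            raise ValueError(f"unknown value {values[name]!r} for {name}")
        out.append(values[name])
    return out
```

Because every record consults the full dimension catalog, this stays cheap only while the set of stale and live dimensions remains small, which matches the constraint noted above.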

Open question:

Does the convention of data[dim1][dim2]...yyyy_mm_dd_hh work for table names?
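Assuming the dimension values and the hourly timestamp are joined with underscores (an assumption, since the separator is exactly what the open question leaves unresolved), the convention might look like:

```python
from datetime import datetime

def partition_table_name(dims, ts: datetime) -> str:
    """Table name per the proposed convention: "data", the dimension values,
    then yyyy_mm_dd_hh. Underscore joining is an assumption for illustration."""
    return "_".join(["data", *dims, ts.strftime("%Y_%m_%d_%H")])

print(partition_table_name(["prod", "web1"], datetime(2024, 3, 5, 14)))
# data_prod_web1_2024_03_05_14
```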

Temporal management of table structures

Dimensions and indexes are assumed to become valid at some future point at a partition boundary, and likewise may only become invalid in the future at the end of a partition. For this reason, partition and dimension data need not be checked on every request; it could instead be checked, for example, when the first table of a given hour is created. The current software expectation is that at least one full hour will pass before a dimension or index can take effect, so everything can be pre-loaded.
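One reading of that expectation, that a change takes effect at the next hourly partition boundary that is at least one full hour away, might be computed like this (a sketch of my interpretation, not actual Bagger code):

```python
from datetime import datetime, timedelta

def effective_from(now: datetime) -> datetime:
    """Earliest partition boundary at which a newly added dimension or index
    could take effect: the next hour boundary at least one full hour away.
    (Interpretation of the stated one-hour lead time; an assumption.)"""
    floor = now.replace(minute=0, second=0, microsecond=0)
    candidate = floor + timedelta(hours=1)
    if candidate - now < timedelta(hours=1):
        candidate += timedelta(hours=1)
    return candidate
```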

In the future, we may want to track backfill_targets and backfilled_to for indexes and partitioning as well, to make things easier for the query program if we ever allow backfilling; this is not done yet.

Pre-creation of Tables

Pre-creation of tables would be optional, but it would be important to configure Bagger to defer validity appropriately. The settings for this already exist.

Retention Handling

Periodically, once an hour, we expect a retention process to sweep the database and remove all tables beyond the expired time. This should be done in a non-blocking way.
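A minimal sketch of the sweep's selection logic, assuming table names end in the yyyy_mm_dd_hh suffix proposed above (the actual process would then drop each selected table separately, e.g. one short transaction per table, to stay non-blocking):

```python
from datetime import datetime

def expired_tables(tables, cutoff: datetime):
    """Select hourly partition tables whose yyyy_mm_dd_hh suffix falls
    entirely before the retention cutoff (names here are illustrative)."""
    floor = cutoff.replace(minute=0, second=0, microsecond=0)
    out = []
    for name in tables:
        ts = datetime.strptime("_".join(name.split("_")[-4:]), "%Y_%m_%d_%H")
        if ts < floor:  # the table's whole hour precedes the cutoff hour
            out.append(name)
    return out
```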

On query

On query, the querying process must:

  1. Compile a list of dimensions whose validity overlaps the query request
  2. Divide this into multiple ranges with different query dimensions if needed (this will usually not be necessary, but could be), and then
  3. Locate tables whose time ranges overlap the query period
  4. Run the query on all these tables.

I propose this be assisted by a view (possibly with a materialized component) which would allow for easier collapsing of table data by dimension.
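Step 3 above, locating tables whose time ranges overlap the query period, reduces to enumerating hour boundaries; a sketch:

```python
from datetime import datetime, timedelta

def hours_overlapping(start: datetime, end: datetime):
    """Enumerate hourly partition boundaries whose tables overlap the
    half-open query interval [start, end). Half-open semantics are an
    assumption for illustration."""
    h = start.replace(minute=0, second=0, microsecond=0)
    out = []
    while h < end:
        out.append(h)
        h += timedelta(hours=1)
    return out
```

Each returned boundary maps to one hourly table per matching dimension combination; the proposed view would then take care of collapsing these back together.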

Anyway: Thoughts?

einhverfr commented 7 months ago

Editing this because I just figured out why we still need different schemas for different generations (in this case we will use servermap.id)