Think about partitioning the log table

mreithub commented 8 years ago

For large log tables it might make sense to transparently partition them (speeds up lookups and cleanup)

The recall_enable() function could then then get an additional parameter specifying the number of partitions you want (for the given logInterval). If the partition count is set to 1 (default), partitioning won't be enabled for that log table.

The trigger function has to dynamically create new partitions when needed (and could at the same time easily drop old ones, making manual cleanup() calls practically obsolete)

Special care has to be taken to decide on good partition start times as well as a good naming scheme for each of the partitioned tables.

mreithub commented 8 years ago

This feature though is something for the future (>1.0).

Also it might be a good idea (for performance reasons) to have separate trigger functions (one with partitioning, one without - maybe it's possible to have the partitioning one use (call) the simple one to avoid duplicate code)

mreithub commented 8 years ago

I noticed that there actually are two ways to partition the log data (since we store two timestamps for each entry):

by start time:
That one's the more straight forward one, but it doesn't necessarily help us with cleaning up the data (cleanup's done using the end timestamp - for good reasons), so after each cleanup call we'd have to check if there are any partitions are empty (and remove them). Worst case scenario would be having a huge number of almost empty partitions (a more advanced cleanup function might move those entries to a fallback partition though)
by end time:
With that one, the cleanup process is simple (simply drop outdated partitions), but log entries need to be moved to another partition when they become outdated.
There has to be a special partition where end-ts is NULL (but guess what, that's exactly the contents of the data table - plus information on when each entry were created)

Also: PostgreSQL honours check constraints on partitions to quickly discard the ones that can't possibly contain any data matching the query. If we're partitioning by end time, queries selecting by start time (which I guess will come up more often than those for the end timestamp) might not be able to take advantage of that.

To sum up, there's no one-size-fits-all solution for partitioning our log data. When using a naive version of start-time partitioning, we might lose old but still-existing log entries (but at least the start-time never changes). With end-time partitioning, there's no way to avoid moving entries between partitions when they change.

mreithub / pg_recall

Think about partitioning the log table #15