timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/
Other
16.89k stars 853 forks source link

[Bug]: Correlation on chunks do not sum up into hypertable statistics #6779

Closed igor2x closed 2 months ago

igor2x commented 3 months ago

What type of bug is this?

Incorrect result

What subsystems and features are affected?

Other

What happened?

In most of the cases in our biggest hypertable I expect we have mainly inserts. Using pg_stat_activity I have seen plenty of updates on table and now I was wondering how are data clustered.

SELECT correlation
FROM pg_stats
WHERE schemaname = 'my_schema' AND tablename = 'my_hypertable' and attname = 'time'

Correlation:

I except for my time column to be near 1 in my case it is 0.9991473.

Now I want to see if all of the chunks have similar correlation value.

SELECT
    AVG(correlation) as correlation
FROM
    pg_stats
WHERE
    schemaname = '_timescaledb_internal' AND
    tablename LIKE '_hyper_21_%_chunk' AND
    attname = 'time';

and I expect to get similar value near 1, but to my surprise I have got: 0.030021765771206122. I know both values can't be exactly the same, because chunks have different number of rows, but I expect to be both values near 1.

To look in to details by chunks, sorting data by correlation column:

SELECT
    schemaname,
    tablename,
    attname as tabel_column,
    correlation
FROM
    pg_stats
WHERE
    schemaname = '_timescaledb_internal' AND
    tablename LIKE '_hyper_21_%_chunk' AND
    attname = 'time'
ORDER BY
    correlation desc
;
schemaname tablename tabel_column correlation
_timescaledb_internal _hyper_21_2259_chunk time 0.33352998
_timescaledb_internal _hyper_21_3168_chunk time 0.24424154
_timescaledb_internal _hyper_21_3291_chunk time 0.24273652
_timescaledb_internal _hyper_21_34857_chunk time 0.24208757
_timescaledb_internal _hyper_21_3174_chunk time 0.24178517
_timescaledb_internal _hyper_21_35967_chunk time 0.24155508
_timescaledb_internal _hyper_21_3128_chunk time 0.24059464

then few hundreds chunks and at the end:

schemaname tablename tabel_column correlation
_timescaledb_internal _hyper_21_35216_chunk time -0.14595592
_timescaledb_internal _hyper_21_35233_chunk time -0.15703407
_timescaledb_internal _hyper_21_35226_chunk time -0.16049138
_timescaledb_internal _hyper_21_35213_chunk time -0.16416267
_timescaledb_internal _hyper_21_35217_chunk time -0.16622631
_timescaledb_internal _hyper_21_35191_chunk time -0.17012595
_timescaledb_internal _hyper_21_35215_chunk time -0.17249262
_timescaledb_internal _hyper_21_35223_chunk time -0.1738908
_timescaledb_internal _hyper_21_35220_chunk time -0.17461224

I think this is unexpected if hypertable has 0.9991473 value, I expect individual chunks correlations should be near this value (or more correctly average of chunks correlations should be near hypertable correlation value).

TimescaleDB version affected

2.14.2

PostgreSQL version used

15.6

What operating system did you use?

Red Hat 9.3

What installation method did you use?

RPM

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

1. Create hypertable.
2. Insert some rows.
3. Execute few updates
4. Compare correlation on hypertable vs. average correlation of chunks
igor2x commented 3 months ago

Additional info:

  1. I see partition index on time is by default in descending order, so expected value on hypertable level should be near -1.
  2. For one particular chunk where correlation was near 0,2 I executed: CLUSTER my_chunk USING index_on_time_column;
  3. I have rechecked the correlation on time column on now clustered chunk and I have got: -0.9999673. This is very near -1 and this is pretty much expected value.

Conclusions:

  1. It looks like on hypertable correlations not updated. Maybe this is the value when first time PostgreSQL table was created and before it was converted to hypertable. Why, I assume value near 1, is not usual for TimescaleDB, it should be near -1, because of descending order of index.
  2. It looks like, to my surprise, data in my chunks are really super "fragmented" (non-clustered) and I need to cluster them, to increase performance.
mkindahl commented 3 months ago

Hello @igor2x and thanks for the bug report.

It looks like, to my surprise, data in my chunks are really super "fragmented" (non-clustered) and I need to cluster them, to increase performance.

Chunks are not automatically clustered in any way just because they are chunks. You have to run cluster on them explicitly to get the correlation back. If you do a lot of updates on the table, correlation will go down because the new row versions are added last in the data file, which affects correlation.

However, running cluster and then analyze should get the correlation statistics back.

svenklemm commented 2 months ago

Statistics are tracked on a chunk level and do not propagate to the hypertable. The stats present on the hypertable are most likely from before turning it in hypertable since the hypertable relation itself stores no tuples and would have no stats. This is similar to how tables in a postgres inheritance tree work.