timescale / timescaledb-toolkit

Extension for more hyperfunctions, fully compatible with TimescaleDB and PostgreSQL 📈
https://www.timescale.com
Other
371 stars 46 forks source link

`topn()` fails on large number of `NULL`s #767

Open SimonMTS opened 1 year ago

SimonMTS commented 1 year ago

Relevant system information:

Describe the bug Using topn() on a large number of NULL rows causes an Option::unwrap() error. On a smaller number of rows it works as expected.

To Reproduce Steps to reproduce the behavior:

  1. Start the image timescale/timescaledb-ha:pg15-all. (This is how I reproduced it, but it probably happens on all recent versions)
  2. Run the following SQL to setup the database:

    CREATE TABLE public.example (time TIMESTAMPTZ NOT NULL, string_value TEXT NULL);
    
    INSERT INTO public.example (time, string_value)
       SELECT time, NULL AS string_value
       FROM generate_series(now() - interval '5 day', now(), interval '1 second') AS g1(time);
  3. Run the following query to trigger the bug: SELECT topn(mcv_agg(1, string_value)) FROM public.example;
  4. See the error ERROR: called 'Option::unwrap()' on a 'None' value

Expected behavior On smaller (NULL) datasets topn() results in:

 topn
------
(0 rows)

I expect this behaviour not to change depending on dataset size.

Actual behavior On larger (NULL) datasets topn() results in:

ERROR:  called `Option::unwrap()` on a `None` value

Additional context The bug can be reproduced on Timescale Cloud and locally in docker.

This bug was particularly pernicious because it only occurs on larger datasets. So we didn't encounter it during testing.

Workaround A Timescale Support agent quickly provided us with a workaround. Replacing string_value with COALESCE(string_value,''::text)), e.g. SELECT topn(mcv_agg(1, COALESCE(string_value,''::text))) FROM public.example;

Which changes the behaviour to return an empty string instead of no results. But did solve our production issue.

supergoudvis116 commented 5 months ago

Cool story bro, no one seem to care.

vanderhoop commented 2 weeks ago

FWIW, I don't think this is always related to the presence of NULL values. We're encountering the same error—called 'Option::unwrap()' on a 'None' value—when aggregating with mcv_agg on a non-null column. The aggregation seems to work fine on small datasets but chokes when we expand the time window.

We've tried the COALESCE trick mentioned in the summary, and no dice.

Given the issue appears similar, I would suggest changing the title of this issue to drop the reference to NULLs. Else, I can create a new issue.

vanderhoop commented 2 weeks ago

We're encountering the same error—called 'Option::unwrap()' on a 'None' value—when aggregating with mcv_agg on a non-null column. The aggregation seems to work fine on small datasets but blows up when we expand the time window.

Okay, so we found a workaround for our case. When trying to aggregate with a lower bound and no upper bound, we were running into the called 'Option::unwrap()' on a 'None' value error as we expanded the time window. By supplying an upper bound of time_column <= now(), the issue disappeared.

Hopefully this helps track down the root cause.

vanderhoop commented 2 weeks ago

Okay, so we found a workaround for our case. ... By supplying an upper bound of time_column <= now(), the issue disappeared.

Okay, so more information here 😄

Namely, the lower bound is also important here. Our users can select from preset time windows, and if they select a window whose lower bound is far beyond the first record for a given query, the mcv_agg call blows up with the 'Option::unwrap()' on a 'None' value error. For example, imagine the user selects a time window of 3 months, but there are only records for the last 2.5 months for their data. (Boom!)

So based on our (minimal) experiments, it seems that the creation of the mcv_agg breaks unless the bounds are scoped to where there is sure to be data. We can work around that for now, but it would make sense to make mcv_agg more resilient to these cases in the long term.