timescale / timescaledb-toolkit

Extension for more hyperfunctions, fully compatible with TimescaleDB and PostgreSQL 📈
https://www.timescale.com
Other
385 stars 47 forks source link

Default collation does not work on Alpine #148

Open joostlammers opened 3 years ago

joostlammers commented 3 years ago

Relevant system information:

Describe the bug Getting the rows from our timescale db continuous Materialized view, we keep getting "ERROR: deserialization error invalid collation "pg_catalog"."en_US.utf8" CONTEXT: extension/src/hyperloglog.rs:106:5 SQL state: XX000"

To Reproduce Steps to reproduce the behavior:

  1. Create table:

    CREATE TABLE some_table ( time TIMESTAMP WITH TIME ZONE NOT NULL, countryCode VARCHAR(2), clientIp VARCHAR(40) );

  2. Create Materialized view:

    CREATE MATERIALIZED VIEW view_name WITH (timescaledb.continuous) AS SELECT time_bucket('01:00:00'::interval, some_table."time") AS "time", some_table.clientip, timescale_analytics_experimental.hyperloglog_count(timescale_analytics_experimental.hyperloglog(64, some_table.countryCode)) AS countrycodes, count(*) AS clientcalls FROM some_table GROUP BY (time_bucket('01:00:00'::interval, some_table."time")), some_table.countrycode, some_table.clientip;`

  3. Add some data to the table

  4. Try to grab some data from this view

    SELECT * FROM public.view_name

  5. Error appears

Additional information: Using PGAdmin to retrieve the code of the view and executing it partially, it appears it's caused within:

_timescaledb_internal.finalize_agg('timescale_analytics_experimental.hyperloglog(integer,anyelement)'::text, 'pg_catalog'::name, 'default'::name, '{{pg_catalog,int4},{pg_catalog,varchar}}'::name[], _materialized_hypertable_6.agg_4_4, NULL::timescale_analytics_experimental.hyperloglog) AS countrycodes,

just before the inner-join.

JLockerman commented 3 years ago

Thank you for the excellent bug report!

I was able to reproduce the bug locally, and will report back one we've determined the cause. (note for future repro'ers: you need to refresh the continuous aggregate once the data has been inserted to trigger the bug)

JLockerman commented 3 years ago

Simpler repro of the same issue: install the extension and run

SELECT timescale_analytics_experimental.hyperloglog_count('{
  "version":1,
  "element_type":"VARCHAR",
  "collation":["pg_catalog","en_US.utf8"],
  "b":6,
  "registers":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0]
}');

This will cause an error like

ERROR:  called `Result::unwrap()` on an `Err` value: Error("invalid collation \"pg_catalog\".\"en_US.utf8\"", line: 4, column: 41)
LINE 1: SELECT timescale_analytics_experimental.hyperloglog_count('{
                                                                  ^
CONTEXT:  extension/src/hyperloglog.rs:130:1

which I believe is caused by the same underlying issue

JLockerman commented 3 years ago

Interestingly the default collation for the database is en_US.utf8

postgres=# \l
                                 List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges   
-----------+----------+----------+------------+------------+-----------------------
 postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 | 
 template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
(3 rows)

but when I try to create a column collated on it

CREATE TABLE test (foo text collate "en_US.utf8");

I get an error

ERROR:  collation "en_US.utf8" for encoding "UTF8" does not exist
LINE 1: create table test (foo text collate "en_US.utf8");

Next I'm going to check if this problem exists in the upstream image out nightlies are based on.

JLockerman commented 3 years ago

It looks this issue is present in the current timescale/timescaledb:latest-pg12. I will try to open an error report there.

Error Summary

For stability across machines in multinode, which may have different OIDs, we serialize collations as a (namespace, name) pair. The default collation (OID 100), is special, as it does not refer to a "real" collation but the default for the database, and since this may differ across databases, we look in the database catalog to discover what the collation actually is, and serialize that. Unfortunately, in the current timescale/timescaledb:latest-pg12 the default database collation is set to en_US.utf8, a collation not supported by that version of the database, so when we try to deserialize this collation the DB complains that we're trying to use a collation it does not know about. For now we have two takeaways:

  1. We should fix this in the upstream TimescaleDB image, it shouldn't be defaulting to a collation that does not exist.
  2. We need to determine if we can detect and handle this edge case when serializing collations.
janfockaert commented 3 years ago

Thanks @JLockerman for the quick and good responses, is there a manual workaround that we can apply for now?

JLockerman commented 3 years ago

For a short-term fix I'm changing the nightly image to be based on our debian image in PR https://github.com/timescale/timescale-analytics/pull/149; this will make the collation things work, and make the nightly image more similar to the release image, and switch to a more-tested code paths, hopefully preventing other issues in the future.

Longer term, we're planning to switch our HLL implementation for an HLL++ implementation, and while we doing that we'll add some detection for this case (probably by ignoring the default collation a treating it like the C collation; it looks like Postgres guarantees that the default collation will be byte-wise compatible, and text_hash() ignores the default collation anyway).

JLockerman commented 3 years ago

@janfockaert switching to a different collation for the hyperloglog should work in the meantime, for instance

timescale_analytics_experimental.hyperloglog(buckets, data COLLATE "C")
janfockaert commented 3 years ago

Thx, the workaround works as expected 👍 The new nightly build is not updated yet, was the build broken maybe?

JLockerman commented 3 years ago

note: it probably pays to detect the input collation and if it's deterministic (or default?) just use C.

JLockerman commented 3 years ago

Thx, the workaround works as expected 👍 The new nightly build is not updated yet, was the build broken maybe?

Nope CI builds are currently broken. It works locally so I pushed a manual build.

JLockerman commented 3 years ago

Nightly builds should be fixed by PR https://github.com/timescale/timescale-analytics/pull/154

bikashthapacoding commented 2 years ago

Is it fixed ? or not I have same error

yaitskov commented 2 years ago

latest docker image produces the error on schema from TimescaleDb getting started page:

select distinct_count(hyperloglog((2^18) :: int, city_name)) from weather_metrics;
select distinct_count(hyperloglog((2^18) :: int, city_name)) from weather_metrics;
ERROR:  deserialization error invalid collation "pg_catalog"."C.UTF-8"
CONTEXT:  extension/src/hyperloglog.rs:126:31