Open joostlammers opened 3 years ago
Thank you for the excellent bug report!
I was able to reproduce the bug locally, and will report back once we've determined the cause. (Note for future repro'ers: you need to refresh the continuous aggregate once the data has been inserted to trigger the bug.)
Simpler repro of the same issue: install the extension and run
SELECT timescale_analytics_experimental.hyperloglog_count('{
"version":1,
"element_type":"VARCHAR",
"collation":["pg_catalog","en_US.utf8"],
"b":6,
"registers":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0]
}');
This will cause an error like
ERROR: called `Result::unwrap()` on an `Err` value: Error("invalid collation \"pg_catalog\".\"en_US.utf8\"", line: 4, column: 41)
LINE 1: SELECT timescale_analytics_experimental.hyperloglog_count('{
^
CONTEXT: extension/src/hyperloglog.rs:130:1
which I believe is caused by the same underlying issue
Interestingly the default collation for the database is en_US.utf8
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+------------+------------+-----------------------
postgres | postgres | UTF8 | en_US.utf8 | en_US.utf8 |
template0 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
(3 rows)
but when I try to create a column collated on it
CREATE TABLE test (foo text collate "en_US.utf8");
I get an error
ERROR: collation "en_US.utf8" for encoding "UTF8" does not exist
LINE 1: create table test (foo text collate "en_US.utf8");
Next I'm going to check if this problem exists in the upstream image our nightlies are based on.
It looks like this issue is present in the current timescale/timescaledb:latest-pg12. I will try to open an error report there.
For stability across machines in multinode, which may have different OIDs, we serialize collations as a (namespace, name) pair. The default collation (OID 100) is special, as it does not refer to a "real" collation but to the default for the database. Since this may differ across databases, we look in the database catalog to discover what the collation actually is, and serialize that. Unfortunately, in the current timescale/timescaledb:latest-pg12 the default database collation is set to en_US.utf8, a collation not supported by that version of the database, so when we try to deserialize this collation the DB complains that we're trying to use a collation it does not know about.
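The mismatch described above can be inspected from psql. This is only a diagnostic sketch using the standard Postgres catalogs (pg_collation and pg_database); it is not the extension's own lookup code, and the exact rows returned will depend on the image.

```sql
-- OID 100 is the placeholder "default" collation, not a real locale.
SELECT oid, collname, collnamespace::regnamespace AS namespace
FROM pg_collation
WHERE oid = 100;

-- What the database default actually resolves to.
SELECT datname, datcollate, datctype
FROM pg_database
WHERE datname = current_database();

-- Whether that locale name exists as a usable collation for this encoding.
-- Zero rows here would be consistent with the "does not exist" error above.
SELECT c.collname, c.collnamespace::regnamespace AS namespace, c.collencoding
FROM pg_collation c
JOIN pg_database d ON d.datname = current_database()
WHERE c.collname = d.datcollate
  AND c.collencoding IN (-1, d.encoding);
```

If the last query returns no rows, the (namespace, name) pair serialized for the default collation cannot be resolved back to a collation on that instance.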
For now we have two takeaways:
Thanks @JLockerman for the quick and good responses, is there a manual workaround that we can apply for now?
For a short-term fix I'm changing the nightly image to be based on our Debian image in PR https://github.com/timescale/timescale-analytics/pull/149; this will make collations work, make the nightly image more similar to the release image, and switch to more-tested code paths, hopefully preventing other issues in the future.
Longer term, we're planning to switch our HLL implementation for an HLL++ implementation, and while we're doing that we'll add some detection for this case (probably by ignoring the default collation and treating it like the C collation; it looks like Postgres guarantees that the default collation will be byte-wise compatible, and text_hash() ignores the default collation anyway).
@janfockaert switching to a different collation for the hyperloglog should work in the meantime, for instance
timescale_analytics_experimental.hyperloglog(buckets, data COLLATE "C")
Thx, the workaround works as expected 👍 The new nightly build is not updated yet, was the build broken maybe?
Note: it probably pays to detect the input collation and, if it's deterministic (or default?), just use C.
Thx, the workaround works as expected 👍 The new nightly build is not updated yet, was the build broken maybe?
Nope CI builds are currently broken. It works locally so I pushed a manual build.
Nightly builds should be fixed by PR https://github.com/timescale/timescale-analytics/pull/154
Is it fixed or not? I have the same error.
The latest Docker image produces the error on the schema from the TimescaleDB getting started page:
select distinct_count(hyperloglog((2^18) :: int, city_name)) from weather_metrics;
ERROR: deserialization error invalid collation "pg_catalog"."C.UTF-8"
CONTEXT: extension/src/hyperloglog.rs:126:31
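For reference, the COLLATE "C" workaround suggested earlier in the thread would look like this when applied to the query above. This is an untested sketch: it assumes the same workaround also avoids the "C.UTF-8" variant of the error, which nobody in the thread has confirmed.

```sql
-- Assumption: forcing the C collation on the input column avoids
-- serializing the database default collation ("C.UTF-8" here).
select distinct_count(hyperloglog((2^18) :: int, city_name COLLATE "C"))
from weather_metrics;
```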
Relevant system information:
Describe the bug Getting the rows from our timescale db continuous Materialized view, we keep getting "ERROR: deserialization error invalid collation "pg_catalog"."en_US.utf8" CONTEXT: extension/src/hyperloglog.rs:106:5 SQL state: XX000"
To Reproduce Steps to reproduce the behavior:
Create table:
Create Materialized view:
Add some data to the table
Try to grab some data from this view
Error appears
Additional information: Using PGAdmin to retrieve the code of the view and executing it partially, it appears it's caused within:
just before the inner-join.