Closed momania closed 5 years ago
Looks like something else is constantly increasing the `metrics_labels_id_seq` sequence.
Shortly after a fresh start, I have ~17k labels, but the sequence value is already way past 500k.

I'm not sure if it's related, but I've got plenty of these in the Postgres log:

    ERROR: nextval: reached maximum value of sequence "metrics_labels_id_seq" (2147483647)
So the problem is with executing this SQL statement: https://github.com/timescale/prometheus-postgresql-adapter/blob/master/postgresql/client.go#L76

Even though there is an `ON CONFLICT DO NOTHING` clause, the default value for the id is still requested, so the sequence is still incremented.
From https://www.postgresql.org/message-id/9545.1471737989%40sss.pgh.pa.us :

> Important: To avoid blocking concurrent transactions that obtain
> numbers from the same sequence, a nextval operation is never
> rolled back; that is, once a value has been fetched it is
> considered used and will not be returned again. This is true even
> if the surrounding transaction later aborts, or if the calling
> query ends up not using the value. For example an INSERT with an
> ON CONFLICT clause will compute the to-be-inserted tuple,
> including doing any required nextval calls, before detecting any
> conflict that would cause it to follow the ON CONFLICT rule
> instead. Such cases will leave unused "holes" in the sequence of
> assigned values. Thus, PostgreSQL sequence objects cannot be used
> to obtain "gapless" sequences.
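The gap behaviour is easy to reproduce against a scratch table (a minimal sketch; table and label values are illustrative, not the adapter's actual schema):

```sql
CREATE TABLE labels (
    id    serial PRIMARY KEY,
    name  text UNIQUE NOT NULL
);

-- First insert calls nextval and gets id = 1.
INSERT INTO labels (name) VALUES ('job') ON CONFLICT DO NOTHING;

-- Conflicting insert: no row is written, but nextval has already
-- been called, so the value 2 is burned.
INSERT INTO labels (name) VALUES ('job') ON CONFLICT DO NOTHING;

-- The next distinct label therefore gets id = 3, not 2.
INSERT INTO labels (name) VALUES ('instance') ON CONFLICT DO NOTHING;

SELECT last_value FROM labels_id_seq;
```

Since the adapter runs this insert once per incoming sample, the sequence grows with the number of samples, not the number of distinct labels.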
I've used something like this myself before to generate unique character-based ids:
-- Requires the pgcrypto extension for gen_random_bytes:
--   CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Create a trigger function that takes no arguments.
-- Trigger functions automatically have OLD, NEW records
-- and TG_TABLE_NAME as well as others.
CREATE OR REPLACE FUNCTION unique_short_id()
RETURNS TRIGGER AS $$
-- Declare the variables we'll be using.
DECLARE
  key TEXT;
  qry TEXT;
  found TEXT;
BEGIN
  -- Generate the first part of a query as a string with a safely
  -- escaped table name, using || to concatenate the parts.
  qry := 'SELECT id FROM ' || quote_ident(TG_TABLE_NAME) || ' WHERE id=';
  -- This loop will probably only run once per call until we've generated
  -- millions of ids.
  LOOP
    -- Generate our random bytes and re-encode them as a base64 string.
    key := encode(gen_random_bytes(6), 'base64');
    -- Base64 encoding contains 2 URL-unsafe characters by default.
    -- The URL-safe variant uses these replacements.
    key := replace(key, '/', '_'); -- url-safe replacement
    key := replace(key, '+', '-'); -- url-safe replacement
    -- Concat the generated key (safely quoted) with the generated query
    -- and run it, e.g.:
    --   SELECT id FROM "test" WHERE id='blahblah' INTO found
    -- Now "found" holds the duplicate id, or NULL.
    EXECUTE qry || quote_literal(key) INTO found;
    -- Check whether found IS NULL.
    -- If we tested found = NULL it would never be TRUE,
    -- because NULL = NULL evaluates to NULL, not TRUE.
    IF found IS NULL THEN
      -- If we didn't find a collision then leave the LOOP.
      EXIT;
    END IF;
    -- We haven't EXITed yet, so return to the top of the LOOP
    -- and try again.
  END LOOP;
  -- NEW and OLD are available in trigger procedures.
  -- NEW is the mutated row that will actually be INSERTed.
  -- We replace id, regardless of what it was before,
  -- with our key variable.
  NEW.id := key;
  -- The record returned here is what will actually be INSERTed,
  -- or what the next trigger will get if there is one.
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TABLE my_table (
  id text PRIMARY KEY,
  other_value bytea NOT NULL
);

CREATE TRIGGER trigger_key_id
  BEFORE INSERT ON my_table
  FOR EACH ROW EXECUTE PROCEDURE unique_short_id();
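A quick usage sketch (assuming pgcrypto is installed; the example id value is made up, since the keys are random):

```sql
-- The trigger fills in id, so we only supply the other column.
INSERT INTO my_table (other_value) VALUES ('\x00');

-- id is now an 8-character URL-safe key (6 random bytes, base64-encoded),
-- e.g. 'pUG2jc-M'.
SELECT id FROM my_table;
```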
I don't know if that will be performant enough for this purpose, though.
Hi @momania, thanks for reporting and investigating the issue. I am looking into it.
@momania and @akabos You can maybe execute something like `SELECT setval('metrics_labels_id_seq', MAX(id)) FROM metrics_labels;` to reset the sequence until we fix this issue.
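A slightly more defensive variant of that reset (my sketch, not from the thread) also handles the edge case where the table is empty and `MAX(id)` is NULL, using the three-argument form of `setval`:

```sql
-- If the table has rows, continue numbering after MAX(id);
-- if it is empty, rewind so that the next nextval returns 1.
SELECT setval('metrics_labels_id_seq',
              COALESCE(MAX(id), 1),
              MAX(id) IS NOT NULL)
FROM metrics_labels;
```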
Well, the problem starts occurring when the sequence value outruns the capacity of `int4`. By that time there is usually a fairly recently inserted label whose id is also close to that maximum, so such a reset won't hold up for very long. In a dynamic environment where labels come and go, this is not uncommon behaviour.

In my test environment I have collected 25M+ metric values since the reset, and the sequence is now just over 25M too, while the `max(id)` of my metric labels is just over 21M, so it tracks the number of metric values very closely.

It will surely hold up for a long while before it outruns an `int4` again, and I can fully reset again since it's a test environment anyway, but for any production setup the work-around is not going to hold up for long, and as such breaks it all.
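To put some rough numbers on how fast a reset is undone, here is a back-of-the-envelope sketch (the sample rate is hypothetical, not measured from this deployment):

```python
# Every incoming sample calls nextval once, even when the label already
# exists, so the sequence grows with the sample count, not the label count.
INT4_MAX = 2**31 - 1  # 2147483647, the ceiling reported in the error above

def days_until_exhausted(samples_per_second: float, start: int = 0) -> float:
    """Days until the sequence reaches INT4_MAX at the given consumption rate."""
    return (INT4_MAX - start) / samples_per_second / 86_400

# At 10k samples/s a freshly reset int4 sequence lasts only about 2.5 days.
print(f"{days_until_exhausted(10_000):.1f} days")  # -> 2.5 days
```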
Anyway, I'm sure you guys will figure out a fix soon. If there's anything I can do or test, let me know!
Is there any easy way to convert that sequence to `bigint`?
Sure, you can run an ALTER command: https://www.postgresql.org/docs/10/static/sql-altersequence.html . The downside is that using int8 adds 4 bytes to each data point... Also, migrating back to int4 would not be possible...
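A sketch of what that migration could look like, using the column and sequence names from this thread (PostgreSQL 10+ supports changing a sequence's type directly; any columns referencing these ids would need the same change):

```sql
-- Widen the id column itself.
ALTER TABLE metrics_labels ALTER COLUMN id TYPE bigint;

-- PostgreSQL 10+: widen the sequence so it no longer caps at 2147483647.
ALTER SEQUENCE metrics_labels_id_seq AS bigint;
```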
@momania @akabos I've created a PR to fix this. I just tested it locally and it seems to fix the issue. I'll probably give it more love after the weekend.
Since this fix introduces a `JOIN` before inserting metrics labels, there will be some performance cost. I'll run some performance tests to figure out what the impact is, so it would be great if you can try as well.
https://github.com/timescale/prometheus-postgresql-adapter/pull/47
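For readers who don't follow the PR link: the usual shape of such a fix is to filter out already-known labels before the `INSERT`, so `nextval` only fires for genuinely new ones. A hedged sketch (column names and types are illustrative, not necessarily the adapter's schema):

```sql
-- Only rows with no existing match reach the INSERT, so the sequence
-- is advanced only for genuinely new labels.
INSERT INTO metrics_labels (metric_name, labels)
SELECT v.metric_name, v.labels
FROM (VALUES ('up', '{"job":"node"}'::jsonb)) AS v (metric_name, labels)
LEFT JOIN metrics_labels existing
       ON existing.metric_name = v.metric_name
      AND existing.labels = v.labels
WHERE existing.id IS NULL
ON CONFLICT DO NOTHING;  -- still needed to guard against concurrent inserts
```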
Thanks for your effort!
I deployed the patched version yesterday and it seems to be working OK so far. I will check for sequence int4 depletion in a couple of days.
I've noticed some errors in my logs:

    msg="Error sending samples to remote storage" err="pq: integer out of range"
So I went in to investigate. The current value of the `metrics_labels_id_seq` sequence is too big to be used for label ids, as the id column is only an `int4`. I have no idea how it got that high (2,150,699,906), as I never had more than about 350,000 labels.

I've cleaned up now and restarted the sequence, and that fixes the issue for the moment. I'll keep an eye on it to see if I have a rogue label somewhere, but I thought I'd file an issue here anyway, as it might be something others have noticed too.
For completeness, I'm using Prometheus version 2.4.3, and Postgres 9.6 with: