tvondra / sequential-uuids

generator of sequential UUIDs
https://blog.2ndquadrant.com/sequential-uuid-generators/
MIT License
302 stars 18 forks source link

Pure pgSQL implementation #2

Open shayded-exe opened 5 years ago

shayded-exe commented 5 years ago

Hello, I created a pure pgSQL implementation of the time-based generation method.

Hopefully this is useful for those of us using a service like AWS RDS where we can't install C-based extensions.

It doesn't support customizing interval_count, but that should be trivial to add.

CREATE OR REPLACE FUNCTION generate_sequential_uuid(p_interval_length int DEFAULT 60)
  RETURNS uuid
  LANGUAGE plpgsql
AS $$
DECLARE
  v_i int;
  v_time bigint;
  v_bytes int[16] = '{}';
  v_hex text[16] = '{}';
BEGIN
  v_time := floor(extract(epoch FROM clock_timestamp()) / p_interval_length);
  v_bytes[1] := v_time >> 8 & 255;
  v_bytes[2] := v_time & 255;

  FOR v_i IN 3..16 LOOP
    v_bytes[v_i] := floor(random() * 256);
  END LOOP;

  FOR v_i IN 1..16 LOOP
    v_hex[v_i] := lpad(to_hex(v_bytes[v_i]), 2, '0');
  END LOOP;

  RETURN array_to_string(v_hex, '');
END $$;
jackturnbull commented 5 years ago

Thanks for posting this. I decided to go with this implementation. What's worth considering for people attempting to write an equivalent function within their application code is that some languages, or at least Crystal, use a floor-rounded integer value for their float to integer conversion. In the above it appears to be that Postgres rounds the integer value of the float returned from extract(epoch FROM clock_timestamp()) / p_interval_length.

This caught me out for a while because I could see that with a interval length of 60, the UUID was ticking upwards at each half-minute within Postgres, but my Crystal implementation was ticking exactly on the minute.

I decided to round the above snippet rather than in my application code, so I'm using;

v_time := floor(extract(epoch FROM clock_timestamp()) / p_interval_length);

Where my application implementation is;

require "uuid"

struct UUID
  def self.sequential_random(random = Random::Secure, variant = Variant::RFC4122, version = Version::V4)
    new_bytes = random.random_bytes(16)
    v_time = (Time.utc.to_unix_f / 60).to_u
    new_bytes[0] = (v_time >> 8 & 255).to_u8;
    new_bytes[1] = (v_time & 255).to_u8;

    # `new` calls the default UUID constructor, so the byte-array to UUID conversion happens in that function.
    new(new_bytes, variant, version)
  end
end

Now I don't think too many people will have use for the Crystal snippet but this implementation detail might help others implementing it in their own application software.

shayded-exe commented 5 years ago

@jackturnbull Interesting note about the rounding. I suppose that would only be an issue if you're generating ids both on Postgres and in your application and storing them in the same context.

Glad you found it useful! :)

dinhduongha commented 5 years ago

I also created another version - next_uuid - compatible with ulid. Hope this help.

CREATE OR REPLACE FUNCTION next_uuid(OUT result uuid) AS $$
DECLARE
    now_millis bigint;
    second_rand bigint;
    hex_value text;
    shard_id int:=1;
BEGIN
    -- Can use clock_timestamp() / statement_timestamp()
    select (extract(epoch from transaction_timestamp())*1000)::BIGINT+(extract(milliseconds from transaction_timestamp()))::BIGINT INTO now_millis;
    select ((random() * 10^18)::BIGINT) INTO second_rand;
    -- Uncomment below line to ignore sharding.
    -- select ((random() * 10^6)::INT) INTO shard_id;
    hex_value := RPAD(TO_HEX(now_millis), 12, '0')||LPAD(TO_HEX(shard_id), 4, '0')||LPAD(TO_HEX(second_rand), 16, '0');
    result := CAST(hex_value AS UUID);
END;
$$ LANGUAGE PLPGSQL;
tvondra commented 5 years ago

Nice! I wonder how to best track those, so that users can find them. Keeping them buried in a github issue is not great, because people are unlikely to find them here. OTOH I can't just add them to the extension as that won't work on managed services (which I think is the main point of those SQL-only variants).

I think two things might work

1) Creating a separate SQL-only extension (still does not work on RDS-like environments, but it's easier to get working compared to extension with C functions).

2) Adding them to the docs, and referencing them from the docs as an alternative for cases where installing a C extension is not possible/desirable.

I think (2) is fine, but I'd like to know your opinions as it's your code.

shayded-exe commented 5 years ago

@tvondra I think just linking them in the readme should suffice. I published mine as a gist so it's easier to link to https://gist.github.com/rshea0/f4c2e26829d82ed8d38eb5e6e6374ec2

I imagine that most people won't bother installing such a small SQL-only extension anyways and will probably just copy the function into their DB scripts.

daesu commented 4 years ago

@PachowStudios thanks for the snippet. Maybe I'm missing something but order by id is giving unordered results using the uuid's generated with this. Am I missing something ?

shayded-exe commented 4 years ago

@daesu Over what period of time were the ids generated? The ids will only be ordered if they were generated within the same interval period. You can increase the interval length if needed. The point isn't to have every id be ordered, but for them to not be completely random.

@tvondra Maybe you can comment on this? Am I correct?

KnowZero commented 4 years ago

I haven't tried the C version, but doing a quick benchmark of 100,000 all that is here

Baseline = EXPLAIN ANALYZE SELECT uuid_generate_v4() FROM generate_series (1,100000) AS g; Execution Time: 395.623 ms

@PachowStudios = EXPLAIN ANALYZE SELECT generate_sequential_uuid() FROM generate_series (1,100000) AS g; Execution Time: 3181.545 ms

@dinhduongha = EXPLAIN ANALYZE SELECT next_uuid() FROM generate_series (1,100000) AS g; Execution Time: 1351.426 ms

pgulid with base32 removed: EXPLAIN ANALYZE SELECT generate_ulid_uuid() FROM generate_series (1,100000) AS g; Execution Time: 712.480 ms

CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE OR REPLACE FUNCTION generate_ulid_uuid()
RETURNS uuid
AS $$
DECLARE

  timestamp  BYTEA = E'\\000\\000\\000\\000\\000\\000';
  unix_time  BIGINT;

BEGIN
  -- 6 timestamp bytes
  unix_time = (EXTRACT(EPOCH FROM clock_timestamp()) * 1000)::BIGINT;
  timestamp = SET_BYTE(timestamp, 0, (unix_time >> 40)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 1, (unix_time >> 32)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 2, (unix_time >> 24)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 3, (unix_time >> 16)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 4, (unix_time >> 8)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 5, unix_time::BIT(8)::INTEGER);

  RETURN encode( timestamp || gen_random_bytes(10) ,'hex')::uuid;
END
$$
LANGUAGE plpgsql
VOLATILE;

So generate_sequential_uuid non-C version seems to be almost 8X slower than uuid_generate_v4. next_uuid seems to be about 3.416X slower and pgulid implementation is only 1.8X slower.

I think the pgulid implementation is the way to go. Do note it doesn't seem to be compatible with next_uuid since they use different methods of timestamp accuracy?

Edit: The other downside is you are dependent on pgcrypto extension, which is installed by default with contrib. That said, I wanted to see what happens if we merge both version of next_uuid and pgulid.

EXPLAIN ANALYZE SELECT generate_ulid_uuid2() FROM generate_series (1,100000) AS g; Execution Time: 615.667 ms

It's actually faster! And only 1.556X slower than uuid_generate_v4. That said, random is postgres and many languages isn't true random, so the pgcrypto version is probably still more reliable I guess?

CREATE OR REPLACE FUNCTION generate_ulid_uuid2()
RETURNS uuid
AS $$
DECLARE

  timestamp  BYTEA = E'\\000\\000\\000\\000\\000\\000';
  unix_time  BIGINT;

BEGIN

  unix_time = (EXTRACT(EPOCH FROM clock_timestamp()) * 1000)::BIGINT;
  timestamp = SET_BYTE(timestamp, 0, (unix_time >> 40)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 1, (unix_time >> 32)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 2, (unix_time >> 24)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 3, (unix_time >> 16)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 4, (unix_time >> 8)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 5, unix_time::BIT(8)::INTEGER);

  RETURN (encode( timestamp ,'hex')||LPAD(TO_HEX((random() * 10^6)::INT), 4, '0')||LPAD(TO_HEX((random() * 10^18)::BIGINT), 16, '0'))::uuid;
END
$$
LANGUAGE plpgsql
VOLATILE;

Edit2:

Also, if someone wants to give up ULID compatibility for larger random numbers pool but doesn't mind that it will wrap around in 84 years. (based on the pgcrypto implementation with same speed). What it does is effectively start epoch at 2020 instead of 1970, which allows 1 less character being used for timestamp without loss of precision, but it is limited to only 84 years after which it wraps around.

CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE OR REPLACE FUNCTION generate_uuid_84years(
    )
    RETURNS uuid
    LANGUAGE 'plpgsql'

    COST 100
    VOLATILE 

AS $BODY$
DECLARE

  timestamp  BYTEA = E'\\000\\000\\000\\000\\000';
  unix_time  BIGINT;

BEGIN

  unix_time = ((EXTRACT(EPOCH FROM clock_timestamp())-1577836800) * 1000)::BIGINT;

  timestamp = SET_BYTE(timestamp, 0, (unix_time >> 32)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 1, (unix_time >> 24)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 2, (unix_time >> 16)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 3, (unix_time >> 8)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 4, unix_time::BIT(8)::INTEGER);

  RETURN encode( timestamp || gen_random_bytes(11) ,'hex')::uuid;
END
$BODY$;
Tostino commented 3 years ago

Here is another implementation of the time based generator in plpgsql, with the same parameter support as the C extension: https://gist.github.com/Tostino/ca104bff40e704b6db70b9af492664ef

tac-adam-brusselback commented 5 months ago

Here is an improved implementation that is ~2.5x faster than the implementation I left above (as Tostino):

CREATE FUNCTION uuid_time_nextval(interval_length int DEFAULT 60, interval_count int DEFAULT 65536)
    RETURNS uuid
    LANGUAGE plpgsql
AS $BODY$
DECLARE
    v_time bigint;
    v_uuid bytea := E'\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000'; -- Zero-filled bytea
    v_prefix_bytes int := 0;
    v_i int;
BEGIN
    -- Validate input parameters
    IF interval_length < 1 THEN
        RAISE EXCEPTION 'length of interval must be a positive integer';
    END IF;
    IF interval_count < 1 THEN
        RAISE EXCEPTION 'number of intervals must be a positive integer';
    END IF;

    -- Compute the time-based prefix
    v_time := floor(extract(epoch FROM clock_timestamp()) / interval_length);

    -- Calculate the number of bytes needed for the prefix
    v_i := interval_count;
    WHILE v_i > 1 LOOP
            v_i := v_i / 256;
            v_prefix_bytes := v_prefix_bytes + 1;
        END LOOP;

    -- Set the time-based prefix bytes
    FOR v_i IN 0..v_prefix_bytes-1 LOOP
            v_uuid := set_byte(v_uuid, v_i, ((v_time >> (8 * (v_prefix_bytes - v_i - 1))) & 255)::integer);
        END LOOP;

    -- Generate random bytes for the rest of the UUID
    FOR v_i IN v_prefix_bytes..15 LOOP
            v_uuid := set_byte(v_uuid, v_i, (floor(random() * 256))::int);
        END LOOP;

    -- Return the hexadecimal representation of the UUID
    RETURN encode(v_uuid, 'hex');
END $BODY$;
alecmev commented 4 months ago

A bigint version, with ~4.5 hour wrap-around:

create function id() returns bigint as $$ begin
  return (
    floor((extract(epoch from clock_timestamp()) / 60) % 256)::int8::bit(8)
    || ('x' || encode(gen_random_bytes(7), 'hex'))::bit(56)
  )::bit(64)::bigint;
end; $$ language plpgsql;

And an even further text-based departure, with log2(alphabet_size ^ number_of_bytes) bits of entropy:

create function id() returns text as $$ declare
  bytes bytea;
  alphabet bytea = '34abcdefhijkmnoprstwxy';
  output text = '';
begin
  bytes = set_byte(
    gen_random_bytes(14),
    0,
    floor((extract(epoch from clock_timestamp()) / 60) % 256)::integer
  );

  for i in 0..(length(bytes) - 1) loop
    output = output || chr(get_byte(
      alphabet,
      get_byte(bytes, i) % length(alphabet)
    ));
  end loop;

  return output;
end; $$ language plpgsql;