Error in create_policy Execution Due to Incorrect tenant_id Key Extraction in schema.rb

Description:

This issue addresses an issue where an error occurs when executing the create_policy statement after specifying tenant_id as a string and running the generated schema.rb.

Environment:

Ruby 3.3.5 (2024-09-03 revision ef084cc8f4) [arm64-darwin23]
Rails 7.1.4
PostgreSQL 16.4

Issue:

When creating a tenant_id as a bigint, the result of "qual" is:

(tenant_id = (NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::bigint)

When creating a tenant_id as a varchar, the result of "qual" is:

((tenant_id)::text = ((NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::character varying)::text)

In this PR, code has been added to extract the key for tenant_id from this string. However, in the case of varchar, it retrieves (tenant_id)::text, resulting in the following output in schema.rb:

create_policy "users", partition_key: "(tenant_id)::text"

When this schema.rb is loaded, an error occurs at the create_policy statement:

bin/rails aborted!
(tenant_id)::text column is missing in users

Resolution:

The main purpose of the changes in this PR is to modify the key name for tenant_id. Given that no special specification is required when creating a policy, the following output in schema.rb should suffice:

create_policy "users"

Fix code

diff --git a/lib/activerecord-tenant-level-security/schema_dumper.rb b/lib/activerecord-tenant-level-security/schema_dumper.rb
index a258c3b..9c94232 100644
--- a/lib/activerecord-tenant-level-security/schema_dumper.rb
+++ b/lib/activerecord-tenant-level-security/schema_dumper.rb
@@ -39,7 +39,7 @@ module TenantLevelSecurity
     private

     def convert_qual_to_partition_key(qual)
-      matched = qual.match(/^\((.+?) = /)
+      matched = qual.match(/^\(*([a-zA-Z0-9_]+)(?:\)::[a-zA-Z0-9_]+)*\s*=/)
       # This error can occur if the specification of the 'tenant_policy' in PostgreSQL
       #   or the 'create_policy' method changes
       raise "Failed to parse partition key from 'pg_policies.qual': #{qual}" unless matched

I have also added the code for verification.

# Migration files
# --------------------------------------------------------
# db/migrate/20240918004701_create_tenants.rb
# --------------------------------------------------------
# class CreateTenants < ActiveRecord::Migration[7.1]
#   def up
#     create_table :tenants, id: false do |t|
#       t.string :id, limit: 26, primary_key: true
#       t.string :name
#     end
#   end

#   def down
#     drop_table :tenants
#   end
# end
#
# --------------------------------------------------------
# db/migrate/20240918092205_create_users.rb
# --------------------------------------------------------
# class CreateUsers < ActiveRecord::Migration[7.1]
#   def up
#     create_table :users, id: false do |t|
#       t.string :id, limit: 26, primary_key: true
#       t.references :tenant, null: false, foreign_key: true ,type: :string
#       t.string :email
#       t.string :name
#     end
#     create_policy :users
#   end

#   def down
#     drop_table :users
#   end
# end
#
# --------------------------------------------------------
# schema.rb
# --------------------------------------------------------
# ActiveRecord::Schema[7.1].define(version: 2024_09_18_092205) do
#   # These are extensions that must be enabled in order to support this database
#   enable_extension "plpgsql"

#   create_table "tenants", id: { type: :string, limit: 26 }, force: :cascade do |t|
#     t.string "name"
#   end

#   create_table "users", id: { type: :string, limit: 26 }, force: :cascade do |t|
#     t.string "tenant_id", null: false
#     t.string "email"
#     t.string "name"
#     t.index ["tenant_id"], name: "index_users_on_tenant_id"
#   end

#   add_foreign_key "users", "tenants"

#   create_policy "users"
# end

class TestSchemaDumper
  include TenantLevelSecurity::SchemaDumper

  def self.convert_qual_to_partition_key(qual)
    new.send(:convert_qual_to_partition_key, qual)
  end
end

int_qual = "(tenant_id = (NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::bigint)"
string_qual2 = "((tenant_id)::text = ((NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::character varying)::text)"

# Usage examples
int_result = TestSchemaDumper.convert_qual_to_partition_key(int_qual)
print("IntType [#{int_result == 'tenant_id'}] result: #{int_result}\n")
string_result = TestSchemaDumper.convert_qual_to_partition_key(string_qual2)
print("StringType [#{string_result == 'tenant_id'}] result: #{string_result}\n")

Run results

IntType [true] result: tenant_id
StringType [true] result: tenant_id

PR https://github.com/kufu/activerecord-tenant-level-security/pull/25

Thank you for reporting this.

This change makes sense, but I understand it's a compatibility issue. Instead of applying this fix, you can also migrate your policies by following the steps below. https://github.com/kufu/activerecord-tenant-level-security/pull/16#issuecomment-1503111326

This is the best option as it also addresses the issue of no good execution plan being chosen.

If we can easily fix the issue of (tenant_id)::text being mistakenly selected as the partition_key, I would not be opposed to merging #25. However, as pointed out in https://github.com/kufu/activerecord-tenant-level-security/pull/25#discussion_r1808028056, the current logic for extracting the partition key is somewhat ambiguous. Given this, it may be better to drop support for the old policy and return a more clear error message.

What do you think about this, and if there are any reasons why you can't migrate to the new policy?

I have reviewed the relevant PR.

As a premise, we are building a new service and are using "character varying(26)" for the tenant_id.

I confirmed that, by setting AlterPolicies in the PR to

TENANT_ID_DATA_TYPE = "character varying(26)"

the following output SQL is executed:

ALTER POLICY tenant_policy ON users
  USING (tenant_id = (NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::character varying(26))
  WITH CHECK (tenant_id = (NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::character varying(26));

Before applying this to everything, I plan to directly execute the above SQL on PostgreSQL.

The policy definition before execution:

\d users
-- omitted --
    POLICY "tenant_policy"
      USING (((tenant_id)::text = ((NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::character varying(26))::text))
      WITH CHECK (((tenant_id)::text = ((NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::character varying(26))::text))

The policy definition after execution:

\d users
-- omitted --
    POLICY "tenant_policy"
      USING (((tenant_id)::text = ((NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::character varying(26))::text))
      WITH CHECK (((tenant_id)::text = ((NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::character varying(26))::text))

As a result, the policy is set with (tenant_id)::text, and when dumped, schema.rb outputs create_policy "users", partition_key: "(tenant_id)::text".

Am I misunderstanding something in relation to this PR? Do we expect any changes in the output of schema.rb as a result of this new policy migration?

Hmm, that seems like strange behavior.

There seems to be a bit of an issue with the implementation of AlterPolicies, but putting that aside, queries that perform a cast to character varying(26) seem to always get cast to text in PostgreSQL.

activerecord_tenant_level_security_test=# CREATE POLICY tenant_policy ON string_employees AS PERMISSIVE FOR ALL TO PUBLIC USING (tenant_id = NULLIF(current_setting('tenant_level_security.tenant_id'), '')::character varying(26)) WITH CHECK (tenant_id = NULLIF(current_setting('tenant_level_security.tenant_id'), '')::character varying(26));
CREATE POLICY
activerecord_tenant_level_security_test=# \d string_employees
                  Table "public.string_employees"
  Column   |         Type          | Collation | Nullable | Default 
-----------+-----------------------+-----------+----------+---------
 tenant_id | character varying(26) |           |          | 
Policies (forced row security enabled):
    POLICY "tenant_policy"
      USING (((tenant_id)::text = ((NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::character varying(26))::text))
      WITH CHECK (((tenant_id)::text = ((NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::character varying(26))::text))

We should start by investigating whether this case would cause estimation problems and why such a cast is occurring.

Testing tenant_id as `integer`, `character varying(26)`, and `text`

Integer Case

CREATE TABLE IF NOT EXISTS sample_integer (
  tenant_id integer,
  id integer
);

-- Add 1000 rows for tenant_id = 1
INSERT INTO sample_integer (tenant_id, id) VALUES (1, (generate_series(1,1000)));

-- Add 1000 rows for tenant_id = 2
INSERT INTO sample_integer (tenant_id, id) VALUES (2, (generate_series(1,1000)));
ANALYZE sample_integer;

EXPLAIN SELECT * FROM sample_integer 
WHERE tenant_id = NULLIF(current_setting('tenant_level_security.tenant_id'), '')::integer;

Result:

                                          QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
Seq Scan on sample_integer (cost=0.00..54.00 rows=1000 width=8)
  Filter: (tenant_id = (NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::integer)
(2 rows)

Character varying(26) Case

CREATE TABLE IF NOT EXISTS sample_char (
  tenant_id character varying(26),
  id integer
);

-- Add 1000 rows for tenant_id = "1"
INSERT INTO sample_char (tenant_id, id) VALUES ('1', (generate_series(1,1000)));

-- Add 1000 rows for tenant_id = "2"
INSERT INTO sample_char (tenant_id, id) VALUES ('2', (generate_series(1,1000)));
ANALYZE sample_char;

EXPLAIN SELECT * FROM sample_char 
WHERE (tenant_id)::text = ((NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::character varying(26))::text;

Result:

                                          QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
Seq Scan on sample_char (cost=0.00..49.00 rows=1000 width=6)
  Filter: ((tenant_id)::text = ((NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::character varying(26))::text)
(2 rows)

Text Case

CREATE TABLE IF NOT EXISTS sample_text (
  tenant_id text,
  id integer
);

-- Add 1000 rows for tenant_id = "1"
INSERT INTO sample_text (tenant_id, id) VALUES ('1', (generate_series(1,1000)));

-- Add 1000 rows for tenant_id = "2"
INSERT INTO sample_text (tenant_id, id) VALUES ('2', (generate_series(1,1000)));
ANALYZE sample_text;

EXPLAIN SELECT * FROM sample_text 
WHERE (tenant_id)::text = NULLIF(current_setting('tenant_level_security.tenant_id'), '')::text;

Result:

                                      QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
Seq Scan on sample_text (cost=0.00..44.00 rows=1000 width=6)
Filter: (tenant_id = NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))
(2 rows)

Omitting Explicit Casting for text

EXPLAIN SELECT * FROM sample_text 
WHERE tenant_id = NULLIF(current_setting('tenant_level_security.tenant_id'), '')::text;

Result:

                                      QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
Seq Scan on sample_text (cost=0.00..44.00 rows=1000 width=6)
Filter: (tenant_id = NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))
(2 rows)

Results

In terms of cost, the sequence was: text < character varying(26) < integer.

Summary

PostgreSQL's current_setting function returns a text value, as shown by the following command:

SELECT pg_typeof(current_setting('tenant_level_security.tenant_id'));

Result:

pg_typeof
-----------
text
(1 row)

Given that the return type of current_setting is text, if we want to minimize the casting cost in comparisons (e.g., when comparing the value returned from current_setting('tenant_level_security.tenant_id')), it is more efficient to have the tenant_id columns in the relevant tables defined as text.

According to the PostgreSQL documentation (https://www.postgresql.org/docs/current/datatype-character.html), there is no performance difference between different character types like character varying(n) and text.

However, actual performance can vary based on factors such as:

Data distribution
Indexing
The actual number of rows

Taking all factors into consideration, it seems preferable to use text for tenant_id in cases where the tenant ID is represented as a string.

Additional Information

https://github.com/kufu/activerecord-tenant-level-security/issues/24#issuecomment-2433780252

We have updated the Tenants table and all tenant_id columns to use the text data type and have confirmed that everything works as expected.

\d users
-- omitted --
POLICY "tenant_policy"
  USING ((tenant_id = NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text)))
  WITH CHECK ((tenant_id = NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text)))

Even without this PR, since the type of current_setting and tenant_id are the same, there is no need for explicit type conversion, and the output of schema.rb is now correct.

EXPLAIN SELECT * FROM sample_text 
WHERE tenant_id = NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text);

Result:

                                          QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
Seq Scan on sample_text (cost=0.00..44.00 rows=1000 width=6)
  Filter: (tenant_id = NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))
(2 rows)

The execution plan remains unchanged as tenant_id is now defined as text.

We are planning to continue using tenant_id defined as text in the current version. While this PR might be necessary for cases where tenant_id is defined with a string type other than text, we believe it is unnecessary when using the text data type.

Thank you very much for all your support.

While this https://github.com/kufu/activerecord-tenant-level-security/pull/25 might be necessary for cases where tenant_id is defined with a string type other than text, we believe it is unnecessary when using the text data type.

Yeah, at this point, the definitely better solution is to use the text data type.

Whether or not we should merge #25 depends on whether or not we should allow types like character varying(26) in this gem. If it is unavoidable to cast the left side as (tenant_id)::text, PostgreSQL may misestimate the row count and may not choose a better execution plan. See also https://tech.smarthr.jp/entry/2024/01/18/110929 (Japanese)

I checked the linked page (I'm Japanese, so I'm more comfortable with Japanese^^). It seems that when using text, bigint, or uuid, the policy conditions are set in a way that avoids casting on the left side. However, for character varying, even if it is defined like this:

ALTER POLICY tenant_policy ON users
USING (tenant_id = (NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::character varying(26))
WITH CHECK (tenant_id = (NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::character varying(26));

It gets internally converted to:

POLICY "tenant_policy"
  USING (((tenant_id)::text = ((NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::character varying(26))::text))
  WITH CHECK (((tenant_id)::text = ((NULLIF(current_setting('tenant_level_security.tenant_id'::text), ''::text))::character varying(26))::text))

So, if you're going to use a string type, it's wise to avoid character varying. If someone is working in an existing environment where the type is already set, it might be helpful to inform them of this behavior.

Whether or not we should merge #25 depends on whether or not we should allow types like character varying(26) in this gem.

Considering that we were able to identify this issue, I believe that not applying this PR could also be a valid decision.

kufu / activerecord-tenant-level-security