stefankroes / ancestry

Organise ActiveRecord model into a tree structure
MIT License

Document and test with column collation #601

Closed · kbrock closed this 1 year ago

kbrock commented 1 year ago

Add documentation and tests for users to set collation on the ancestry column.

We need to encourage users to use a binary/ASCII collation for the ancestry column; otherwise they lose the main benefit of materialized paths, specifically that we can use an index for ancestry LIKE '$ancestry/%'.

I'm hoping we can drop ENV["ANCESTRY_COLLATION"], which is there because Canonical's default collation ignores punctuation when sorting/comparing strings. Getting everyone onto the proper indexes means all systems will behave in a more consistent manner.
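
A minimal sketch of what the documented recommendation could look like (assuming a users table on Postgres, where "C" is the byte-order collation; the names here are illustrative):

create_table :users do |t|
  # bytewise collation keeps ancestry LIKE 'prefix%' on the index
  t.string :ancestry, collation: "C"
  t.string :name
end
add_index :users, :ancestry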

kbrock commented 1 year ago

@kshnurov You have anything to add?

I already had this PR in the works and was happy when you mentioned binary string comparison and indexes in our other discussion.

kshnurov commented 1 year ago

I agree binary should be the default and always used, otherwise you lose big on performance.

What's the issue with postgres sorting? Binary just compares bytes.

kbrock commented 1 year ago

By default, people are using utf8 for the ancestry column, so they are not using an index for LIKE. Using that index is the main draw of this whole format in the first place.

We probably need to document the case where people migrate from a utf8-collated column to an ASCII one, as sketched below.
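
A hypothetical migration for that utf8-to-binary-collation move (Postgres; table and class names assumed):

class ChangeAncestryCollation < ActiveRecord::Migration[7.0]
  def up
    # keep the stored data, change only the comparison rules
    change_column :users, :ancestry, :string, collation: "C"
  end

  def down
    # revert to the database's default collation
    change_column :users, :ancestry, :string
  end
end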

kshnurov commented 1 year ago

@kbrock why are you mentioning Postgres? Can you reproduce the issues you're talking about? Also, there's no need to use collation; there's a binary type: add_column :table, :ancestry, :binary

kshnurov commented 1 year ago

@kbrock what's the rush with merging something that isn't complete or correct?

kbrock commented 1 year ago

Thank you for asking me to show my work. I was just focused on Postgres.

Postgres

Setting the collation to C on the ancestry column moves the query from a table scan to an index scan. In other cases, I have seen multiple OR clauses force a table scan, but luckily that is not the case for this test.

#!/usr/bin/env ruby

# basic_benchmark.rb
require 'active_record'
require 'ancestry'

ActiveRecord::Base.establish_connection(ENV.fetch('DATABASE_URL') { "postgres://localhost/ancestry_benchmark" })

ActiveRecord::Migration.verbose = false
ActiveRecord::Schema.define do
  create_table :users, force: true do |t|
    col_options = {}
    col_options[:collation] = ENV["COL"] if ENV["COL"].present?
    t.string :ancestry, **col_options
    t.string :name
  end
  add_index :users, :ancestry
end
STDERR.puts "#created schema"

class User < ActiveRecord::Base ; has_ancestry ; end

def create_tree(parent, name: 'Tree 1', count: 2, depth: 2)
  if parent.kind_of?(Class) # root
    root = parent.create(name: name)
    create_tree(root, name: name, count: count, depth: depth - 1)
  else
    count.times do |i|
      # parent_id: parent.id
      child = parent.children.create(name: "#{name}.#{i+1}")
      create_tree(child, name: child.name, count: count, depth: depth - 1) if depth > 1
    end
  end
end

create_tree(User, name: 'Tree 1', count: 8, depth: 4)
STDERR.puts "#created tree (#{User.count})"

root = User.roots.order(:id).last # root of the tree
puts root.subtree.explain

createdb ancestry_benchmark
DATABASE_URL=postgres://localhost/ancestry_benchmark ruby basic_benchmark.rb
DATABASE_URL=postgres://localhost/ancestry_benchmark COL=C ruby basic_benchmark.rb

#created schema
#created tree (585)
EXPLAIN for: SELECT "users".* FROM "users" WHERE ("users"."ancestry" LIKE '1/%' OR "users"."ancestry" = '1' OR "users"."id" = 1)
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Seq Scan on users  (cost=0.00..24.18 rows=9 width=72)
   Filter: (((ancestry)::text ~~ '1/%'::text) OR ((ancestry)::text = '1'::text) OR (id = 1))
(2 rows)

#created schema
#created tree (585)
EXPLAIN for: SELECT "users".* FROM "users" WHERE ("users"."ancestry" LIKE '1/%' OR "users"."ancestry" = '1' OR "users"."id" = 1)
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on users  (cost=12.91..23.50 rows=9 width=72)
   Recheck Cond: (((ancestry)::text ~~ '1/%'::text) OR ((ancestry)::text = '1'::text) OR (id = 1))
   Filter: (((ancestry)::text ~~ '1/%'::text) OR ((ancestry)::text = '1'::text) OR (id = 1))
   ->  BitmapOr  (cost=12.91..12.91 rows=9 width=0)
         ->  Bitmap Index Scan on index_users_on_ancestry  (cost=0.00..4.32 rows=4 width=0)
               Index Cond: (((ancestry)::text >= '1/'::text) AND ((ancestry)::text < '10'::text))
         ->  Bitmap Index Scan on index_users_on_ancestry  (cost=0.00..4.31 rows=4 width=0)
               Index Cond: ((ancestry)::text = '1'::text)
         ->  Bitmap Index Scan on users_pkey  (cost=0.00..4.28 rows=1 width=0)
               Index Cond: (id = 1)
(10 rows)

MySQL

MySQL is not as explicit in its EXPLAIN output, but it lists both the primary index for id and the index on ancestry as possible keys. This is the case both before and after the change.

While the tests still pass, it looks like MySQL is sending binary data for ancestry (note the x'312f25' literals below), so changing the column collation to binary for MySQL is probably incorrect.

It looks like setting the character set to ASCII may be the best approach, but that is not supported by Rails' schema DSL, and it would force developers onto structure.sql, which I would like to avoid.
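
For reference, a sketch of what that would take today (hypothetical; raw MySQL DDL via execute, since the schema DSL can't express a per-column character set, which is exactly what pushes you onto structure.sql):

class ForceAsciiAncestry < ActiveRecord::Migration[7.0]
  def up
    # per-column character set is not available through the schema DSL
    execute "ALTER TABLE users MODIFY ancestry VARCHAR(255) CHARACTER SET ascii"
  end
end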

mysql --host localhost --port 3306 -u root -e 'CREATE SCHEMA IF NOT EXISTS `ancestry_benchmark`;'
DATABASE_URL=mysql2://localhost/ancestry_benchmark ruby basic_benchmark.rb
DATABASE_URL=mysql2://localhost/ancestry_benchmark COL=binary ruby basic_benchmark.rb

#created schema
#created tree (585)
EXPLAIN for: SELECT `users`.* FROM `users` WHERE (`users`.`ancestry` LIKE '1/%' OR `users`.`ancestry` = '1' OR `users`.`id` = 1)
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys                   | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | users | NULL       | ALL  | PRIMARY,index_users_on_ancestry | NULL | NULL    | NULL |  585 |    11.41 | Using where |
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
1 row in set (0.00 sec)

#created schema
#created tree (585)
EXPLAIN for: SELECT `users`.* FROM `users` WHERE (`users`.`ancestry` LIKE x'312f25' OR `users`.`ancestry` = x'31' OR `users`.`id` = 1)
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys                   | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | users | NULL       | ALL  | PRIMARY,index_users_on_ancestry | NULL | NULL    | NULL |  585 |    11.41 | Using where |
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
1 row in set (0.00 sec)

kbrock commented 1 year ago

@kshnurov I misunderstood your comment and thought you said yes.

If you look at the hack that was removed from the test, the default collation was affecting the way Postgres sorted/compared the data. The Mac and RedHat builds of Postgres use a collation that takes symbols into consideration when comparing strings. Debian-based builds of Postgres use a non-standard collation that is case insensitive and does not take symbols into consideration, so it returns data in a different order.

But the order of the returned values is just a symptom of the underlying problem: the database is doing extra processing that renders the index ineffective for relative comparisons on the string column.

This can be fixed by any number of changes, including: column data type, character set, collation, and string comparison operators.

I feel collation is the simplest, most supported, and probably the best way to get LIKE 'a%' working again on Postgres.

It also seems simple for developers to understand: Unicode collation is complicated, so we sidestep it and just do a bytewise comparison. The column only contains numbers and symbols, or possibly ASCII letters in the UUID case, so a plain byte comparison works here.

Now on the MySQL front, maybe collation is not the correct answer. We either want to roll it back or possibly use latin1. The query plan doesn't make the comparison look that different. I have not used MySQL since Rails 1 or 2, so I hope you have an opinion on the best way forward here.
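
If latin1 does turn out to be the right MySQL answer, the column option would presumably be along these lines (untested sketch):

t.string :ancestry, collation: "latin1_bin"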

kshnurov commented 1 year ago

@kbrock why not just use the binary data type instead of collation? I've been using it with MySQL for a while; it works perfectly, and I can't see why it would be different with other DBs. It does exactly what we need: it stores and compares raw bytes.

kbrock commented 1 year ago

@kshnurov I'm all ears here.

I've seen people embed ancestry into form fields or manually manipulate the fields using Ruby, e.g.: ancestry + '/%', "#{ancestry}/%", split('/').map { |r| find(r).name }.join('/'). Will this work with a binary field?

Using an ASCII character set seems to make the most sense to me, but I guess I'm just not familiar with what a binary type is. Binary collation seemed strange to me as well. Using a binary data type does look very simple.

The only reservation I have is producing SQL that is not readable or is tricky to copy-paste.

If you think binary is the way to go, then it makes sense to change the test suite to specifically test with the preferred format.

kshnurov commented 1 year ago

Of course it will work. binary is a byte string in both MySQL and Postgres. It's still a string, and it looks and operates the same way:

[screenshot from 2023-02-21 omitted]

I'm using t.binary "ancestry", limit: 3000, null: false because MySQL has an index limit of 3072 bytes; you'll get an error if you exceed that. I'm not very familiar with PG, but it looks like the default limit is 2730 bytes. Also, a limit <= 4095 is needed so Rails creates a varbinary column instead of a blob. We could fix it at 2700; that's an ancestry depth of 300-600 with int primary keys and ~81 with UUIDs.
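
Putting that together, a sketch of the column definition kshnurov describes with the 2700-byte cap (users table name assumed):

create_table :users do |t|
  # varbinary(2700): bytewise comparison, safely under MySQL's 3072-byte
  # index limit and Rails' 4095-byte varbinary/blob cutoff
  t.binary :ancestry, limit: 2700, null: false
  t.index :ancestry
end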