Closed kbrock closed 1 year ago
@kshnurov You have anything to add?
I already had this PR in the works and was happy when you mentioned binary string comparison and indexes in our other discussion.
I agree binary should be the default and always used, otherwise you lose big on performance.
What's the issue with postgres sorting? Binary just compares bytes.
By default, people are using utf8 for the `ancestry` column, so they are not using an index for `LIKE`. Using that index is the main draw for this whole format in the first place.
Probably need to document the case where people can migrate from a utf8 collated column to an ascii column.
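A hedged sketch of what that migration could look like on Postgres (the `users` table name, migration class, and Rails version here are illustrative, not from this PR):

```ruby
require 'active_record'

class ChangeAncestryCollation < ActiveRecord::Migration[7.0]
  def up
    # The "C" collation compares raw bytes, so LIKE 'prefix%' can use
    # the plain btree index on the ancestry column.
    change_column :users, :ancestry, :string, collation: "C"
  end

  def down
    # "default" is Postgres's name for the database's default collation.
    change_column :users, :ancestry, :string, collation: "default"
  end
end
```

Running it is a one-off cost; after that, existing `LIKE '#{ancestry}/%'` queries should pick up the index with no application changes.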
@kbrock why are you mentioning postgres? Can you reproduce the issues you're talking about?
Also, there's no need to use collation, there's a binary type: `add_column :table, :ancestry, :binary`
@kbrock what's the rush with merging something that isn't complete or correct?
Thank you for asking me to show my work. I was just focused on Postgres.
Setting the column collation to `C` on the `ancestry` column goes from a table scan to an index scan. In other cases, I have seen multiple `OR` clauses force a table scan, but luckily that is not the case for this test.
```ruby
#!/usr/bin/env ruby
# basic_benchmark.rb
require 'active_record'
require 'ancestry'

ActiveRecord::Base.establish_connection(ENV.fetch('DATABASE_URL') { "postgres://localhost/ancestry_benchmark" })
ActiveRecord::Migration.verbose = false
ActiveRecord::Schema.define do
  create_table :users, force: true do |t|
    col_options = {}
    col_options[:collation] = ENV["COL"] if ENV["COL"].present?
    t.string :ancestry, **col_options
    t.string :name
  end
  add_index :users, :ancestry
end
STDERR.puts "#created schema"

class User < ActiveRecord::Base ; has_ancestry ; end

def create_tree(parent, name: 'Tree 1', count: 2, depth: 2)
  if parent.kind_of?(Class) # root
    root = parent.create(name: name)
    create_tree(root, name: name, count: count, depth: depth - 1)
  else
    count.times do |i|
      child = parent.children.create(name: "#{name}.#{i + 1}")
      create_tree(child, name: child.name, count: count, depth: depth - 1) if depth > 1
    end
  end
end

create_tree(User, name: 'Tree 1', count: 8, depth: 4)
STDERR.puts "#created tree (#{User.count})"

root = User.roots.order(:id).last # root of tree
puts root.subtree.explain
```
```sh
createdb ancestry_benchmark
DATABASE_URL=postgres://localhost/ancestry_benchmark ruby basic_benchmark.rb
DATABASE_URL=postgres://localhost/ancestry_benchmark COL=C ruby basic_benchmark.rb
```
```
#created schema
#created tree (585)
EXPLAIN for: SELECT "users".* FROM "users" WHERE ("users"."ancestry" LIKE '1/%' OR "users"."ancestry" = '1' OR "users"."id" = 1)
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Seq Scan on users  (cost=0.00..24.18 rows=9 width=72)
   Filter: (((ancestry)::text ~~ '1/%'::text) OR ((ancestry)::text = '1'::text) OR (id = 1))
(2 rows)
```

```
#created schema
#created tree (585)
EXPLAIN for: SELECT "users".* FROM "users" WHERE ("users"."ancestry" LIKE '1/%' OR "users"."ancestry" = '1' OR "users"."id" = 1)
                                             QUERY PLAN
---------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on users  (cost=12.91..23.50 rows=9 width=72)
   Recheck Cond: (((ancestry)::text ~~ '1/%'::text) OR ((ancestry)::text = '1'::text) OR (id = 1))
   Filter: (((ancestry)::text ~~ '1/%'::text) OR ((ancestry)::text = '1'::text) OR (id = 1))
   ->  BitmapOr  (cost=12.91..12.91 rows=9 width=0)
         ->  Bitmap Index Scan on index_users_on_ancestry  (cost=0.00..4.32 rows=4 width=0)
               Index Cond: (((ancestry)::text >= '1/'::text) AND ((ancestry)::text < '10'::text))
         ->  Bitmap Index Scan on index_users_on_ancestry  (cost=0.00..4.31 rows=4 width=0)
               Index Cond: ((ancestry)::text = '1'::text)
         ->  Bitmap Index Scan on users_pkey  (cost=0.00..4.28 rows=1 width=0)
               Index Cond: (id = 1)
(10 rows)
```
MySQL is not as explicit in explaining the query, but we know that it is using the primary index for `id` and the index on `ancestry`. This is the case both before and after the changes.
While the tests are still passing, it looks like MySQL is sending binary data for `ancestry`, so changing the column collation to `binary` is probably incorrect for MySQL.
It looks like setting the character set to `ascii` may be the best approach, but that is not supported by Rails and it would force developers to use `structure.sql`, which I would like to avoid.
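For the record, going the ascii-charset route would mean dropping to raw SQL in a migration, roughly like the hypothetical sketch below (table name, column size, and migration class are all illustrative, and this is exactly the SQL-schema-dump path I'd rather avoid):

```ruby
require 'active_record'

# Hypothetical migration: the Rails schema DSL has no per-column
# CHARACTER SET option for MySQL, so raw SQL is the only way in.
class ConvertAncestryToAscii < ActiveRecord::Migration[7.0]
  def up
    # ascii_bin compares raw bytes, so LIKE 'prefix%' stays index-friendly.
    execute <<~SQL
      ALTER TABLE users
        MODIFY ancestry VARCHAR(255) CHARACTER SET ascii COLLATE ascii_bin
    SQL
  end

  def down
    execute <<~SQL
      ALTER TABLE users
        MODIFY ancestry VARCHAR(255) CHARACTER SET utf8mb4
    SQL
  end
end
```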
```sh
mysql --host localhost --port 3306 -u root -e 'CREATE SCHEMA IF NOT EXISTS `ancestry_benchmark`;'
DATABASE_URL=mysql2://localhost/ancestry_benchmark ruby basic_benchmark.rb
DATABASE_URL=mysql2://localhost/ancestry_benchmark COL=binary ruby basic_benchmark.rb
```
```
#created schema
#created tree (585)
EXPLAIN for: SELECT `users`.* FROM `users` WHERE (`users`.`ancestry` LIKE '1/%' OR `users`.`ancestry` = '1' OR `users`.`id` = 1)
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys                   | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | users | NULL       | ALL  | PRIMARY,index_users_on_ancestry | NULL | NULL    | NULL |  585 |    11.41 | Using where |
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
1 row in set (0.00 sec)
```

```
#created schema
#created tree (585)
EXPLAIN for: SELECT `users`.* FROM `users` WHERE (`users`.`ancestry` LIKE x'312f25' OR `users`.`ancestry` = x'31' OR `users`.`id` = 1)
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys                   | key  | key_len | ref  | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
|  1 | SIMPLE      | users | NULL       | ALL  | PRIMARY,index_users_on_ancestry | NULL | NULL    | NULL |  585 |    11.41 | Using where |
+----+-------------+-------+------------+------+---------------------------------+------+---------+------+------+----------+-------------+
1 row in set (0.00 sec)
```
@kshnurov I misunderstood your comment and thought you said yes.
If you notice the hack from the test that was removed, the default collation was affecting the way that Postgres sorted/compared the data. The Mac and RedHat versions of Postgres use a collation that takes symbols into consideration when comparing strings. Debian-based versions of Postgres have a non-standard collation: it is case-insensitive and does not take symbols into consideration, so it returns data in a different order.
But the order of the returned values is just a symptom of the underlying problem. The database is doing extra processing that renders the index ineffective for relative comparisons on the string index.
This can be fixed by any number of changes including: column data type, character set, collation, and string comparison operators.
I feel collation is the simplest, most supported, and probably the best way to get `LIKE 'a%'` working again on Postgres.
It seems simple for developers to understand: Unicode is complicated, so we simplify and just do a bytewise comparison. The column only contains numbers and symbols, or possibly ascii letters in the uuid case, so a byte comparison works here.
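To illustrate what bytewise comparison buys us, here is a plain-Ruby sketch (Ruby's `String#<=>` already compares byte by byte, which is what the `C` collation makes Postgres do):

```ruby
# '/' (0x2f) sorts before '0' (0x30) in byte order, so every path
# under "1/" lands in the contiguous range '1/' <= path < '10' --
# which is why the planner can turn LIKE '1/%' into an index range scan.
paths = ["10", "1/2", "1/10"]
puts paths.sort.inspect                # => ["1/10", "1/2", "10"]
puts("1/" <= "1/2" && "1/2" < "10")    # => true
```

Under a locale-aware collation that skips punctuation, `"1/2"` would compare like `"12"` and the children of node 1 would no longer form one contiguous index range.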
Now on the MySQL front, maybe collation is not the correct answer. We either want to roll it back or possibly use `latin1`. The query plan doesn't make the comparison look that different. I have not used MySQL since Rails 1 or 2, so I hope you have an opinion on the best way to move forward here.
@kbrock why not just use the `binary` data type instead of collation? I've been using it with MySQL for a while, works perfectly, can't see why it would be different with other DBs. It does exactly what we need: stores/compares raw bytes.
@kshnurov I'm all ears here.
I've seen people embed `ancestry` into form fields or manually manipulate the fields using Ruby, e.g.: `ancestry + '/%'`, `"#{ancestry}/%"`, `split('/').map { |r| find(r).name }.join('/')`. Will this work with a binary field?
Using the `ascii` character set seems to make the most sense to me, but I guess I'm just not familiar with what a `binary` type is. The `binary` collation seemed strange to me as well. Using a `binary` data type sure looks very simple.
The only reservation I have is producing SQL that is not readable or is tricky to copy/paste.
If you think binary is the way to go, then it makes sense to change the test suite to specifically test with the preferred format.
Of course it will work. `binary` is a byte string in both MySQL and Postgres. It's still a string, and it looks and operates the same way.
I'm using `t.binary "ancestry", limit: 3000, null: false` because MySQL has an index limit of 3072 bytes; you'll get an error if you exceed that. I'm not very familiar with PG, but it looks like the default limit there is 2730 bytes.
Also, a limit <= 4095 is needed so Rails will create a `varbinary` column instead of a `blob`.
We can cap that at 2700; that's an ancestry depth of 300-600 with int primary keys and ~81 with uuids.
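A back-of-the-envelope check on those depth numbers (`max_depth` is just a helper for this comment, assuming each path segment costs its id length plus one byte for the `/` delimiter):

```ruby
# Rough maximum tree depth for a given column byte limit,
# assuming fixed-width ids joined by '/'.
def max_depth(byte_limit, id_length)
  byte_limit / (id_length + 1)
end

puts max_depth(2700, 8)   # => 300  (8-digit integer ids)
puts max_depth(2700, 4)   # => 540  (4-digit integer ids)
puts max_depth(2700, 32)  # => 81   (32-char undashed uuids)
```

Real integer ids vary in width, which is why the int case is a 300-600 range rather than one number.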
Add documentation and tests for users to set collation on the `ancestry` column.
We need to encourage users to use a binary/ascii collation for the `ancestry` column, otherwise they are losing the main benefit of materialized path: specifically, the fact that we use an index for `ancestry LIKE '$ancestry/%'`.
I'm hoping we can drop `ENV["ANCESTRY_COLLATION"]`, which is there because Canonical's collation ignores punctuation when sorting/comparing strings. Getting us to use the proper indexes means all systems will behave in a more consistent manner.