Closed kbrock closed 1 year ago
@kbrock yeah, I've already looked at it a few times thinking serialize
or custom attribute type would make things better, but realized it'll just save like 20 lines of current code (a lot has changed since #481) in one place by adding 50-100 in another and making things slower since most of the time we don't need to deserialize ancestry to work with it.
Methods like descendants
, children
and subtree
use raw string. Rails would fire deserialization on every init, whereas right now it's only done when needed.
Associations (belongs_to :parent
) won't work with STI anyway.
I might be missing something, what benefits of using custom type do you see?
Yea. Even though it doesn't feel like that much has changed, it has changed enough to make rebasing that PR basically impossible. I've tried quite a few times and failed.
I want a database friendly ancestry column. I want rails associations for the rails relationships.
I really don't like the current process:
nodes = Model.where(:important => true) #
ancestry = nodes.map(&:ancestry) # load src records/ (need: ancestry and id)
ancestor_ids = ancestry.map(&:split('/')) # decode
# a
parent_ids = ancestor_ids.map(&:last) # manipulate
parents = Model.where(:id => parent_ids) # load tgt records
# b
parent_ancestry = ancestor_ids.map(&:clone).map(&:pop) # manipulate
parents = Model.where(:ancestry => parent_ancestry) # load tgt records
I would prefer:
nodes = Model.where(:important => true) # scope
node_roots = Model.where(:id => ANCESTRY_ROOT_OF(nodes)) # load tgt records
==> To change this process, the column needs to be easier to manipulate in the database or in a format that the database understands. Bonus points if you could just use ancestry and not need the id column as well. Bonus points if the database understands the format represented a tree.
==> So this requires we make the ancestry column storage format more database friendly.
Which requires probably changing the serialization of ancestry
.
I am curious about using a postgres array, LTree, MaterializedPath2, indexing string_to_array(ancestry)
, or a path that includes it's own id.
That may buy joins
, includes
, preloading
, better querying, easier leaf detection, performance improvements.
But the ruby manipulation of the column, and the knowledge of the serialization format is so central to the code. So modifying the serialization format feels too difficult.
In my mind, moving the delimiter and serialization mechanism into a single (standard?) rails concept and out of the bulk of the code is a good first step to changing serialization and getting us to sql queries.
This also buys us standard dirty code (i.e.: ancestry_changed?
), standard validation, and possibly handling multiple hierarchies in a single node.
If the serialization format is accessible via arel, then there may be a way to converting the associations (e.g.: children
) from a scope into a real association, at least for a few of the associations.
@kbrock I'd like that too, but there's no database-friendly data type for ancestry that'll allow joins, includes and all that stuff. DB array/json search definitely won't be faster than a LIKE '..%'
on an indexed binary string, and rails doesn't support db array/json search in associations anyway. The only way you can do joins with rails is by having an id in a separate column.
There are 2 efficient approaches to ancestry:
In terms of performance and querying, neither of them is always better or faster. It really depends on your use case.
If, for some reason, you really need joins on root/parent, you have 2 options:
root_id bigint GENERATED ALWAYS AS split_part(ancestry, '/', 2)
for postgresql, root_id BIGINT GENERATED ALWAYS AS SUBSTRING_INDEX(ancestry,'/',2)
for mysql) and use it with belongs_to :root
.When I have the ancestry value and want to find related records, then a binary search on the ancestry column is pretty great.
But I think you agree that the materialized path 2 encoding works better than materialize path with no disadvantages. It is possible that there are some other formats that have other tradeoffs.
There are a number of presentations talking about the pros/cons closure trees, materialized path, adjacency lists, nested sets. None of them explore the implementation of the materialized path and ways to mitigate some of the disadvantages.
The 2 use cases that I would like to explore is when you start with an id or when you start with a scope. Is it possible to efficiently bring back a relation (children
). These are the use cases that I would like to explore and understand the pros/cons.
A serialized column may be the wrong way to get flexibility with the format of the ancestry column.
I wonder if it is possible to structure the code in such a way that the library would support any of these schemes, and allow developers to come up with more. Again, I'm sticking with materialized path, just not in the form of id/id/id
.
I haven't heard of any other useful and efficient scheme. Do you have a real use case where you need that? Cause I don't and I can't imagine one. I think materializedpath2 should be even made the default and the only one available.
Single ancestry doesn't need any kind of prefix, just use a separate column, that'll be faster and easier.
Multiple ancestries? If I needed that I'd go with the database approach and store all ancestry in a separate table, that way I can have any amount of ancestries per object, instead of being limited by particular number and having to juggle with ancestry_1
, ancestry_2
and so on.
I agree with you that materializedpath2
should be the default.
The books, blog posts, and presentations all seem to use the "id/id"
format (without the trailing slash) regardless of database or programming language. That seems like a major oversight with a very logical reason why the trailing slash is needed. For postgres, the presence of a single OR
tends to tell the query optimizer that a table scan is preferred.
I think a leading slash is not important for the LIKE
clause, but I do wonder the difference of storing an ancestry
of NULL
vs ""
vs "/"
for the root node, but not enough to run tests.
Personally, I need more sql friendly formats. I have one 30k row table with a relatively narrow ancestry, and another 100k-ish table with very wide ancestry trees. But bringing back the data to then query the database hits very hard. Pagination causes more issues. Building the includes
queries with a CASE
in sql have bought us a considerable performance boost.
I really wonder about the LTree datatype. I guess it makes sense that it does not interest you as much since you are more MySql focused. But it is a native tree type that is very similar to the materialized path format. It just uses a delimiter of '.'
, and adds a few special case operators. To make this work, I would need more flexible sql generation. Almost doesn't feel possible with the current code base.
I'm also curious about storing an ancestry
of "/parent_id/my_id"
. Yes, we see the downside of requiring 2 writes for every create. But reading, deleting, and re-parenting is cheap. Seeing how much I emphasize removing any OR
clause, probably makes this idea no surprise for you. (not saying you agree, just that you can see why I propose it)
I've been hesitant to use arrays, but really feel they have some benefits. It would allow a GiN index (inverted index - again postgres only). and relatively direct joins. The keys in the path would be an actual integer rather than requiring casts like ::integer
in the code.
I think a leading slash is not important for the LIKE clause, but I do wonder the difference of storing an ancestry of NULL vs "" vs "/" for the root node, but not enough to run tests.
Storing difference is 1 byte, but querying difference is huge - LIKE
vs LIKE OR IS NULL
.
100k-ish table with very wide ancestry trees. But bringing back the data to then query the database hits very hard
Sounds like you're doing something wrong, shouldn't be slow. Are you using materializedpath2 and a binary field with an index on it? Can you show a code/query example?
I really wonder about the LTree datatype
There's a ltree_hierarchy gem for that. I don't see how you'll be able to use it with Rails without custom SQL, and my guess is that a binary string index still will be faster.
I'm also curious about storing an ancestry of "/parent_id/my_id". Yes, we see the downside of requiring 2 writes for every create. But reading, deleting, and re-parenting is cheap.
I don't understand the benefits, can you explain them? Deleting/re-parenting by primary id will be way faster.
I've been hesitant to use arrays, but really feel they have some benefits. It would allow a GiN index (inverted index - again postgres only). and relatively direct joins.
Not sure if it'll be faster, but anyway - there's no Rails support for that, and it's database dependent.
I'm also curious about storing an ancestry of "/parent_id/my_id". Yes, we see the downside of requiring 2 writes for every create. But reading, deleting, and re-parenting is cheap.
I don't understand the benefits, can you explain them? Deleting/re-parenting by primary id will be way faster.
I should have typed "/$parent_id/$id/"
with the trailing /
I do see what you mean by using a primary id
remove OR when comparing with id or ancestry
Which OR? The only one I can think of is the one in subtree
, and I don't see any problem with it, primary key index is way faster than any indexed string.
only ever deal with the ancestry column
Yeah, and you won't even be able to use an index, LIKE "%/id/"
can't use indexes when % is on the left.
Just fetching a record will give you a full table scan instead of a fast primary key index. What's the point?
reparenting is just a gsub.
gsub
on all records in a database? Good luck with that, it'll be thousands of times slower than UPDATE ... SET ancestry=X WHERE id IN(...)
I don't see any benefits in that approach, but I see a ton of performance issues.
You are right, maybe that idea was a bad one? But even so, it does look like order(ancestry)
does have some benefits, and having child_ancestry == ancestry
may have some benefit, as well. So while I too am skeptical, it looks like it does have some advantages. Not sure about the hit for disadvantages, especially for a read heavy, write light model. I would have to look further to properly understand the down sides to know if it could be worth it. But I do know that there are a few cases where /ancestry/id
does things that /ancestry/, id
can not.
I haven't heard of any other useful and efficient scheme. Do you have a real use case where you need that? Cause I don't and I can't imagine one.
Efficient means it is quick in the things that you do often, and possibly slow in the things that you don't do (often). There are trade offs for every implementation.
I want ancestry to perform better in my use cases, so I have a need.
And Yes, I have already shared a few tweaks to the materialized path pattern. So I think I have established that I have imagined more than one. Maybe some of them are bad. Only one way to find out.
I did come up with materialized_path2 which you seem to like. I have faith that there are some other improvements out there that work for me (or you).
Associations (
belongs_to :parent
) won't work with STI anyway.
Well, not sure about STI being the problem. I think it is the fact that parent_id
is not an actual column.
But I think I may have a way around this
I am not suggesting dropping all uses of id
. Of course, where(id: 1)
is quicker than a LIKE '%/1'
.
I mentioned this elsewhere and want to bring it back into here so this conversation is complete.
There are benefits when a record is grouped with the child records. They are ordered in the order we want to display the records (assuming you want order by created_at
).
order(:ancestry, :id)
: 1, 2, 3, 4
order("ancestry || id")
: 1, 4, 2, 3
The last one does not need to arrange
in order to be useful.
@kshnurov I stalled on #481 and have a question for you:
When dealing with
options
, the serialization format of the options in the database is not really a concern. We only deal withoptions
as aHash
, and we ignore the database serialization format.When dealing with ancestry, we are aware of the 2 concepts: the int[] data (i.e.:
ancestor_ids
) and the string serialized data (i.e.:ancestry
).When I implemented the serializer, the code got much simpler. But rails pushes towards having the database string
ancestry
as an int[] throughout the code. This made the code very incompatible with anyone attempting to monkey patch and really use the code.Do you have logistic questions on how to deal with merging
ancestry
andancestor_ids
concepts?