Make it easier to change the serialization format for the ancestry column

stefankroes / ancestry

Organise ActiveRecord model into a tree structure

MIT License

3.72k stars 458 forks source link

Make it easier to change the serialization format for the ancestry column #599

Closed kbrock closed 1 year ago

kbrock commented 1 year ago

@kshnurov I stalled on #481 and have a question for you:

class Model < Ar:Base
  serialize :options, Hash
  # ====== HERE
  serialize :ancestry, MaterializedPathSerializer # or :ancestor_ids
  # ====== HERE

  ancestry_attribute :root_id,    :field => :ancestry, :offset => 0
  ancestry_attribute :parent_id,  :field => :ancestry, :offset => -1

  belongs_to_array :ancestors # use ancestor_ids array as a list of records
  belongs_to       :root
  belongs_to       :parent
  has_many         :siblings, :foreign_key => :parent_id
end

When dealing with options, the serialization format of the options in the database is not really a concern. We only deal with options as a Hash, and we ignore the database serialization format.

When dealing with ancestry, we are aware of the 2 concepts: the int[] data (i.e.: ancestor_ids) and the string serialized data (i.e.: ancestry).

When I implemented the serializer, the code got much simpler. But rails pushes towards having the database string ancestry as an int[] throughout the code. This made the code very incompatible with anyone attempting to monkey patch and really use the code.

Do you have logistic questions on how to deal with merging ancestry and ancestor_ids concepts?

kshnurov commented 1 year ago

@kbrock yeah, I've already looked at it a few times thinking serialize or custom attribute type would make things better, but realized it'll just save like 20 lines of current code (a lot has changed since #481) in one place by adding 50-100 in another and making things slower since most of the time we don't need to deserialize ancestry to work with it. Methods like descendants, children and subtree use raw string. Rails would fire deserialization on every init, whereas right now it's only done when needed.

Associations (belongs_to :parent) won't work with STI anyway.

I might be missing something, what benefits of using custom type do you see?

kbrock commented 1 year ago

Yea. Even though it doesn't feel like that much has changed, it has changed enough to make rebasing that PR basically impossible. I've tried quite a few times and failed.

I want a database friendly ancestry column. I want rails associations for the rails relationships.

I really don't like the current process:

nodes        = Model.where(:important => true)  #
ancestry     = nodes.map(&:ancestry)            # load src records/ (need: ancestry and id)
ancestor_ids = ancestry.map(&:split('/'))       # decode
# a
parent_ids   = ancestor_ids.map(&:last)         # manipulate
parents      = Model.where(:id => parent_ids)   # load tgt records
# b
parent_ancestry = ancestor_ids.map(&:clone).map(&:pop) # manipulate
parents    = Model.where(:ancestry => parent_ancestry) # load tgt records

I would prefer:

nodes = Model.where(:important => true) # scope 
node_roots = Model.where(:id => ANCESTRY_ROOT_OF(nodes)) # load tgt records

==> To change this process, the column needs to be easier to manipulate in the database or in a format that the database understands. Bonus points if you could just use ancestry and not need the id column as well. Bonus points if the database understands the format represented a tree.

==> So this requires we make the ancestry column storage format more database friendly. Which requires probably changing the serialization of ancestry. I am curious about using a postgres array, LTree, MaterializedPath2, indexing string_to_array(ancestry), or a path that includes it's own id.

That may buy joins, includes, preloading, better querying, easier leaf detection, performance improvements.

But the ruby manipulation of the column, and the knowledge of the serialization format is so central to the code. So modifying the serialization format feels too difficult.

In my mind, moving the delimiter and serialization mechanism into a single (standard?) rails concept and out of the bulk of the code is a good first step to changing serialization and getting us to sql queries.

This also buys us standard dirty code (i.e.: ancestry_changed?), standard validation, and possibly handling multiple hierarchies in a single node.

If the serialization format is accessible via arel, then there may be a way to converting the associations (e.g.: children) from a scope into a real association, at least for a few of the associations.

kshnurov commented 1 year ago

@kbrock I'd like that too, but there's no database-friendly data type for ancestry that'll allow joins, includes and all that stuff. DB array/json search definitely won't be faster than a LIKE '..%' on an indexed binary string, and rails doesn't support db array/json search in associations anyway. The only way you can do joins with rails is by having an id in a separate column.

There are 2 efficient approaches to ancestry:

Storing it in a string. Pros:
- you can get the whole (sub)tree, or any part of it, or all siblings in 1 simple and fast query
- you can easily get any or all parents without additional queries - you know their ids and order right away
- you can instantly know element's depth
- you don't need any extra tables Cons:
- you have to validate ancestry
- you can't do joins This is what this gem does.
Storing it in a separate table: Pros:
- you can do joins
- database foreign keys will do validations for you Cons:
- joins on each query
- ancestry takes more DB space This is what closure_tree does.

In terms of performance and querying, neither of them is always better or faster. It really depends on your use case.

If, for some reason, you really need joins on root/parent, you have 2 options:

Use the second approach
Add a generated column to your DB which will extract root_id/parent_id from ancestry string (with MaterializedPath2: root_id bigint GENERATED ALWAYS AS split_part(ancestry, '/', 2) for postgresql, root_id BIGINT GENERATED ALWAYS AS SUBSTRING_INDEX(ancestry,'/',2) for mysql) and use it with belongs_to :root.

kbrock commented 1 year ago

When I have the ancestry value and want to find related records, then a binary search on the ancestry column is pretty great.

But I think you agree that the materialized path 2 encoding works better than materialize path with no disadvantages. It is possible that there are some other formats that have other tradeoffs.

There are a number of presentations talking about the pros/cons closure trees, materialized path, adjacency lists, nested sets. None of them explore the implementation of the materialized path and ways to mitigate some of the disadvantages.

The 2 use cases that I would like to explore is when you start with an id or when you start with a scope. Is it possible to efficiently bring back a relation (children). These are the use cases that I would like to explore and understand the pros/cons.

A serialized column may be the wrong way to get flexibility with the format of the ancestry column.

I wonder if it is possible to structure the code in such a way that the library would support any of these schemes, and allow developers to come up with more. Again, I'm sticking with materialized path, just not in the form of id/id/id.

kshnurov commented 1 year ago

I haven't heard of any other useful and efficient scheme. Do you have a real use case where you need that? Cause I don't and I can't imagine one. I think materializedpath2 should be even made the default and the only one available.

Single ancestry doesn't need any kind of prefix, just use a separate column, that'll be faster and easier. Multiple ancestries? If I needed that I'd go with the database approach and store all ancestry in a separate table, that way I can have any amount of ancestries per object, instead of being limited by particular number and having to juggle with ancestry_1, ancestry_2 and so on.

kbrock commented 1 year ago

I agree with you that materializedpath2 should be the default.

The books, blog posts, and presentations all seem to use the "id/id" format (without the trailing slash) regardless of database or programming language. That seems like a major oversight with a very logical reason why the trailing slash is needed. For postgres, the presence of a single OR tends to tell the query optimizer that a table scan is preferred.

I think a leading slash is not important for the LIKE clause, but I do wonder the difference of storing an ancestry of NULL vs "" vs "/" for the root node, but not enough to run tests.

Personally, I need more sql friendly formats. I have one 30k row table with a relatively narrow ancestry, and another 100k-ish table with very wide ancestry trees. But bringing back the data to then query the database hits very hard. Pagination causes more issues. Building the includes queries with a CASE in sql have bought us a considerable performance boost.

I really wonder about the LTree datatype. I guess it makes sense that it does not interest you as much since you are more MySql focused. But it is a native tree type that is very similar to the materialized path format. It just uses a delimiter of '.', and adds a few special case operators. To make this work, I would need more flexible sql generation. Almost doesn't feel possible with the current code base.

I'm also curious about storing an ancestry of "/parent_id/my_id". Yes, we see the downside of requiring 2 writes for every create. But reading, deleting, and re-parenting is cheap. Seeing how much I emphasize removing any OR clause, probably makes this idea no surprise for you. (not saying you agree, just that you can see why I propose it)

I've been hesitant to use arrays, but really feel they have some benefits. It would allow a GiN index (inverted index - again postgres only). and relatively direct joins. The keys in the path would be an actual integer rather than requiring casts like ::integer in the code.

kshnurov commented 1 year ago

I think a leading slash is not important for the LIKE clause, but I do wonder the difference of storing an ancestry of NULL vs "" vs "/" for the root node, but not enough to run tests.

Storing difference is 1 byte, but querying difference is huge - LIKE vs LIKE OR IS NULL.

100k-ish table with very wide ancestry trees. But bringing back the data to then query the database hits very hard

Sounds like you're doing something wrong, shouldn't be slow. Are you using materializedpath2 and a binary field with an index on it? Can you show a code/query example?

I really wonder about the LTree datatype

There's a ltree_hierarchy gem for that. I don't see how you'll be able to use it with Rails without custom SQL, and my guess is that a binary string index still will be faster.

I'm also curious about storing an ancestry of "/parent_id/my_id". Yes, we see the downside of requiring 2 writes for every create. But reading, deleting, and re-parenting is cheap.

I don't understand the benefits, can you explain them? Deleting/re-parenting by primary id will be way faster.

I've been hesitant to use arrays, but really feel they have some benefits. It would allow a GiN index (inverted index - again postgres only). and relatively direct joins.

Not sure if it'll be faster, but anyway - there's no Rails support for that, and it's database dependent.

kbrock commented 1 year ago

I'm also curious about storing an ancestry of "/parent_id/my_id". Yes, we see the downside of requiring 2 writes for every create. But reading, deleting, and re-parenting is cheap.

I don't understand the benefits, can you explain them? Deleting/re-parenting by primary id will be way faster.

I should have typed "/$parent_id/$id/" with the trailing /

Index will be unique (not much of a win)
remove OR when comparing with id or ancestry
only ever deal with the ancestry column.
reparenting is just a gsub.

I do see what you mean by using a primary id

kshnurov commented 1 year ago

remove OR when comparing with id or ancestry

Which OR? The only one I can think of is the one in subtree, and I don't see any problem with it, primary key index is way faster than any indexed string.

only ever deal with the ancestry column

Yeah, and you won't even be able to use an index, LIKE "%/id/" can't use indexes when % is on the left. Just fetching a record will give you a full table scan instead of a fast primary key index. What's the point?

reparenting is just a gsub.

gsub on all records in a database? Good luck with that, it'll be thousands of times slower than UPDATE ... SET ancestry=X WHERE id IN(...)

I don't see any benefits in that approach, but I see a ton of performance issues.

kbrock commented 1 year ago

You are right, maybe that idea was a bad one? But even so, it does look like order(ancestry) does have some benefits, and having child_ancestry == ancestry may have some benefit, as well. So while I too am skeptical, it looks like it does have some advantages. Not sure about the hit for disadvantages, especially for a read heavy, write light model. I would have to look further to properly understand the down sides to know if it could be worth it. But I do know that there are a few cases where /ancestry/id does things that /ancestry/, id can not.

I haven't heard of any other useful and efficient scheme. Do you have a real use case where you need that? Cause I don't and I can't imagine one.

Efficient means it is quick in the things that you do often, and possibly slow in the things that you don't do (often). There are trade offs for every implementation.

I want ancestry to perform better in my use cases, so I have a need.

And Yes, I have already shared a few tweaks to the materialized path pattern. So I think I have established that I have imagined more than one. Maybe some of them are bad. Only one way to find out.

I did come up with materialized_path2 which you seem to like. I have faith that there are some other improvements out there that work for me (or you).

Associations (belongs_to :parent) won't work with STI anyway.

Well, not sure about STI being the problem. I think it is the fact that parent_id is not an actual column.

But I think I may have a way around this

kbrock commented 1 year ago

I am not suggesting dropping all uses of id. Of course, where(id: 1) is quicker than a LIKE '%/1'.

I mentioned this elsewhere and want to bring it back into here so this conversation is complete.

There are benefits when a record is grouped with the child records. They are ordered in the order we want to display the records (assuming you want order by created_at).

1
- 4
2
3

order(:ancestry, :id): 1, 2, 3, 4 order("ancestry || id"): 1, 4, 2, 3

The last one does not need to arrange in order to be useful.