networktocode / diffsync

A utility library for comparing and synchronizing different datasets.
https://diffsync.readthedocs.io/
Other
155 stars 26 forks source link

Refactor parent/child relationship #255

Open jamesharr opened 1 year ago

jamesharr commented 1 year ago

Environment

Proposed Functionality

Currently in DiffSync, the tree hierarchy has shown a few limitations, most notably that objects can not form a tree using the same type. For example, a user cannot form a hierarchy of Locations (Site -> Building -> Floor -> Room). Additionally, this parent/child relationship must be declared when creating the Models, which might not be entirely necessary.

Adapting to a new tree implementation may resolve several outstanding feature requests, such as #239 and #225, as well as allow for the implementation of #122 and a get_parent() method which does not exist today.

This change proposes:

Use Case

# Note that MyModelA & MyModelB declarations would not include _children

be1 = MyBackEnd()
a1 = MyModelA(...)
a2 = MyModelA(...)
b3 = MyModelB(...)
b4 = MyModelB(...)

be1.add(a1, parent=None)  # Note that parent=None is the default
be1.add(a2, parent=a1)  # Specify a parent object
be1.add(b3, parent=a2)  # Specify a parent
be1.add(b4, parent=None)  # Another object of the same type, but without a parent

# Getter methods
be1.get_parent(a2)  # Will return a1
be1.get_parent(a1)  # Will return None; though it could also raise an Exception
be1.get_children(a2)  # Will return something that iterable (that happens to only have b3 in it)
be1.get_children(b4)  # Will return an empty iterator/list

# Alternate signatures to support Model + ids style signatures
# IE when you don't have the original object.
be1.add(a2, parent_model=MyModelA, parent_ids={"k": "v", ...})
be1.get_parent(model=MyModelA, ids={"k": "v", ...})
be1.get_children(model=MyModelA, ids={"k": "v", ...})

One thing I haven't figured out entirely is if it makes sense to support moving an object between parents without delete/recreate

Implementation Thoughts

I've been thinking about the best place to store the parent/child relationship information. I think it's actually better to store them purely in the backend (in a dedicated tree structure) vs having the DiffSyncModel be aware of the parent/child relationships. Using this method would keep DiffSyncModel focused on storing data instead of getting involved in part of the tree.

@dataclass
class TreeNode:
    """Internal data structure to represent an object in a tree"""
    object: DiffSyncModel
    parent: None | DiffSyncModel = None
    children: None | Dict[str, Dict[str, DiffSyncModel]] = None  # children[model_name][identifier] = Object

    def add_child(self: Node, child: DiffSyncModel) -> TreeNode:
    """Add a tree to a node"""
        n = TreeNode(object=child, parent=self)
    self.children = parent.children or {}  # or use defaultdict()
    self.children.setdefault(child._modelname, {})  # or use defaultdict()
    self.children[child._modelname][child.get_identifier()] = n
        return n

# How TreeNode could be represented in a backend
class DiffSync:
    top_level_nodes: Dict[str, Dict[str, TreeNode]]   # top_level_nodes[model][identifier] = TreeNode(...)
    all_nodes: Dict[str, Dict[str, TreeNode]]  # all_nodes[model][identifier] = TreeNode(...)

    # alternative - use a synthetic root node to reduce the number of special-cases for top-level objects
    root_node: TreeNode(object=None)
    all_nodes: Dict[str, Dict[str, TreeNode]]  # all_nodes[model][identifier] = TreeNode(...)

    # The root_node / top_level_nodes would establish only the top-level nodes
    # The all_nodes would allow for get_parent / get_children method to quickly locate an object in the tree,
    # as well as allow DiffSync to enforce key-uniqueness.

I'm going a bit off-topic, but instead of TreeNode storing a reference to a DiffSyncModel, it could instead store a reference to just the object's identifier; This might make it easier for Python's GC to do its job at the expense of having to look up an object by its key.

@dataclass
class TreeNode:
    """Internal data structure to represent an object in a tree"""
    model: str  # Model name
    id: str  # object key
    parent_model: None | str = None  # Parent model
    parent_id: None | str = None  # Parent key
    children: None | Dict[str, Set[str]] = None  # children[model_name] = Set(ids)  # Set of IDs

class DiffSync:
    # Key lookup of all objects in the backend
    all_objects: Dict[str, Dict[str, DiffSyncModel]] # all_objects[model][identifier] = DiffSyncModel()

    # Tree structure
    top_level_nodes: Dict[str, Dict[str, TreeNode]]   # top_level_nodes[model][identifier] = TreeNode(...)
    all_nodes: Dict[str, Dict[str, TreeNode]]  # all_nodes[model][identifier] = TreeNode(...)

Related Issues

Kircheneer commented 11 months ago

I have a counter proposal for this one. In the past I have worked with multiple complex diffsync implementations and what I have come to notice is that children themselves may be problematic. I have instead opted for an approach where we use a custom Diff class with custom ordering that should be generalizable to where children aren't needed. After all, the only problem they solve is ordering of operations, and if we can do away with their complexity and still solve that problem I think its a win. What do you think @jamesharr?

See https://github.com/networktocode/diffsync/issues/97 for a code example.

jamesharr commented 11 months ago

I have a counter proposal for this one. In the past I have worked with multiple complex diffsync implementations and what I have come to notice is that children themselves may be problematic. I have instead opted for an approach where we use a custom Diff class with custom ordering that should be generalizable to where children aren't needed. After all, the only problem they solve is ordering of operations, and if we can do away with their complexity and still solve that problem I think its a win. What do you think @jamesharr?

See https://github.com/networktocode/diffsync/issues/97 for a code example.

I think flattening the hierarchy and using a custom diff order is probably what I'm going to end up doing. I think you're right the most problems boil down to an order of operations problem, but not all.

I think native support for an arbitrary hierarchy might allow for smoother error handling (if the parent fails, don't try to process the children) and avoiding duplicate errors (deleting a parent deletes the child), but for what I'm doing that's a "nice to have"

Ty for the suggestion and link. I personally feel like storing the parent/child relationship in the Backend instead of the Model is a good idea though. It keeps the Model focussed on data and not relationships and I think it would simplify the API for adding a child object.