timtadh / zhang-shasha

Tree edit distance using the Zhang Shasha algorithm
Other
432 stars 63 forks source link

Subtree matching, not all operations are returned #53

Open BramVanroy opened 4 years ago

BramVanroy commented 4 years ago

I'm not sure whether this is expected behaviour or not. I have two trees. One tree is basically the subtree of another where some nodes are dropped (B, C, E) and where one node has shifted a level(L).

from zss import simple_distance, Node

OPERATIONS = {
    0: 'remove',
    1: 'insert',
    2: 'update',
    3: 'match'
}

A = (
    Node("A")
        .addkid(Node("B"))
        .addkid(Node("C"))
        .addkid(Node("D")
            .addkid(Node("E"))
            .addkid(Node("F"))
            .addkid(Node("G")
                .addkid(Node("H"))
                .addkid(Node("I")))
            .addkid(Node("J")
                .addkid(Node("K"))))
        .addkid(Node("L"))
    )
B = (
    Node("D")
        .addkid(Node("F"))
        .addkid(Node("G")
            .addkid(Node("H"))
            .addkid(Node("I")))
        .addkid(Node("J")
            .addkid(Node("K")))
        .addkid(Node("L"))
    )

dist, opts = simple_distance(A, B, return_operations=True)
print(dist)
# expected to have the same number operations as the max number of items in the tree
print(len(opts))
for opt in opts:
    s = OPERATIONS[opt.type]
    if opt.arg1 is not None:
        s += f"\t{opt.arg1.label}"
    if opt.arg2 is not None:
        s += f"\t{opt.arg2.label}"
    print(s)

Printed output:

5.0
10
remove  E
match   F   F
match   H   H
match   I   I
match   G   G
match   K   K
match   J   J
remove  D
match   L   L
update  A   D

When looking at the operations, though, I don't see any operations done on the labels 'B' and 'C'. Is that specific to the algorithm or the implementation?

As a comparison, here is the output of the same tree comparison with APTED:

from apted import APTED, helpers

src = "{A{B}{C}{D{E}{F}{G{H}{I}}{J{K}}}{L}}"
tgt = "{D{F}{G{H}{I}}{J{K}}{L}}"

tree1 = helpers.Tree.from_text(src)
tree2 = helpers.Tree.from_text(tgt)

apted = APTED(tree1, tree2)
ted = apted.compute_edit_distance()
maps = apted.compute_edit_mapping()

print(f"distance: {ted}")

for edit_map in maps:
    print(' -> '.join(map(str, edit_map)))

Output of apted:

distance: 5
{A{B}{C}{D{E}{F}{G{H}{I}}{J{K}}}{L}} -> {D{F}{G{H}{I}}{J{K}}{L}}
{D{E}{F}{G{H}{I}}{J{K}}} -> None
{E} -> None
{C} -> None
{B} -> None
{F} -> {F}
{G{H}{I}} -> {G{H}{I}}
{H} -> {H}
{I} -> {I}
{J{K}} -> {J{K}}
{K} -> {K}
{L} -> {L}

Thanks in advance!

BramVanroy commented 4 years ago

There was an error in my initial code of the APTED example. Now they do behave the same. The only difference seems to be that zss doesn't return all operations, unless I am missing something.

timtadh commented 4 years ago

The operations code was a external feature contribution (and I have to admit I haven't used it). There could very well be a bug in it -- and it certainly looks like B and D should have been removed. Since the distance being returned is correct my guess is the operations mapping is not correct.

BramVanroy commented 4 years ago

The operations code was a external feature contribution (and I have to admit I haven't used it). There could very well be a bug in it -- and it certainly looks like B and D should have been removed. Since the distance being returned is correct my guess is the operations mapping is not correct.

I think so, too. Unfortunately I don't have the time to look further into this. As per your suggestion over email, I will use APTED in the future.

illeatmyhat commented 1 month ago

I ran into this problem earlier. Even simple trees seem to fail, for example

from zss import Node, simple_distance
bar = Node("f")
foo = Node("f")\
    .addkid(Node("a"))\
    .addkid(Node("b"))\
    .addkid(Node("c"))\
    .addkid(Node("d"))
print(simple_distance(bar, foo, return_operations=True))
# (4.0, [<Operation Insert>, <Operation Match>])
illeatmyhat commented 1 month ago

Not really sure what the problem is or how to fix it, but comparing to implementations in other languages, there's a step that this library doesn't perform here: https://github.com/kaby76/ZhangShashaCSharp/blob/main/Tree.cs#L295-L296 but fd and partial_ops can't be indexed with i and j so can't be done here.