google / diff-match-patch

Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.
Apache License 2.0
7.42k stars 1.1k forks source link

How to get char positions of changed text in the original text #51

Open satheeshkatipomu opened 5 years ago

satheeshkatipomu commented 5 years ago

I want to get char positions of changed text along with change tuples. Please help if anyone implemented this?

Example in Python:

import diff_match_patch as dmp_module

dmp = dmp_module.diff_match_patch()
diff = dmp.diff_main("Hello World.", "Goodbye World.")
# Result: [(-1, "Hell"), (1, "G"), (0, "o"), (1, "odbye"), (0, " World.")]
#Desired Result:[(-1,"Hell",0,3,x,x),(0,"o",4,4,1,1),(1,"odbye",x,x,2,6),(0," World",6,11,7,13)]

Here in Desired Result "x" means don't care in "delete" and "insertion" operations.

One way I thought about this is, get the changed text from tuple and find index from original text(this will be starting position") and add length of the changed text to index found, but i think this will fail in case of repeated substring. If implementation is available in other languages also will be helpful for me.

Thanks.

dmsnell commented 4 years ago

@sathk882 can you explain what each value in your example tuples mean? It's not clear what you are asking for. Are you asking for the ranges in the original text and also in the final text of where the characters in the current diff are to be found?

If that's the case then we really just need to think about what we do with each chunk and how it impacts a running pointer into the source strings.

d = [(-1, "Hell"), (1, "G"), (0, "o"), (1, "odbye"), (0, " World.")]

def step( last_a, last_b, next ):
    op, s = next
    l = len(s)

    if -1 == op:
        return ((op, s, (last_a, last_a + l), None), (last_a + l, last_b))

    if 0 == op:
        return ((op, s, (last_a, last_a + l), (last_b, last_b + l)), (last_a + l, last_b+l))

    if 1 == op:
        return ((op, s, None, (last_b, last_b + l)), (last_a, last_b+l))

o = []
last_a = 0
last_b = 0
for chunk in d:
    next, (next_a, next_b) = step(last_a, last_b, chunk)
    last_a = next_a
    last_b = next_b
    o.append(next)

print(o)

[(-1, 'Hell', (0, 4), None), (1, 'G', None, (0, 1)), (0, 'o', (4, 5), (1, 2)), (1, 'odbye', None, (2, 7)), (0, ' World.', (5, 12), (7, 14))]

i think this will fail in case of repeated substring

Can you explain this more?