Experiment: divide-and-conquer (impatience diff)

numist commented 4 years ago

Now that arrow diff is back, it should be totally possible to perform a diff that divides the input into smaller chunks and recurses:

    ┌───────────────────────────────────────────┐
    │"p" "p" "i" "i" " " "*" "P" "M" "v" "."    │
    ├───────────────────────────────────────────┤
    │"p" "p" "i" " " "*" "P" "i" "." "i" "M" "v"│
    └───────────────────────────────────────────┘
                          │
                          │         Filter shared unique elements
                          ▼
              ┌───────────────────────┐
              │" " "*" "P" "M" "v" "."│
              ├───────────────────────┤
              │" " "*" "P" "." "M" "v"│
              └───────────────────────┘
                          │
                          │         Diff and keep matches
                          ▼
                ┌───────────┬───────┐
                │" " "*" "P"│"M" "v"│
                ├───────────┼───────┤
                │" " "*" "P"│"M" "v"│
                └───────────┴───────┘
                      │         |
                      │         |   Use matches to divide input
                      │         :
                      │          `────────.
                      │                    ╲
                      │                     :
      Diff            ▼          Diff       ▼    Diff
┌───────────────┬───────────┬───────────┬───────┬───┐
│"p" "p" "*" "i"│" " "*" "P"│           │"M" "v"│"."│
├───────────────┼───────────┼───────────┼───────┼───┤
│"p" "p"     "i"│" " "*" "P"│"i" "." "i"│"M" "v"│   │
└───────────────┴───────────┴───────────┴───────┴───┘

result: -"*"               +"i" +"." +"i"       -"."

At a high level, this is basically what patience diff does, but using arrow (ordered set diffing) instead of myers for computing the pivots (hence: impatience) and recursing instead of just accepting that the entirety of the subranges as different.

numist commented 4 years ago

Initial results from testBtree79ce96ab39Tof55ea8f456ByLine:

it's roughly the same wall clock time as club
it uses more comparisons (but has less overhead)
diff quality is significantly better

        ┌───────────────────────────────────────────────────────────┐
        │                          Legend:                          │
        │                                                           │
        │ 'n=' -- size of the collections being diffed              │
        │'|Δ|' -- count of individual changes in the diffs          │
        │ '==' -- count of element equality evaluations             │
        │  '⌛︎' -- wall clock time (when both algorithms exceed 10ms)│
        └───────────────────────────────────────────────────────────┘
================================= hybrid/myers =================================
n=(10518,9878):  |Δ|:2238/1832 (+22.2%)  ==:160603/1690406 (9.5%)  ⌛︎:0.25/0.95 (26.0%)
▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼   ▼ 
n=(10518,9878):  |Δ|:1832/1832 (+0.0%)  ==:323535/1690406 (19.1%)  ⌛︎:0.25/0.95 (26.6%)
================================================================================

numist commented 4 years ago

Unfortunately I don't really understand the heuristics for prudently deploying a divide and conquer approach (it would produce an inferior diff for an input like 0000001234 → 1234000000) and there are a lot of regressions (even comparing to myers) in the PRNG test testByPercentageChangedDoubledUUID.

I also suspect that a lot of the benefit promised by divide-and-conquer is already realized by club in its eager match-seeking behaviour.

Pushed impatience-diff.

numist / Diffing-Explorations

Experiment: divide-and-conquer (impatience diff) #8