mirage / irmin

Irmin is a distributed database that follows the same design principles as Git
https://irmin.org
ISC License
1.83k stars 153 forks source link

Semantics of modify/delete conflict #1434

Open kayceesrk opened 3 years ago

kayceesrk commented 3 years ago

On the heels of #1433, here is an execution whose equivalent behaviour using Git would trigger a modify/delete conflict. More seriously, Irmin merge here is not commutative.

The idea here is that there is a file in LCA which is deleted in master branch and modified in wip branch. This represents concurrent modify and delete operation. Git reports a conflict as there cannot be an automated resolution procedure that favours one of the operations over the other.

(* git_test2.ml *)
#require "digestif.c"
#require "checkseum.c"
#require "irmin-unix"

open Lwt.Syntax

(* Irmin store with string contents *)
module Store = Irmin_unix.Git.FS.KV(Irmin.Contents.String)

(* Database configuration *)
let config = Irmin_git.config ~bare:false "/tmp/irmin/test"

(* Commit author *)
let author = "Example <example@example.com>"

(* Commit information *)
let info fmt = Irmin_unix.info ~author fmt

let main =
  (* Open the repo *)
  let* repo = Store.Repo.v config in

  (* Load the master branch *)
  let* master = Store.master repo in

  (* Set key "a" to "a" in "master" *)
  let* () = Store.set_exn master ~info:(info "Updating a") ["a"] "a" in

  (* Get key "a" and print it to stdout *)
  let* x = Store.get master ["a"] in
  Printf.printf "master: before remove, before merge -- a => '%s'\n" x;

  (* Create the "wip" branch *)
  let* wip = Store.clone ~src:master ~dst:"wip" in

  (* Set key "a" to "a modified" in "wip" *)
  let* () = Store.set_exn wip ~info:(info "modifying a") ["a"] "a modified" in

  (* Get key "a" and print it to stdout *)
  let* x = Store.get wip ["a"] in
  Printf.printf "wip: after modify, before merge -- a => '%s'\n" x;

  (* Remove the key "a" from "master" *)
  let* _ = Store.remove master ~info:(info "Remove a") ["a"] in

  Printf.printf "Merging master into wip\n%!";
  let* _ =
    Store.merge_into ~info:(info "Merging master into wip") master ~into:wip
  in

  (* Get key "a" and print it to stdout *)
  let* x = Store.get wip ["a"] in
  Printf.printf "wip: after modify, after merge -- a => '%s'\n" x;

  Printf.printf "Merging wip into master\n%!";
  let* _ =
    Store.merge_into ~info:(info "Merging wip into master") wip ~into:master
  in

  (* Get key "a" and print it to stdout *)
  let* x = Store.get master ["a"] in
  Printf.printf "master: after remove, after merge -- a => '%s'\n" x;

  Lwt.return ()

(* Run the program *)
let () = Lwt_main.run main
$ utop
utop # #use "git_test2.ml"
...
master: before remove, before merge -- a => 'a'
wip: after modify, before merge -- a => 'a modified'
Merging master into wip
wip: after modify, after merge -- a => 'a modified'
Merging wip into master
Exception: (Invalid_argument "Irmin.Tree.get: /a not found")
Raised at file "stdlib.ml", line 30, characters 25-45
Called from file "src/core/lwt.ml", line 1867, characters 23-26
Re-raised at file "src/core/lwt.ml", line 3027, characters 28-29
Called from file "src/unix/lwt_main.ml", line 27, characters 10-20
Called from file "src/unix/lwt_main.ml", line 114, characters 8-13
Re-raised at file "src/unix/lwt_main.ml", line 120, characters 10-13
Called from unknown location
Called from file "toplevel/toploop.ml", line 208, characters 17-27

While merging master branch into wip seems to prefer modify over delete (the read after merge returns the modified value), merging wip into master seems to prefer the delete over modify (the key is not found). This makes the Irmin merge non-commutative.

I guess the fundamental issue here is that Irmin tries to treat the tree nodes as a mergeable replicated data type (MRDT), while Git does not. This makes Irmin diverge from Git semantics. Independently, the merge on trees has bugs (merge not being commutative). One option is to actually report both #1433 and this as a merge conflict, and leave it to the user code to fix the conflicts (by ensuring that the problematic keys are updated).

samoht commented 3 years ago

This is odd -- Merge.alist (used by Node.merge) indeed prefers modify over delete (as it iterates on the existing filenames in the 3 values to merge). I don't understand why merging wip into master seems to prefer the delete over modify. I'll try to investigate this further.

Note: we could probably change this to something a bit more consistent (especially if it simplifies the semantics) - but I'd like to avoid looking at the filenames in the LCA if it is not necessary (as it's a costly operation).

kayceesrk commented 3 years ago

Note: we could probably change this to something a bit more consistent (especially if it simplifies the semantics) - but I'd like to avoid looking at the filenames in the LCA if it is not necessary (as it's a costly operation).

The question is whether Irmin wants to mirror Git semantics for merging trees. We can choose a different semantics than Git, but we should be clear about what the new semantics is.

samoht commented 3 years ago

I think it's fine to not keep the same semantic as Git - as long as the new semantics is clear, consistent and easy to explain (which I don't think the Git semantics particularly is).

samoht commented 3 years ago

Also: Node.merge and Tree.merge do no currently behave the same; Tree.merge seems to handle modify/delete conflicts better, but we need to make all of this much more consistent.

kayceesrk commented 3 years ago

What is the Node.merge and Tree.merge mentioned here? If we choose to go for a semantics that diverges from Git, we should probably write down our motivations for doing it.

Is it for explainability?

Given that we now have the formal operational semantics for merging trees that correspond to Git behaviour, I don't find it complicated :-) Perhaps a little bit of work is necessary to explain the semantics to non-PL-researchers. This can certainly be done with the help of exhaustive examples + intuition behind the choice.

I should mention that I do find Git's semantics intuitive if you start from the idea that Git doesn't want to enforce a particular merge policy when there is conflicting intent. For example, if concurrent branches modify and delete the same file (this issue), there is no automatic merge that doesn't enforce a policy -- whether modify survives or delete survives. Git tries to avoid making that decision, leaving it to the user code to fix the conflict.

Also, it is unclear that if we do go for a semantics that is not Git, it would be more explainable. In the case of concurrent modify and delete, you could prioritise modify. This choice is reminiscent of the add-wins CRDT set semantics [1]. Now, this choice certainly is automatic, but is this more explainable or intuitive? Why did we not choose delete-wins semantics? Given that we enforce a policy here, there is no way to encode the delete-wins semantics, unlike Git which leaves it to the user to decide what to do.

Is it for performance?

What are the performance metrics that we are optimising for? You mention that you want to avoid looking at filenames in the LCA. Are we certain that the costs are unacceptable for real programs? Are there benchmarks where we can evaluate the costs. What other constraints should we care about? These constraints will influence the design of the merge.

Happy to modify the formal semantics to capture the intended behaviour. CC @ssdubey.

[1] https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#OR-Set_(Observed-Remove_Set)