mirage / irmin

Irmin is a distributed database that follows the same design principles as Git
https://irmin.org
ISC License
1.84k stars 154 forks source link

[RFC] heterogeneous Irmin stores via effectful optics #909

Open craigfe opened 4 years ago

craigfe commented 4 years ago

This is a mirror of a document available at CraigFe/ocaml-generics.

RFC - Heterogeneous Irmin stores via Effectful Optics

Problems with Irmin at present

Irmin is parameterised over a user-supplied data type to be used for representing leaves (blobs) of the store. The blob type b must be passed with:

With this API, all blobs in the store must have the same OCaml representation. This leads to unpleasant interactions when attempting to store different types of data in different regions of the store. In particular, we have observed two different issues:

  1. the user knows the high-level type structure of their store.

Application constraints quite often lead to certain regions of the Irmin store having pre-determined type:

camels/dromedary/**  # All blobs have type 'dromedary'
camels/bacrian/**    # All blobs have type 'bactrian'
cacti/**             # All blobs have type 'cactus'

With the current API, the only thing to be done is to make a tagged union and assert false in the appropriate places:

type blob = Dromedary of dromedary | Bactrian of bactrian | Cactus of cactus
[@@deriving irmin]

let get_spike () =
  Store.get s [ "cacti"; "dangerous"; "spike" ] >|= function
  | Cactus c -> c
  | _ -> assert false (* never occurs in a well-formed application *)

let set_camelus c =
  Store.set ~info s [ "camels"; "bactrian"; "camelus" ] (Bactrian c)
  (* must tag with `Bactrian` and promise never to give the wrong tag *)

This works, but has three issues

  1. the user knows the low-level type structure of their store.

The type of a blob may also be determined by a suffix of the path used to access it, for example:

**/*.md     # All have type 'markdown'
**/*.txt    # All have type 'text'

Again, we can get around this at the application level by using a sum type for the blobs:

type file = Markdown of markdown | Text of text [@@deriving irmin]
(* where `markdown` and `text` have appropriate serialisers defined *)

let get_markdown name =
  Store.get s [ "files"; (name ^ ".md") ] >|= function
  | Markdown m -> m
  | _ -> assert false

but the three above are still prevalent. In this particular example, we might particularly care about introducing serialised tags, since it would cause the .md files to no longer be parse-able as Markdown files when using a file-system backend for Irmin.

Proposed solution

Optical get/set API

A better API to solve problem #1 might look like the following:

let get_spike () = Store.get s (cacti / steps ["dangerous"; "spike"])
(* path is typed to return cactuses ∴ no need to untag & assert false*)

let set_camelus c = Store.set ~info s (camels / bactrian / step "camelus") c
(* path is typed to set bactrians ∴ no need to tag *)

where cacti, camels, bactrian are some form of 'first-class projection' that can be used to get/set a sub-component of a complex type. Such objects are typically referred to as optics. In this case, the 'complex' type is store as a product type defined by:

type camel_store = {
  dromedary: dromedary tree;
  bactrian: bactrian tree;
}

type desert_store = {
  camels: camel_store
  cacti: cactus tree
}

where 'b tree is a standard Irmin tree object with a single blob type 'b and indexed by a sequence of 'steps' (optics over product types are known as 'lenses').

Similarly, we can imagine a better API for the file-system case:

let get_markdown name = Store.get s (step "files" / name / md)

This is effectively the same as what we were doing before, with two differences:

Types of tree node

To generalise from these examples, the proposal is to generalise from having the user define 'blob' types to defining the type structure of the entire store. We will have three types of tree object, which (might) be nested arbitrarily inside each other:

val tree : 'a Irmin.Type.t -> 'a tree Irmin.Type.t

(* blob projections from a sequence of steps *)
val steps : string list -> ('a heterogeneous_store, 'a) lens

This has the advantage of easily reproducing the old behaviour as b tree while also allowing arbitrary algebraic type structure to be encoded into the Irmin store via generics. The old API can be preserved as a simplified Contents functor that uses b tree as the store type internally and hides all uses of optics.

Monad-parameterised optics

A key advantage of using optics as an Irmin interface is to represent the indirection of the content-addressable store: accessing/updating the field of a 'store'-d record requires access to a content-addressable heap that may have its own effect monad. We can provide an optics library that is parameterised over these effects, and then specialise them for Irmin:

module Lens_with_effect (Effect : sig

  (* Monadic effects in the heap *)
  type +'a io
  val return : 'a -> 'a io
  val bind : 'a io -> ('a -> 'b io) -> 'b io

  (* Reference / hash in the content-addressable heap *)
  type +'a addr
  val deref : 'a addr -> 'a io
  val update : 'a addr -> ('a -> 'b) -> 'b addr io
end) =
struct
  open Effect

  type (-'s, +'t, +'a, -'b) lens

  (* Composition with indirection *)
  val ( / ) :
    ('a, 'b, 'c addr, 'd addr) lens ->
    ('c, 'd, 'e addr, 'f addr) lens ->
    ('a, 'b, 'e addr, 'f addr) lens

  (* Get/set with effect and indirection *)
  val view : ('s, 't, 'a addr, 'b addr) t -> 's -> 'a io
  val modify : ('s, 't, 'a addr, 'b addr) t -> ('a -> 'b) -> 's -> 't io
end

There are many such constructions that achieve the same expressivity: if anyone has an idea of a principled way of selecting one, please let me know :slightly_smiling_face:

Generic construction of optics

It is possible to derive the necessary lenses and prisms with generics, avoiding introducing any additional boilerplate:

open Type

(* First example -- defining generics by hand *)
let camel_store, Lens.[ dromedary; bactrian ] = (* Get the generic and the lenses *)
  record "camel_store" (fun dromedary bacrian -> { dromedary; bactrian })
  |+ field "dromedary" (tree dromedary) (fun s -> s.dromedary)
  |+ field "bactrian" (tree bactrian) (fun s -> s.bactrian)
  |> sealr_lens                                        (* Seal with lens *)

let desert_store, Lens.[ camels; cacti ] =
  record "desert_store" (fun dromedary bacrian -> { dromedary; bactrian })
  |+ field "camels" camel_store (fun s -> s.dromedary)
  |+ field "cacti" (tree cactus) (fun s -> s.bactrian)
  |> sealr_lens

module Contents = struct
  type t = desert_store
  let t = desert_store
  let merge = undefined
end

Another (more extensive) option might be to provide optic construction for arbitrary generics.

Backend-imposed restrictions

Certain backends might want to impose restrictions / conventions about the type structure contained in the store in ways that make sense for that specific data representation. For example, the Git-compatible backends could restrict sum types to a dedicated 'extension' lens that may only be applied at the end of the path.

The solution should be functorised in such a way as to allow for this use-case.

(* Second example -- using PPX for generics

   The exact API here would depend on how we decide to specialise the optic
   interface in order to allow the user to control the FS layout precisely. Here
   I assume a sum type with annotated correspondences to file-names. *)

type file =
  | Markdown of markdown [@file "*.md"]
  | Text of text [@file "*.txt"]
[@@deriving irmin]

module Contents = struct
  type t = file tree
  let t = tree file
  let merge = undefined
end

The big picture

Disregarding backend-specific specialisations, the new API might look something like this:

module type STORE = sig
  type contents (** user-defined top-level store type *)

  (** Interactions with the store **)

  val modify : info:Info.f -> t -> (contents, 'a) optic -> ('a -> 'a) -> (t, write_error) result Lwt.t
  val set :    info:Info.f -> t -> (contents, 'a) optic ->        'a  -> (t, write_error) result Lwt.t

  val merge :
    info:Info.f ->
    t ->
    ?old:'a ->
    (contents, 'a) optic ->
    'a option ->
    (t, write_error) result Lwt.t

  val get : t -> (contents, 'a) optic -> 'a option Lwt.t

  (* NOTE: we no longer need `*_tree` variants, since these are simply instances
     of optics provided by Tree (below). *)
end

We need to provide a new API for homogeneous trees too:

module type TREE = sig
  type +'a t (** homogeneous trees with blob type 'a *)
  type step  (** user-defined step type *)

  val t : 'a Irmin.Type.t -> 'a t Irmin.Type.t

  (* Low-level optics *)
  val step : step -> ('a t, 'a t) optic
  val blob : ('a t, 'a) optic

  val steps : string -> ('a t, 'a) optic
  (** [steps [s1; s2; s3]] is (step s1 / step s2 / step s3 / blob) *)
end

Advantages

Limitations

samoht commented 4 years ago

That looks great!

For extra clarity, it would be very useful if you could show how the two first examples can be encoded in this new scheme and what the API will look like for the end-user.

craigfe commented 4 years ago

@samoht: me know if you want to see extra detail in either of these two regards, or have thoughts about the FS layout API :slightly_smiling_face:

samoht commented 4 years ago

I need to think more about this but just a few quick remarks:

Also, on a more general design level, I am not sure if we need to expose the lens and optic types to the end-user. I'd like to avoid having to explain to them what are lens/optics and how they work. I feel that the current definition are specialised enough that we could just fold them into the Tree signature (e.g. have Lens == Tree++) or something similar. What do you think?

craigfe commented 4 years ago

modify is set in the current Irmin API. We should probably keep this.

Quite :slightly_smiling_face: I was imagining providing both but gave the more general one in my example. I'll update the RFC to reflect this.

What API do you see the ppx expending too?

I haven't thought about this properly yet, which is why I chose to use the PPX for that example :wink: What we need is a sum where the cases are mapped onto disjoint sets of filenames. As a very rough initial guess, we could have something like:

type file =
  | Markdown of markdown
  | Text of text

let file, Optic.[ markdown; text ] =
  variant "desert_store" (fun markdown text -> function
      | Markdown m -> markdown m
      | Text t -> text t)
  |~ case1 "markdown" markdown (fun m -> Markdown m)
  |~ case1 "text" text (fun t -> Text t)
  |> sealv_prism
  |> directory (function 
       | `Markdown m -> m ^ ".md" 
       | `Text t -> t ^ ".txt")

where file, markdown and text have the following types:

val file : (file directory) Irmin.Type.t
val markdown : string -> (file directory, markdown) prism
val text : string -> (file directory, text) prism

We could then access a particular file with: Store.get s (step "files" / markdown name), since markdown is an optic parameterised on a string to be inserted in place of the wildcard given to the [@file "*.md"] annotation.

I suspect nicer solutions are possible with new sum type generic combinators specialised for this particular use-case. It seems likely to me that several backends would expose such additional combinators in order to achieve particular store structure, where this structure is visible to the user (FS, GraphQL etc.).

It's still a bit unclear to me how these two kinds of of typed store will interact. Eg. could you have this kind of schema:

camels/dromedary/**  # All blobs have type 'dromedary'
camels/bacrian/**    # All blobs have type 'bactrian'
cacti/*.x            # All have type 'cactus.x'
cacti/*.y            # All have type 'cactus.y'

Yes, the idea is to allow arbitrary nesting of different node types in the general case (and then have individual backends impose restrictions on this when necessary). That schema could be encoded with something like.

let cactus_dir, Optic.[ x; y ] =
  variant "cactus" (fun x y -> function | X a -> x a | Y a -> y a)
  |~ case1 "x" cactus_x (fun a -> X a)
  |~ case1 "y" cactus_y (fun a -> Y a)
  |> sealv_prism
  |> directory (function
       | `Markdown m -> m ^ ".x"
       | `Text t -> t ^ ".y")

let desert_store, Lens.[ camels; cacti ] =
  record "desert_store" (fun dromedary bacrian -> { dromedary; bactrian })
  |+ field "camels" camel_store (fun s -> s.dromedary)
  |+ field "cacti" cactus_dir (fun s -> s.bactrian)
  |> sealr_lens

Also, on a more general design level, I am not sure if we need to expose the lens and optic types to the end-user. I'd like to avoid having to explain to them what are lens/optics and how they work. I feel that the current definition are specialised enough that we could just fold them into the Tree signature (e.g. have Lens == Tree++) or something similar. What do you think?

I'd like to avoid requiring an understanding of optics too, and think it's fairly easy to do so with some clever placement. The optics we need could be included in e.g. Path (for effectful optic composition (/)) and Tree (for homogeneous tree optics), and we could move the optic list syntax elsewhere to avoid needing to open either Lens or Prism.

However, I think we should still expose the generic optics in e.g. Irmin.Type for advanced users who want to use them. Even better would be to extract the generics to a separate library (I'd like to use them in my other projects :slightly_smiling_face:), but as we've discussed before that's likely to be difficult.

icristescu commented 4 years ago

Having read up a bit on optics I now understand better your proposal, I also think it's great ! I'm not sure I understand the section on the monad parametrised optics, in particular what is the interplay here with the content addressable store? Is the Effect the content addressable store?

If I understand correctly, you need this monad to compose optics? Does it mean that you cannot access *.md in cacti/ without it?

hannesm commented 4 years ago

I appreciate work on this topic. My usage of irmin is usually in the shape described above:

IIUC, the proposal above takes the first two use cases into consideration. Do you have any thoughts about the third use case?

samoht commented 4 years ago

for saving configuration, embedding a GADT (https://github.com/hannesm/gmap) where each value type depends on its key. it's also a flat store, i.e. no nesting. I wrapped the interface to provide a val get : t -> 'a key -> ('a, err) result Lwt.t (and corresponding set). All keys are known upfront (in contrast to hmap).

If you know all the keys, is it possible to enumerate them all to create a big record and use lenses? Would be great to have dynamic typed keys but I'am not sure how that could be typable. Or we need some kind of extensible records...

hannesm commented 4 years ago

I could create a big record, but would this allow single-key get and set, or would the record be get and set/updated all at once?

craigfe commented 4 years ago

Having read up a bit on optics I now understand better your proposal, I also think it's great ! I'm not sure I understand the section on the monad parametrised optics, in particular what is the interplay here with the content addressable store? Is the Effect the content addressable store?

If I understand correctly, you need this monad to compose optics? Does it mean that you cannot access *.md in cacti/ without it?

@ioana: This section of my proposal is not particularly well explained (I'm not sure what exactly to call this pattern). There are two 'effects' at play here:

(* Monad of effects in the heap *)
type +'a io
val return : 'a -> 'a io
val bind : 'a io -> ('a -> 'b io) -> 'b io

(* ~Comonad to capture indirection via hashing/dereferencing *)
type +'a addr
val deref : 'a addr -> 'a io
val update : 'a addr -> ('a -> 'b) -> 'b addr io

The point of bringing this up in the proposal is that the optics library will need to be parameterised over both internal and external effects in this way. For instance, we want to be able to compose lenses that have an internal addr indirection:

val ( / ) : ('a, 'b, 'c addr, 'd addr) lens ->
    ('c, 'd, 'e addr, 'f addr) lens ->
    ('a, 'b, 'e addr, 'f addr) lens

which requires parameterisation over the addr comonad. We also need to expose the io effect in the approriate places when the lenses are actually used:

val view : ('s, 't, 'a addr, 'b addr) t -> 's -> 'a io

There are examples of this sort of 'optics + effects' API from the Haskell world (https://hackage.haskell.org/package/lens-action-0.2.4/docs/Control-Lens-Action.html#t:Effective) but I don't know of any in OCaml (it's going to be syntactically heavyweight in OCaml because of the need for functors for higher-kinded polymorphism).

craigfe commented 4 years ago

IIUC, the proposal above takes the first two use cases into consideration. Do you have any thoughts about the third use case?

@hannesm. For the third use-case, this scheme would encode that with a large product type as @samoht suggests. I intend to restrict the scope of heterogeneous stores to only statically known type structure, to avoid needing to reason about 'unsealed' extensible type witnesses in the Irmin core (I already find these difficult enough to work with when building the generics themselves).

Not to say that extensible products/sums couldn't be provided; only that I think it introduces more complexity than could reasonably be implemented in a single iteration.

I could create a big record, but would this allow single-key get and set, or would the record be get and set/updated all at once?

I'm not sure exactly what you mean here. The idea is that 'record' tree objects would have the same underlying implementation as standard homogeneous tree objects with (step, node addr) map contents, so it should be possible to set a single key 'individually', bearing in mind that doing so will change the hash of the blob and therefore necessitate a new tree object being created in the heap (just as before).

hannesm commented 4 years ago

thanks for explanation, I'm looking forward to adjust my code using optics.

anentropic commented 3 years ago

I was looking for the recipe of how to do this... I have a store where certain key paths should be treated as JSON with a known structure (and custom merge logic) and other key paths are just opaque strings (effectively immutable)

So +1 on this RFC proposal

But my question is about the current workaround:

type blob = Dromedary of dromedary | Bactrian of bactrian | Cactus of cactus
[@@deriving irmin]

let get_spike () =
  Store.get s [ "cacti"; "dangerous"; "spike" ] >|= function
  | Cactus c -> c
  | _ -> assert false (* never occurs in a well-formed application *)

let set_camelus c =
  Store.set ~info s [ "camels"; "bactrian"; "camelus" ] (Bactrian c)
  (* must tag with `Bactrian` and promise never to give the wrong tag *)

I can see how this works for get and set, but how does merge behave in this scenario? If I commit values of different types to a branch and then merge it to master, is there any hope of using the type-specific merge implementation for each value with the current code?