Open craigfe opened 4 years ago
That looks great!
For extra clarity, it would be very useful if you could show how the two first examples can be encoded in this new scheme and what the API will look like for the end-user.
updated the RFC to include definitions of Content
modules for both of the examples (see "Generic construction of optics" and "Backend-imposed restrictions" respectively). The second of these is a bit of a cop-out since I haven't decided what API we might use for mapping homogeneous store types onto a FS hierarchy (although I'm sure that it is possible to do so).
included a possible signature for homogeneous trees and a small subset of STORE
. Use of the store module is almost unchanged, except for the elimination of *_tree
variants.
@samoht: me know if you want to see extra detail in either of these two regards, or have thoughts about the FS layout API :slightly_smiling_face:
I need to think more about this but just a few quick remarks:
modify
is set
in the current Irmin API. We should probably keep this.camels/dromedary/** # All blobs have type 'dromedary'
camels/bacrian/** # All blobs have type 'bactrian'
cacti/*.x # All have type 'cactus.x'
cacti/*.y # All have type 'cactus.y'
Also, on a more general design level, I am not sure if we need to expose the lens and optic types to the end-user. I'd like to avoid having to explain to them what are lens/optics and how they work. I feel that the current definition are specialised enough that we could just fold them into the Tree signature (e.g. have Lens == Tree++) or something similar. What do you think?
modify
isset
in the current Irmin API. We should probably keep this.
Quite :slightly_smiling_face: I was imagining providing both but gave the more general one in my example. I'll update the RFC to reflect this.
What API do you see the ppx expending too?
I haven't thought about this properly yet, which is why I chose to use the PPX for that example :wink: What we need is a sum where the cases are mapped onto disjoint sets of filenames. As a very rough initial guess, we could have something like:
type file =
| Markdown of markdown
| Text of text
let file, Optic.[ markdown; text ] =
variant "desert_store" (fun markdown text -> function
| Markdown m -> markdown m
| Text t -> text t)
|~ case1 "markdown" markdown (fun m -> Markdown m)
|~ case1 "text" text (fun t -> Text t)
|> sealv_prism
|> directory (function
| `Markdown m -> m ^ ".md"
| `Text t -> t ^ ".txt")
where file
, markdown
and text
have the following types:
val file : (file directory) Irmin.Type.t
val markdown : string -> (file directory, markdown) prism
val text : string -> (file directory, text) prism
We could then access a particular file with: Store.get s (step "files" / markdown name)
, since markdown
is an optic parameterised on a string to be inserted in place of the wildcard given to the [@file "*.md"]
annotation.
I suspect nicer solutions are possible with new sum type generic combinators specialised for this particular use-case. It seems likely to me that several backends would expose such additional combinators in order to achieve particular store structure, where this structure is visible to the user (FS, GraphQL etc.).
It's still a bit unclear to me how these two kinds of of typed store will interact. Eg. could you have this kind of schema:
camels/dromedary/** # All blobs have type 'dromedary' camels/bacrian/** # All blobs have type 'bactrian' cacti/*.x # All have type 'cactus.x' cacti/*.y # All have type 'cactus.y'
Yes, the idea is to allow arbitrary nesting of different node types in the general case (and then have individual backends impose restrictions on this when necessary). That schema could be encoded with something like.
let cactus_dir, Optic.[ x; y ] =
variant "cactus" (fun x y -> function | X a -> x a | Y a -> y a)
|~ case1 "x" cactus_x (fun a -> X a)
|~ case1 "y" cactus_y (fun a -> Y a)
|> sealv_prism
|> directory (function
| `Markdown m -> m ^ ".x"
| `Text t -> t ^ ".y")
let desert_store, Lens.[ camels; cacti ] =
record "desert_store" (fun dromedary bacrian -> { dromedary; bactrian })
|+ field "camels" camel_store (fun s -> s.dromedary)
|+ field "cacti" cactus_dir (fun s -> s.bactrian)
|> sealr_lens
Also, on a more general design level, I am not sure if we need to expose the lens and optic types to the end-user. I'd like to avoid having to explain to them what are lens/optics and how they work. I feel that the current definition are specialised enough that we could just fold them into the Tree signature (e.g. have Lens == Tree++) or something similar. What do you think?
I'd like to avoid requiring an understanding of optics too, and think it's fairly easy to do so with some clever placement. The optics we need could be included in e.g. Path
(for effectful optic composition (/)
) and Tree
(for homogeneous tree optics), and we could move the optic list syntax elsewhere to avoid needing to open either Lens
or Prism
.
However, I think we should still expose the generic optics in e.g. Irmin.Type
for advanced users who want to use them. Even better would be to extract the generics to a separate library (I'd like to use them in my other projects :slightly_smiling_face:), but as we've discussed before that's likely to be difficult.
Having read up a bit on optics I now understand better your proposal, I also think it's great !
I'm not sure I understand the section on the monad parametrised optics, in particular what is the interplay here with the content addressable store? Is the Effect
the content addressable store?
If I understand correctly, you need this monad to compose optics? Does it mean that you cannot access *.md
in cacti/
without it?
I appreciate work on this topic. My usage of irmin is usually in the shape described above:
string
as value atm, so suboptimal),.ics
is icalendar, .prop
is a property-file (s-expression atm)),val get : t -> 'a key -> ('a, err) result Lwt.t
(and corresponding set
). All keys are known upfront (in contrast to hmap).IIUC, the proposal above takes the first two use cases into consideration. Do you have any thoughts about the third use case?
for saving configuration, embedding a GADT (https://github.com/hannesm/gmap) where each value type depends on its key. it's also a flat store, i.e. no nesting. I wrapped the interface to provide a val get : t -> 'a key -> ('a, err) result Lwt.t (and corresponding set). All keys are known upfront (in contrast to hmap).
If you know all the keys, is it possible to enumerate them all to create a big record and use lenses? Would be great to have dynamic typed keys but I'am not sure how that could be typable. Or we need some kind of extensible records...
I could create a big record
, but would this allow single-key get and set, or would the record be get and set/updated all at once?
Having read up a bit on optics I now understand better your proposal, I also think it's great ! I'm not sure I understand the section on the monad parametrised optics, in particular what is the interplay here with the content addressable store? Is the
Effect
the content addressable store?If I understand correctly, you need this monad to compose optics? Does it mean that you cannot access
*.md
incacti/
without it?
@ioana: This section of my proposal is not particularly well explained (I'm not sure what exactly to call this pattern). There are two 'effects' at play here:
'a addr
: data that requires the context of a content-addressable store to be meaningful. Here shown as a comonad that runs inside the optics.'a io
: the effect monad of operations on the content-addressable store itself (likely 'a Lwt.t
in the case of Irmin). This runs outside the optics.(* Monad of effects in the heap *)
type +'a io
val return : 'a -> 'a io
val bind : 'a io -> ('a -> 'b io) -> 'b io
(* ~Comonad to capture indirection via hashing/dereferencing *)
type +'a addr
val deref : 'a addr -> 'a io
val update : 'a addr -> ('a -> 'b) -> 'b addr io
The point of bringing this up in the proposal is that the optics library will need to be parameterised over both internal and external effects in this way. For instance, we want to be able to compose lenses that have an internal addr
indirection:
val ( / ) : ('a, 'b, 'c addr, 'd addr) lens ->
('c, 'd, 'e addr, 'f addr) lens ->
('a, 'b, 'e addr, 'f addr) lens
which requires parameterisation over the addr
comonad. We also need to expose the io
effect in the approriate places when the lenses are actually used:
val view : ('s, 't, 'a addr, 'b addr) t -> 's -> 'a io
There are examples of this sort of 'optics + effects' API from the Haskell world (https://hackage.haskell.org/package/lens-action-0.2.4/docs/Control-Lens-Action.html#t:Effective) but I don't know of any in OCaml (it's going to be syntactically heavyweight in OCaml because of the need for functors for higher-kinded polymorphism).
IIUC, the proposal above takes the first two use cases into consideration. Do you have any thoughts about the third use case?
@hannesm. For the third use-case, this scheme would encode that with a large product type as @samoht suggests. I intend to restrict the scope of heterogeneous stores to only statically known type structure, to avoid needing to reason about 'unsealed' extensible type witnesses in the Irmin core (I already find these difficult enough to work with when building the generics themselves).
Not to say that extensible products/sums couldn't be provided; only that I think it introduces more complexity than could reasonably be implemented in a single iteration.
I could create a big record, but would this allow single-key get and set, or would the record be get and set/updated all at once?
I'm not sure exactly what you mean here. The idea is that 'record' tree objects would have the same underlying implementation as standard homogeneous tree objects with (step, node addr) map
contents, so it should be possible to set a single key 'individually', bearing in mind that doing so will change the hash of the blob and therefore necessitate a new tree object being created in the heap (just as before).
thanks for explanation, I'm looking forward to adjust my code using optics.
I was looking for the recipe of how to do this... I have a store where certain key paths should be treated as JSON with a known structure (and custom merge
logic) and other key paths are just opaque strings (effectively immutable)
So +1 on this RFC proposal
But my question is about the current workaround:
type blob = Dromedary of dromedary | Bactrian of bactrian | Cactus of cactus
[@@deriving irmin]
let get_spike () =
Store.get s [ "cacti"; "dangerous"; "spike" ] >|= function
| Cactus c -> c
| _ -> assert false (* never occurs in a well-formed application *)
let set_camelus c =
Store.set ~info s [ "camels"; "bactrian"; "camelus" ] (Bactrian c)
(* must tag with `Bactrian` and promise never to give the wrong tag *)
I can see how this works for get and set, but how does merge
behave in this scenario? If I commit values of different types to a branch and then merge it to master, is there any hope of using the type-specific merge
implementation for each value with the current code?
This is a mirror of a document available at CraigFe/ocaml-generics.
RFC - Heterogeneous Irmin stores via Effectful Optics
Problems with Irmin at present
Irmin is parameterised over a user-supplied data type to be used for representing leaves (blobs) of the store. The blob type
b
must be passed with:b Type.t
, defined by hand via theIrmin.Type
combinators or viappx_irmin
, to allow Irmin to derive its own pretty-printing, hashing and serialisation functions forb
.b option Merge.t
used to resolve blob-level conflicts.With this API, all blobs in the store must have the same OCaml representation. This leads to unpleasant interactions when attempting to store different types of data in different regions of the store. In particular, we have observed two different issues:
Application constraints quite often lead to certain regions of the Irmin store having pre-determined type:
With the current API, the only thing to be done is to make a tagged union and
assert false
in the appropriate places:This works, but has three issues
Dromedary
andBactrian
may have the same run-time representation, but with this API they can never share blobs in the content-addressable heap.The type of a blob may also be determined by a suffix of the path used to access it, for example:
Again, we can get around this at the application level by using a sum type for the blobs:
but the three above are still prevalent. In this particular example, we might particularly care about introducing serialised tags, since it would cause the
.md
files to no longer be parse-able as Markdown files when using a file-system backend for Irmin.Proposed solution
Optical get/set API
A better API to solve problem #1 might look like the following:
where
cacti
,camels
,bactrian
are some form of 'first-class projection' that can be used to get/set a sub-component of a complex type. Such objects are typically referred to as optics. In this case, the 'complex' type is store as a product type defined by:where
'b tree
is a standard Irmin tree object with a single blob type'b
and indexed by a sequence of 'steps' (optics over product types are known as 'lenses').Similarly, we can imagine a better API for the file-system case:
This is effectively the same as what we were doing before, with two differences:
the sum type has been reified into the Irmin store (meaning that the serialised blob need not contain a tag);
the assertion of the file's type is combined with the assertion that it exists at the location
files/<name>.md
(saving some boilerplate code). In this case, themd
value is a 'prism' (optic for sum types) that provides a sum projection (i.e.file -> markdown option
).Types of tree node
To generalise from these examples, the proposal is to generalise from having the user define 'blob' types to defining the type structure of the entire store. We will have three types of tree object, which (might) be nested arbitrarily inside each other:
'a tree -> string -> 'a tree
). This will be the underlying representation for every store (glossing over some details about serialising tags for non-suffix sum types). To the user, it appears like an abstracttree
product type with a pre-defined type combinator and lens constructor:product nodes, defining a fixed number of components (via
Irmin.Type.record
) and accessed via lenses.sum nodes, defining a fixed number of cases (via
Irmin.Type.variant
) and accessed via prisms.This has the advantage of easily reproducing the old behaviour as
b tree
while also allowing arbitrary algebraic type structure to be encoded into the Irmin store via generics. The old API can be preserved as a simplifiedContents
functor that usesb tree
as the store type internally and hides all uses of optics.Monad-parameterised optics
A key advantage of using optics as an Irmin interface is to represent the indirection of the content-addressable store: accessing/updating the field of a 'store'-d record requires access to a content-addressable heap that may have its own effect monad. We can provide an optics library that is parameterised over these effects, and then specialise them for Irmin:
There are many such constructions that achieve the same expressivity: if anyone has an idea of a principled way of selecting one, please let me know :slightly_smiling_face:
Generic construction of optics
It is possible to derive the necessary lenses and prisms with generics, avoiding introducing any additional boilerplate:
Another (more extensive) option might be to provide optic construction for arbitrary generics.
Backend-imposed restrictions
Certain backends might want to impose restrictions / conventions about the type structure contained in the store in ways that make sense for that specific data representation. For example, the Git-compatible backends could restrict sum types to a dedicated 'extension' lens that may only be applied at the end of the path.
The solution should be functorised in such a way as to allow for this use-case.
The big picture
Disregarding backend-specific specialisations, the new API might look something like this:
We need to provide a new API for homogeneous trees too:
Advantages
Limitations
runtime overhead due to generics. Extracting a blob from a tree object will now require generics, making it slower (except in the case where only primitive nodes are used). As always with generics, we might expect significant performance improvements with the use of multi-stage programming.
unable to precisely determine when access is partial. This API introduces something new: certain accesses of an Irmin store may be guaranteed to succeed, since the path from root to blob goes via no sum types. The lens API has no way to track the totality of accesses (without more powerful type-level programming features made possible by https://github.com/ocaml/RFCs/pull/5).