mirage / irmin

Irmin is a distributed database that follows the same design principles as Git
https://irmin.org
ISC License

Is it possible to easily use Sync with a content-addressable, append-only store? #2328

Open kentookura opened 2 months ago

kentookura commented 2 months ago

Hi all,

I have a content-addressable store like so:

module Content_store = Irmin.Content_addressable.Make(Irmin_fs_unix.Append_only)(Irmin.Hash.SHA512)(Content)

I would like to architect my application such that:

It is surely possible to implement this by writing a bunch of custom code, but I was wondering if I was overlooking something in the API to easily enable this. My best lead is to use this module, but it's a bit daunting.

Thanks!

kentookura commented 2 months ago

Just found this: https://ocaml.org/p/irmin-git/latest/doc/Irmin_git/Content_addressable/index.html

kentookura commented 2 months ago

I tried using these modules, but I could not find an easy way, so my question still stands :sweat_smile:

kentookura commented 2 months ago

Sorry for the noisy issue.

Let me further illustrate my use case, as I have hit another snag.

My goal is to create two stores, a KV store for articles, and a content-addressable store for content. I have functions that replace the content of an article with its hash.

Here is a simplified version of my code:

type hash = Irmin.Hash.SHA512.t
let hash_t = Irmin.Hash.SHA512.t

type 'a node = Text of string | Transclude of 'a

and content = Content of content node list

type 'a article =
  { content: 'a;
    iri: iri;  (* https://ocaml.org/p/iri/1.0.0 *)
    metadata: (string * string) list;
  }

let store_content : 'a Content_store.t -> content -> hash = ...

let hash_article : 'a Content_store.t -> content article -> hash article =
  fun store {content; _;} ->
  let hash = store_content store content in
  {content = hash; ...}
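The hash-for-content substitution above can be sketched standalone with only the standard library, where `Digest` (MD5) stands in for `Irmin.Hash.SHA512` and a `Hashtbl` for the content-addressable store (the `iri` field and Irmin plumbing are omitted here):

```ocaml
(* Stdlib-only sketch: Digest (MD5) plays the role of Irmin.Hash.SHA512,
   and a Hashtbl plays the role of the content-addressable store. *)
type 'a article = { content : 'a; metadata : (string * string) list }

let store : (Digest.t, string) Hashtbl.t = Hashtbl.create 16

(* Store a piece of content and return its hash. *)
let store_content (c : string) : Digest.t =
  let h = Digest.string c in
  Hashtbl.replace store h c;
  h

(* Replace an article's content with the hash under which it was stored. *)
let hash_article (a : string article) : Digest.t article =
  { content = store_content a.content; metadata = a.metadata }
```

The article can then be persisted in a plain KV store while its content stays retrievable by hash.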

The stores are implemented like so (with the appropriate content modules):

module Content_store =
  Irmin.Content_addressable.Make
    (Irmin_fs_unix.Append_only)
    (Irmin.Hash.SHA512)
    (Content)

module Article_store = Irmin_git_unix.FS.KV(Article)

Here we have Article.t = hash article.

The thing that makes all this work is that the Content_store is accessible via Irmin.Hash.SHA512.t, which comes with a representation. This allows me to store the hash in the Article_store. Now it seems that this falls apart when switching to the sync-capable stores:

I have succeeded in creating a git content-addressable store like so:

module Git_store = Irmin_git.Content_addressable(Git_unix.Make(Digestif.SHA512))
module Content_store = Git_store.Make(Content)

The issue is that this store is no longer indexed by Irmin.Hash.SHA512.t, but rather by Digestif.SHA512.t, and the two are not compatible.

So ultimately, my question is about how to proceed. Should I write a representation for the abstract type Digestif.SHA512.t, or is there some way around this issue that I am failing to see?

Of course, I am also interested in any other comments, feedback or advice.

Also, thanks for the work on this amazing library! :heart:

For reference, here is the real code:

https://git.sr.ht/~jonsterling/ocaml-forester/tree/16cd57f4bffaa203cc4dcaca22c1fb05d343e913/item/forester-ui/lib/Cache.ml

kentookura commented 2 months ago

@art-w You gave an example of how to implement a repr for an abstract type here: https://github.com/mirage/repr/issues/108 I have been trying to work through the example, but am now stuck at implementing the binary encoders and decoders for the hashes:

module Hash : Irmin.Contents.S with type t = Digestif.SHA512.t = struct
  type t = Digestif.SHA512.t

  let t : Digestif.SHA512.t Repr.ty =
    let open Digestif.SHA512 in

    let pp h hash =
      Format.fprintf h "%s" (to_raw_string hash)
    in

    let of_string str =
      try Scanf.sscanf str "%s" (fun str -> Ok (of_raw_string str))
      with _ -> Error (`Msg "of_string")
    in

    let encode_json encoder hash =
      let (`Ok | _) = Jsonm.encode encoder (`Lexeme `Os) in
      let (`Ok | _) = Jsonm.encode encoder (`Lexeme (`String (to_raw_string hash))) in
      let (`Ok | _) = Jsonm.encode encoder (`Lexeme `Oe) in
      ()
    in
    let decode_json decoder =
      (* Must be able to decode the output of [encode_json] *)
      let (let*) = Result.bind in
      let* () = Repr.Json.decode decoder |> function `Lexeme `Os         -> Ok () | _ -> Error (`Msg "decode_json") in
      let* v  = Repr.Json.decode decoder |> function `Lexeme (`String v) -> Ok v  | _ -> Error (`Msg "decode_json") in
      let* () = Repr.Json.decode decoder |> function `Lexeme `Oe         -> Ok () | _ -> Error (`Msg "decode_json") in
      Ok (of_raw_string v)
    in

    (* This is as far as I got *)

    let encode_bin (hash : Digestif.SHA512.t) write =
      let encoder =
        Repr.Binary.Bytes.encode (`Fixed digest_size) |> Repr.Staging.unstage (* ??? *)
      in
      write hash
    in

    let decode_bin buffer at =
      (* Must be able to decode the output of [encode_bin] *)
      let len = Repr.Binary.Varint.decode buffer at in
      let v = String.sub buffer !at len in
      at := !at + len ;
      v 
    in
    let size =
      Repr.Size.custom_dynamic
        ~of_value:(fun foo ->
          (* Optional, precompute the size that will be used by [encode_bin] *)
          let len = String.length foo in
          let varint_len_size = ref 0 in
          Repr.Binary.Varint.encode len
            (fun s -> varint_len_size := !varint_len_size + String.length s) ;
          !varint_len_size + len)
        ~of_encoding:(fun str offset ->
          (* You can skip this function, it's unused nowadays *)
          let at = ref offset in
          let len = Repr.Binary.Varint.decode str at in
          let varint_len_size = !at - offset in
          varint_len_size + len)
        ()
    in
    let pre_hash =
      (* Same as [encode_bin], unless the binary serialization has changed
         but we want to preserve the previous cryptographic hash for
         backward compatibility. *)
      encode_bin
    in

    let equal = equal in
    let compare = unsafe_compare in
    (* let equal a b = String.equal a.v b.v in *)
    (* let compare a b = String.compare a.v b.v in *)
    let short_hash ?seed foo =
      match seed with
      | None -> Hashtbl.hash foo
      | Some seed -> Hashtbl.seeded_hash seed foo
    in
    Repr.abstract
      ~pp ~of_string
      ~json:(encode_json, decode_json)
      ~bin:(encode_bin, decode_bin, size)
      ~pre_hash
      ~equal ~compare ~short_hash
      ()

  let merge = Irmin.Merge.(option (idempotent t))
end

kentookura commented 2 months ago

I've succeeded in writing the repr for the hash type.

I now have the following module structure:

module Git_store = Irmin_git.Content_addressable(Git_unix.Make(Digestif.SHA512))
module Content_store = Git_store.Make(Content)

where Content_store has the following interface:

sig
  type 'a t = 'a Content_store.t
  type key = hash
  type value = Content.t
  val mem : [> Irmin.Perms.read ] t -> key -> bool
  val find : [> Irmin.Perms.read ] t -> key -> value option
  val add : [> Irmin.Perms.write ] t -> value -> key
  val unsafe_add : [> Irmin.Perms.write ] t -> key -> value -> unit
  val close : 'a t -> unit
  val batch : Irmin.Perms.read t -> ([ `Read | `Write ] t -> 'a) -> 'a
end

I am now wondering how to actually obtain a store. The docs say "Values will be stored into .git/objects.", but how do I obtain a value of type 'a Content_store.t?

Furthermore, I still don't know how to obtain a syncable store from the content-addressable git store.

art-w commented 2 months ago

Hello! Thanks for the report, we are looking into the lack of constructor for the content-addressable store as it seems like an oversight...

In the meantime, regarding your previous questions:

Did you consider Repr.(map string) of_raw_string to_raw_string to reuse the existing string Repr.t instead of implementing all the abstract interface by yourself? (sorry for the suggestion if you already tried but I would love to know why you prefer the low-level alternative then? ^^)
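For concreteness, the suggestion looks roughly like this for SHA512 (a sketch assuming the `repr` and `digestif` libraries; the thread later shows the same expression for SHA1):

```ocaml
(* Derive a Digestif.SHA512.t representation from the existing [string]
   one via the raw-string conversions, instead of Repr.abstract. *)
let hash_t : Digestif.SHA512.t Repr.t =
  Digestif.SHA512.(Repr.(map string) of_raw_string to_raw_string)
```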

My goal is to create two stores, a KV store for articles, and a content-addressable store for content. I have functions that replace the content of an article with its hash.

Oh this is actually how Irmin is implemented! What is your motivation for storing the content separate from the iri link / metadata? It's a shame that irmin-git lacks customizable metadata like the other backends due to its compatibility with git, but you could probably store both together (?) or split it into two entries in the KV store? (with one folder for contents, one folder for articles)

(I'm asking because I've a small doubt that your separate CA store will be replicated by git fetch algorithm, since it doesn't "see" that the contents are referenced from the articles by their hash... but I'll check if it works and update shortly)

art-w commented 2 months ago

As I was afraid, the objects stored in the git repo via Irmin_git.Content_addressable are free-standing: as they are not reachable from any commit by the git algorithms, they don't get replicated from/to remotes when using fetch/push.

Can you tell me a bit more about why you wanted to use Content_addressable directly instead of just Irmin KV?

dinosaure commented 2 months ago

It's a shame that irmin-git lacks customizable metadata like the other backends due to its compatibility with git

Just for the record, you can associate metadata with a commit (a gpg signature, for example) and stay compatible with Git. Such metadata is not "scannable" (in the sense that it consists only of values, not pointers; it seems to me that Git LFS uses this metadata), but it may be worthwhile for Irmin to make use of this possibility.

art-w commented 2 months ago

Oh neat, thanks for the tip! So we could lift this type restriction to allow users to store any metadata with their commits! But this wouldn't help @kentookura though, as the git file metadata still appears to be restricted to file kind/permissions :/

kentookura commented 2 months ago

Did you consider Repr.(map string) of_raw_string to_raw_string to reuse the existing string Repr.t instead of implementing all the abstract interface by yourself? (sorry for the suggestion if you already tried but I would love to know why you prefer the low-level alternative then? ^^)

Ah, I didn't know about Repr.map... That is indeed much simpler.

Oh this is actually how Irmin is implemented! What is your motivation for storing the content separate from the iri link / metadata? It's a shame that irmin-git lacks customizable metadata like the other backends due to its compatibility with git, but you could probably store both together (?) or split it into two entries in the KV store? (with one folder for contents, one folder for articles)

We (pinging @jonsterling) want to enable what we are calling "federated forestry". In short, we want to enable our users to transclude the content of their collaborators. Forester is tag-based, meaning that each tree has a permanent unique identifier. On the other hand, the content of the tree might change over time, so the architecture I envision will allow us to do stuff like notifying the user if remote content has changed.

Now as for the reason for splitting the stores up like this: I thought it was a reasonable approach given the limitations the API imposes on me as a programmer. One content type should be indexed by the unique identifiers, whereas the other content type should only ever be referred to by hash. If I just put articles in the store, then the content is not retrievable by itself, right?

kentookura commented 2 months ago

Getting more into the weeds, we have this notion of content target that controls what is actually being transcluded, so it would be nice for the store to be able to directly store the smallest unit of transclusion.

That is to say: My reason for not simply storing the unmodified articles is that (I think/assume that) storing the article will not allow me to retrieve a subcomponent of the article by hash?

If I am wrong, does this mean that when I have a representation of a type that contains other representations, are all components of this type retrievable via hash?

art-w commented 2 months ago

Thanks for the details, I love that you are bringing transclusions to federations!

I could not find a reference to hashes in the code you linked, so reading between the lines I'm guessing the content hash is intended to track if the transcluded article has been updated since, to inform the user of the change? (Please correct me if I'm wrong!)

A couple of thoughts, in the hope that they inspire some ideas even if I turn out to have completely misunderstood your problem (sorry, the examples got very long):

type hash = Digestif.SHA1.t (* = the default hash algorithm used by [Irmin_git_unix.FS.KV] *)
let hash_t = Digestif.SHA1.(Repr.(map string) of_raw_string to_raw_string)

type transclusion = { tag : string; expected_hash : hash } [@@deriving irmin]
type node = Text of string | Transclude of transclusion [@@deriving irmin]
type content = Content of node list [@@deriving irmin]

type article = { content : content; metadata : (string * string) list }
[@@deriving irmin]

module Store = Irmin_git_unix.FS.KV (struct
  type t = article [@@deriving irmin]

  let merge = Irmin.Merge.(option (idempotent t))
end)

open Lwt.Syntax

(* these type definition with [@@deriving irmin] are only there for debug prints *)
type transclusion_error =
  [ `Deleted of article
  | `Updated of article * article
  | `Original_article_lost of article
  | `Unknown_article of string ]
[@@deriving irmin]

type transcluded_result = (article, transclusion_error) result
[@@deriving irmin]

let get_transcluded repo branch { tag; expected_hash } =
  let* article = Store.Contents.of_key repo expected_hash in
  let* latest_hash = Store.hash branch [ tag ] in
  match (article, latest_hash) with
  | Some article, Some latest_hash when latest_hash = expected_hash ->
      (* transcluded article is uptodate *)
      Lwt.return (Ok article)
  | Some article, None ->
      (* we still have access to the originally transcluded article contents,
         but can inform the user that it was deleted in more recent versions *)
      Lwt.return (Error (`Deleted article))
  | Some old_article, Some latest_hash ->
      (* inform the user that the article was updated since the transclusion *)
      let+ latest_article =
        Lwt.map Option.get @@ Store.Contents.of_key repo latest_hash
      in
      Error (`Updated (old_article, latest_article))
  | None, Some latest_hash ->
      (* should not happen... indicates that the git history
         was rewritten + git gc was run to force-fully erase
         the previous article contents from existence *)
      let+ latest_article =
        Lwt.map Option.get @@ Store.Contents.of_key repo latest_hash
      in
      Error (`Original_article_lost latest_article)
  | None, None ->
      (* should not happen either, either history rewrite like above
         or it might indicate that we are trying to transclude from
         an unknown remote *)
      Lwt.return (Error (`Unknown_article tag))

let date () = Int64.of_float (Unix.gettimeofday ())

let store_article t tag article =
  let message = Printf.sprintf "Update %s" tag in
  let info () = Store.Info.v ~author:"art-w" ~message (date ()) in
  Store.set_exn t ~info [ tag ] article

let main () =
  let* repo = Store.Repo.v @@ Irmin_git.config "/tmp/test-remote" in
  let* main = Store.main repo in

  (* create a first article *)
  let aw_0001 = { metadata = []; content = Content [ Text "first" ] } in
  let* () = store_article main "aw-0001" aw_0001 in

  (* compute hash of article, either from the article or its tag: *)
  let hash_aw_0001 = Store.Contents.hash aw_0001 in
  let* hash_aw_0001' = Store.hash main [ "aw-0001" ] in
  assert (hash_aw_0001' = Some hash_aw_0001);

  (* create a second article with a transclusion *)
  let aw_0002 =
    {
      metadata = [];
      content =
        Content [ Transclude { tag = "aw-0001"; expected_hash = hash_aw_0001 } ];
    }
  in
  let* () = store_article main "aw-0002" aw_0002 in

  (* when computing the transclusion to render the second article: *)
  let* tr =
    get_transcluded repo main { tag = "aw-0001"; expected_hash = hash_aw_0001 }
  in
  Format.printf "%a@." (Irmin.Type.pp transcluded_result_t) tr;
  (* => prints "Ok article_contents" since the transclusion is uptodate *)

  (* if we update the first article, then its hash will change: *)
  let aw_0001_upd = { metadata = []; content = Content [ Text "UPDATED" ] } in
  let* () = store_article main "aw-0001" aw_0001_upd in
  assert (Store.Contents.hash aw_0001_upd <> hash_aw_0001);

  (* so if we retry to render the second article, we get a warning that the transcluded content has been updated *)
  let* tr =
    get_transcluded repo main { tag = "aw-0001"; expected_hash = hash_aw_0001 }
  in
  Format.printf "%a@." (Irmin.Type.pp transcluded_result_t) tr;
  (* => prints "Error (`Updated (old_article, new_article))" *)

  (* to accept the changes, update the second article with the new expected_hash *)
  Lwt.return_unit

let () = Lwt_main.run (main ())

A difference between this and your previous CA store approach is that the articles are reachable from the git history and so their hashes will be synced by fetch/pull/push automatically.


(I think/assume that) storing the article will not allow me to retrieve a subcomponent of the article by hash? when I have a representation of a type that contains other representations, are all components of this type retrievable via hash?

If I understand you correctly then no, if you store some Contents.t in Irmin then the sub-values reachable by traversing your Contents.t OCaml value are not themselves accessible by their content hash. You have to get your contents from the store, and then do the traversal in OCaml if you were only interested in a sub-part of that value. (but see below for a way to achieve this nevertheless!)
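Such an in-OCaml traversal of a retrieved value can be sketched with simplified versions of the thread's types (stdlib-only; the `transclusions` and `text_of` helpers are illustrative, not part of any library):

```ocaml
(* After fetching a Contents.t from the store, sub-parts are extracted by
   traversing the OCaml value. Types are simplified from the thread. *)
type node = Text of string | Transclude of string
type content = Content of node list

(* Collect the transclusion targets referenced by a content value. *)
let transclusions (Content nodes) =
  List.filter_map (function Transclude t -> Some t | Text _ -> None) nodes

(* Concatenate only the textual parts. *)
let text_of (Content nodes) =
  String.concat ""
    (List.filter_map (function Text s -> Some s | Transclude _ -> None) nodes)
```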


we have this notion of content target that controls what is actually being transcluded, so it would be nice for the store to be able to directly store the smallest unit of transclusion.

I don't know if that's actually required considering that your article contents are expected to be small (measured in ~kb?) before their transcluded articles are "inlined" (which I assumed is only done when rendering to xml). At this small scale it's unlikely to bring any performance benefit to directly access the smallest unit of transclusion (but it has a cost since it implies more content-hash are indexed). But in case you really need it, a solution is to store mixed-contents into your Irmin store and to break your article values into multiple entries:

(* ... *)
type article = {
  content : content;
  title : string;
  metadata : (string * string) list;
}
[@@deriving irmin]

(* to represent the fact that we are going to store the sub-components of an article in irmin: *)
module Internal = struct
  type t =
    | Content of content
    | Title of string
    | Metadata of (string * string) list
  [@@deriving irmin]

  let merge = Irmin.Merge.(option (idempotent t))
end

module Store = Irmin_git_unix.FS.KV (Internal)

let date () = Int64.of_float (Unix.gettimeofday ())

(* utility function to split an [article] into its sub-components before storing it *)
let store_article db ~tag { content; title; metadata } =
  let message = Printf.sprintf "Update %s" tag in
  let info () = Store.Info.v ~author:"art-w" ~message (date ()) in
  Store.set_tree_exn ~info db [ tag ]
  @@ Store.Tree.of_concrete
  @@ `Tree
       [
         ("title", `Contents (Internal.Title title, `Normal));
         ("metadata", `Contents (Internal.Metadata metadata, `Normal));
         ("content", `Contents (Internal.Content content, `Normal));
       ]

open Lwt.Syntax
open Lwt.Infix

(* utility functions to query only a sub-component from an article *)
let get_title db tag =
  Store.get db [ tag; "title" ] >|= function
  | Internal.Title title -> title
  | _ -> invalid_arg "expected title"

let get_metadata db tag =
  Store.get db [ tag; "metadata" ] >|= function
  | Internal.Metadata metadata -> metadata
  | _ -> invalid_arg "expected metadata"

let get_content db tag =
  Store.get db [ tag; "content" ] >|= function
  | Internal.Content content -> content
  | _ -> invalid_arg "expected content"

(* or fetch everything at once: *)
let get_article db tag =
  let* root = Store.get_tree db [ tag ] in
  let+ title = Store.Tree.get root [ "title" ]
  and+ metadata = Store.Tree.get root [ "metadata" ]
  and+ content = Store.Tree.get root [ "content" ] in
  match (title, metadata, content) with
  | Internal.Title title, Internal.Metadata metadata, Internal.Content content
    ->
      { title; metadata; content }
  | _ -> invalid_arg "type error"

(* test *)
let main () =
  let* repo = Store.Repo.v @@ Irmin_git.config "/tmp/test-remote" in
  let* main = Store.main repo in

  let aw_0001 =
    {
      title = "first";
      metadata = [ ("date", "2024-08-03") ];
      content = Content [ Text "test" ];
    }
  in
  let* () = store_article main ~tag:"aw-0001" aw_0001 in

  let* t = get_title main "aw-0001" in
  Format.printf "Title: %S@." t;

  let* art = get_article main "aw-0001" in
  Format.printf "Article: %a@." (Irmin.Type.pp article_t) art;

  Lwt.return_unit

let () = Lwt_main.run (main ())

As you can see, this requires a bit of setup with utility functions to store and get stuff from Irmin. (In the future, we would like to improve on the type-safety in Irmin API to remove the need for invalid_arg "expected <subtype>" so let us know if you choose that option!)

When combined with the previous code example, this splitting will allow you to differentiate between the "article hash" (= content+metadata) and the "content hash" (which is what you were trying to get the hash of in your initial post?). I do not understand why the content hash is more important than the rest, as it seems to me that changes to the title/author/etc. are also relevant (?), and those are only captured by the "article hash".

(When splitting, your article becomes a "folder" but you can still get its merkle hash (which captures the hash of all of its children title/metadata/content) with Store.hash t [tag])


Sorry this got so long... Do not hesitate to correct my misunderstandings or share additional details/constraints about your system :)

kentookura commented 2 months ago

I could not find a reference to hashes in the code you linked, so reading between the lines I'm guessing the content hash is intended to track if the transcluded article has been updated since, to inform the user of the change? (Please correct me if I'm wrong!)

Ah, we have several branches that we are working on: https://git.sr.ht/~jonsterling/ocaml-forester/refs My work is on this branch

At this point I'd really like for @jonsterling to pitch in :)

Sorry this got so long... Do not hesitate to correct my misunderstandings or share additional details/constraints about your system :)

This is extremely helpful, nothing to be sorry for!

I will come back to you once I've explored the options a bit.

kentookura commented 2 months ago

@art-w I am wondering if it would be helpful to you if I report any issues I encounter when using the eio branch of irmin, or if it is too early to experiment with that branch.

art-w commented 2 months ago

Yes yes! Any feedback or issue report on the Eio PRs would be extremely welcome! We are planning a minor release with bugfixes for the current lwt version, but then the eio branch will become the main one. If you are already experimenting with it, I think you should use the version from https://github.com/mirage/irmin/pull/2316 (so the follow-up on the initial eio port, which fixes some known issues).

(Thanks for your links btw, I see a lot more content hashes there but haven't had time to dig into it yet :) )

kentookura commented 2 months ago

@art-w Using mirage-crypto-rng-eio takes up to 19 seconds to start, but it works. I'll try to create a reproduction example sometime.

kentookura commented 2 months ago

@art-w @jonsterling Do you think it is possible/practical to use Sync to track multiple remotes using a single store?

We are introducing the notion of a "host" to forester to prepare for some features we hope to implement in irmin: https://git.sr.ht/~jonsterling/ocaml-forester/commit/a2c0537b0b61aa7270ea40fbea2c0d998c6addf6

Suppose the user configures some remote hosts foo and bar, which point to some irmin stores served via http. The question I have is if it is possible to use Sync to maintain directories "foo" and "bar" in my local store which are kept in sync with the remotes, or if it is better to just set up multiple stores.

PS.: If you are interested, the latest changes in our work regarding irmin are here: https://git.sr.ht/~jonsterling/ocaml-forester/log/irmin-datastore

art-w commented 2 months ago

Using mirage-crypto-rng-eio takes up to 19 seconds to start

Hm that is concerning. Sorry for the late reply, but I could not reproduce nor see anything in the code that could explain this so I didn't know how to respond... Can you confirm that the following is slow on your machine? (it takes at most 1ms on mine)

let () =
  Eio_main.run @@ fun env ->
  let t0 = Eio.Time.now env#clock in
  Mirage_crypto_rng_eio.run (module Mirage_crypto_rng.Fortuna) env @@ fun () ->
  let t1 = Eio.Time.now env#clock in
  Format.printf "%fs@." (t1 -. t0)

And that the following is not? (with the mirage-crypto-rng.unix dependency)

let () = Mirage_crypto_rng_unix.initialize (module Mirage_crypto_rng.Fortuna)

In which case, it should be fine for you to use the Mirage_crypto_rng_unix initialization instead of the Eio one, and we'll bubble the issue to the competent authorities to investigate why the eio version is so slow...


It should be possible to support the directory replication that you describe for the remotes in the local store, by using fetch to acquire the latest commit from the remote and updating the corresponding folder. In pseudo-code:

Store.set_tree_exn ~info local_store ["remotename"] @@ Commit.tree @@ Result.get_ok @@ Sync.fetch remote

But I think it'll be easier and more robust to use branches instead (like we would do when using git with remotes), as that would also preserve commit history of the remote:

Store.Branch.set local_store "remotename" @@ Result.get_ok @@ Sync.fetch remote
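Expanded a little, the branch-based variant could look like this (a hedged sketch assuming irmin 3.x's Lwt-based Sync API; the `hosts` list of name/remote pairs is hypothetical, built elsewhere, e.g. from the configured remote URIs):

```ocaml
(* Mirror each remote's head into a local branch named after the host. *)
module Sync = Irmin.Sync.Make (Store)

let sync_hosts repo (hosts : (string * Irmin.remote) list) =
  Lwt_list.iter_s
    (fun (name, remote) ->
      let open Lwt.Syntax in
      let* local = Store.of_branch repo name in
      let* res = Sync.fetch local remote in
      match res with
      | Ok (`Head commit) -> Store.Branch.set repo name commit
      | Ok `Empty -> Lwt.return_unit (* remote has no head yet *)
      | Error _ -> Lwt.return_unit (* log/report the fetch error in real code *))
    hosts
```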

(Sorry I haven't had time to check your code, I'll try to update as soon as I can have a look)