stedolan / crowbar

Property fuzzing for OCaml
MIT License
183 stars 28 forks source link

Design suggestion: smaller core, compositional constructors #3

Closed yallop closed 7 years ago

yallop commented 7 years ago

I'm very much looking forward to using this library across quite a few projects! The first one I have in mind is profuse, where the central data structure has some interesting invariants, fairly straightforward to express, but tricky to check. Crowbar seems like an ideal tool for such cases. I'm also looking forward to using it in a serialisation/rpc library under development at the moment.

In trying to understand the design of crowbar, I've made a few notes about things that it might be useful to express slightly differently. These are currently just suggestions, sometimes quite rough, but if you like the sound of any of them, I'm happy to turn them into pull requests. This first issue is mostly about implementation, but I think it spills over into the interface a bit, too.

datatype-centric vs compositional design

The central gen type is currently defined as a datatype, with constructors for lists, options, non-empty lists, etc. There are sometimes reasons to be cautious about defining things this way, so I'd like to suggest exploring an alternative approach.

Since there's essentially only one operation defined over the gen type, it should be straightforward to define it in a more compositional/extensible style. (This is really just the old distinction between oo-style, which makes it easy to add new types and hard to add new operations, and typed-fp-style, which makes it hard to add new types and easy to add new operations.)

In other words, rather than a definition of this form

type 'a gen =
  ...
  | Option : 'a gen -> 'a option gen
  | List : 'a gen -> 'a list gen
  | List1 : 'a gen -> 'a list gen
  ...

along with a function like this, where generation is defined by case analysis:

let rec generate : type a . int -> state -> a gen -> a * unit printer =
  fun size input gen -> match gen with
   (* ... *)
  | Option gen ->
     if read_bool input then
       let v, pv = generate size input gen in
       Some v, fun ppf () -> pp ppf "Some (%a)" pv ()
     else
       None, fun ppf () -> pp ppf "None"
  | List gen ->
   (* etc. *)
  | List1 gen ->
   (* etc. *)

I think it'd be worth considering a definition more like this:

type +'a gen = bool stream -> bool stream * 'a

along with a separate function for each constructor

let option = (* ... *)

let list = (* ... *)

There are three main benefits to doing things this way.

First, it's easy to add new constructors, either inside or outside the library, since list and option are just functions with no special status rather than constructors of a distinguished types.

Second, it keeps things compositional. With a datatype definition it's always tempting to do case analysis on subvalues, which tends to lead to uneven behaviour. (Indeed, the current definition of the generate function does inspect subvalues in at least the Choose case.) With the separate-function design it's essentially impossible to analyse subvalues, so the semantics is almost certain to end up cleaner / more "denotational".

Third, it keeps the core small. With the separate-function design the core can be tiny (see below), which makes it easier to make interesting changes to it, and easier to ensure that the core is correct. It's also a useful test of the design: if functions like list and option can easily be built on top of the core then it's likely that almost any other user-defined function can be built that way, too; on the other hand, if you start with a larger core, with everything you need provided as a built in then there's nothing to ensure that the langauge will be easy to extend outside the library, and it's more likely that you'll have to extend the core with additional operations later.

design sketch

Here's a sketch of such a design. (It's entirely possible that I've misunderstood some details -- if so, I'm keen to understand which -- but I expect the general idea can be made to work.)

The central idea is generating data from binary input, so it's natural to express the core as a monad with a single operation that reads and returns one bit, like this:

module Generate :
sig
  type +'a t
  val return : 'a -> 'a t
  val (>>=) : 'a t -> ('a -> 'b t) -> 'b t

  val bit : bool t

  val run : 'a t -> bool stream -> bool stream * 'a
end

Then everything else can be built on top of the monad operations plus bit, and so lives outside of the core. (I think it's possible to implement this interface fairly efficiently, but it's possible that operations that read multiple bits at a go might also be useful.)

The stream here is intended to be functional, where reading an element doesn't change the state of the stream, rather than imperative/OCaml-style, where Stream.next returns a new element each time. Here's a suitable definition in the even style:

type 'a stream_ = Stream of 'a * 'a stream
and 'a stream = 'a stream_ Lazy.t

Then run can be invoked with some source of bits -- either a PRNG for quickcheck-style checking, or a bit vector that is iteratively perturbed by the afl framework for fuzzing. (Again, since the source of bits is abstracted, it's straightforward to add alternative sources later, even outside the library.)

The combination of the monad operations and bit should be sufficiently expressive to define more building blocks, e.g.

val (<*>) : 'a t -> 'b t -> ('a * 'b) t
(** Combine two generators *)

val (<|>) : 'a t -> 'b t -> ('a, 'b) either t
(** Pick a generator based on the next bit in the stream *)

val option : 'a t -> 'a option t
(** Build a generator for [t option] from a generator for [t] *)
stedolan commented 7 years ago

So, Crowbar already provides a monadic interface in more-or-less this style, plus some other junk. I'll try to explain what the other junk is for, but I'm open to alternative designs.

The (>>=) monadic operation can be separated into (arguably) more primitive map and join operations, so we can also write the basic monadic interface like this:

  type +'a t
  val return : 'a -> 'a t
  val map : ('a -> 'b) -> 'a t -> 'b t
  val join : 'a t t -> 'a t

By far the most common application is to define a generator as a function of the result of a few others, for instance, defining a generator of {x : int; y : int; z : int} in terms of one for int. This can be done using >>=:

let xypair = int >>= (fun x -> int >>= (fun y -> int >>= (fun z -> {x; y; z})))

but I think that the cleanest option (and the one that requires the least nesting) is to generalise map to multiple arguments:

let xypair = map (fun x y z -> {x ; y ; z}) [int; int; int]

(Getting this to work involves mild GADT trickery abusing the :: constructor, but I think it's quite clear in use even if the internals are a bit arcane. In Crowbar, I swap the arguments, which makes things more readable by putting the generators int; int; int close to the binders x y z).

To the basic monadic interface, we can add a bunch of derived combinators, e.g:

  val option : 'a t -> 'a option t
  val list : 'a t -> 'a list t
  val choose : 'a t list -> 'a t

I strive to avoid combinators like <*> and <|> which over-emphasise arity 2 - you end up spending far too much time disentangling commas in a, (b, (c, d)). Instead, I prefer combinators like choose and the multiple-argument map which work just as cleanly for arity 3, 4, 5 as for 2.

At this stage, there's a simple monadic core (based on join and multi-argument map, from which >>= is a derived and not-too-often-used combinator), on top of which option, list, and the like are easy defined.

But in order to write generators for interesting structures, we need to handle recursion. OCaml refuses to compile the following:

let rec expr = choose [
  map [int] (fun i -> `Int i);
  map [expr; expr] (fun e1 e2 -> `Plus e1 e2)
]

because the right-hand side could do anything. The cheap trick that Crowbar uses currently is to lift the builtin forms for constructing generators to constructors, since OCaml happily accepts:

let rec expr = Choose [
  Map ([int], fun i -> `Int i);
  Map ([expr; expr], fun e1 e2 -> `Plus e1 e2)
]

This particular example could be expressed cleanly enough with fixed-point combinator, something like:

val fix : ('a t -> 'a t) -> 'a t

However, real-world examples are not so simple, and involve complex mutual recursion. While mutual recursion can be expressed using a single fixpoint combinator, uses of that fixpoint combinator are not pretty. By contrast, this partial generator for OCaml parse trees looks very like the type definitions that it's generating.

Still, it's definitely unsatisfying that Crowbar gives e.g. List a higher status than anything user-definable (You can get most of the same effects with Join and a custom generator, but not quite all). So, I'm definitely interested in alternative designs, but I consider it a requirement that it be possible to express complicated mutual recursions with code that's no uglier than at the moment (that is, it's not enough to have a fixpoint combinator which in principle expresses everything - that expression must also be readable).

I have a feeling that this can probably be done with creative abuse of Lazy, since that's the other means (as well as constructors) of sneaking code into let rec, but I haven't worked out the details yet.

dbuenzli commented 7 years ago

FWIW react's signal fix point combinator signature allows to express complex mutual recursion, you just have to let the user escape arbitrary info from the fix point function. However it is horrible to write and use. I have been thinking about maybe replacing this with a delay : 'a S.t lazy -> 'a S.t combinator.

samoht commented 7 years ago

For fix points, I found the (somehow verbose) mu* functions in Depyt easy to use in general: https://samoht.github.io/depyt/doc/Depyt.html#VALmu, e.g:

      type r = { foo: int; bar: string list; z: z option }
      and z = { x: int; r: r list }

      (* Build the representation of [r] knowing [z]'s. *)
      let mkr z =
        record "r" (fun foo bar z -> { foo; bar; z })
        |+ field "foo" int (fun t -> t.foo)
        |+ field "bar" (list string) (fun t -> t.bar)
        |+ field "z" (option z) (fun t -> t.z)
        |> sealr

      (* And the representation of [z] knowing [r]'s. *)
      let mkz r =
        record "z" (fun x r -> { x; r })
        |+ field "x" int (fun t -> t.x)
        |+ field "r" (list r) (fun t -> t.r)
        |> sealr

      (* Tie the loop. *)
      let r, z = mu2 (fun r z -> mkr z, mkz y)
yallop commented 7 years ago
let xypair = int >>= (fun x -> int >>= (fun y -> int >>= (fun z -> {x; y; z})))

It's possible to omit the parentheses here, which makes the code a bit more readable -- rather like do notation:

let xypair = int >>= fun x ->
             int >>= fun y ->
             int >>= fun z ->
             return {x; y; z}

but I think that the cleanest option (and the one that requires the least nesting) is to generalise map to multiple arguments:

let xypair = map (fun x y z -> {x ; y ; z}) [int; int; int]

Another possibility, which I personally find quite readable, is to use applicative style (defineable in terms of the monadic combinators), like this:

let xypair = pure (fun x y z -> {x ; y ; z}) <$> int <$> int <$> int

where

let pure = return
let (<$>) f x = f >>= fun g -> x >>= fun y -> return (g y)

I think there are good arguments for all three notations. And (ignoring the issues with let rec for the moment), I'd think it would be possible to define all three on top of a monadic core, although I haven't actually checked for Map.

stedolan commented 7 years ago

@samoht

For fix points, I found the (somehow verbose) mu* functions in Depyt easy to use in general

I tend to agree with @dbuenzli that these are "horrible to write and use", at least at scale. Depyt provides only mu and mu2 (for which the syntax noise isn't awful), but the example of a generator of valid OCaml parsetrees has twelve mutually recursive generators. This is quite easy to read with let rec, and mirrors the definitions in parsetree.mli, but would be awful with mu12.

@yallop

Yes, a simple monadic core would be good, but I don't think that return + >>= is any simpler a core than pure + map + join. (I recall a long-running argument in Haskell-land about this :) ). Multi-argument Map is exactly Applicative, Map and (pure, <$>) can express each other, and a monad is a functor with pure and join.

A bigger issue is that Crowbar is currently defined by a big fixed syntax tree, when as you point out an abstract type and some primitives would be better and easier to extend in user code. What's blocking this at the moment is a good way of defining complicated mutually recursive data types. I don't think this problem becomes any easier if we replace map / join with return/>>=.

stedolan commented 7 years ago

OK, handling recursive generators using lazy was easier than I imagined. Opinions on the API in #5?

stedolan commented 7 years ago

Fixed by #5