Built-in generation and serialization functionality

samcowger commented 6 years ago

Incorporate automatic derivation of generation (from thin air to memory/Haskell structures) and serialization (from memory/Haskell structures to disk) functions from PADS descriptions, to create meaningfully-formed examples of random data. Much of the generation functionality currently defaults to uniform distribution of random values across the range dictated by the type, or to a range somewhat more strict than the type (e.g. Strings include only English upper- and lowercase letters and are limited in their potential length). In some cases (currently: "obtain" declarations and individual record fields), users can provide their own generation functions to override those that would otherwise be derived.

ntc2 commented 6 years ago

Here's more context for whoever reviews this PR. @samcowger knows the details better than I do, so defer to @samcowger wherever what I say is inaccurate.

One use case for random generation we have in mind is fuzz testing.

In the conference call on 2018-07-24 we mostly talked about generation in the context of producing "realistic" data. I imagine a more common use case for typical PADS users is random testing, à la QuickCheck. For that use case, being able to generate random data easily seems important.

Everything here is backwards compatible in terms of PADS syntax.

The only syntax addition is an optional generator <| <haskell code> |> suffix on obtains and record fields.

The generator keyword for record fields is not strictly necessary, but is important pragmatically. For example, consider this simple packet type:

[pads|
  data Pkt = Pkt { size :: Word32 generator <| uniform (0,100) |> 
                 , body :: Bytes size }
|]

Without overriding the generation of the size field, the default uniform generation for Word32 would yield bodies larger than 2 GB more than half the time. We're working on an approach to allow runtime overrides of the default generation, but overriding the default statically is probably simpler in cases where it's sufficient.
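To make the effect of the override concrete, here is a minimal plain-Haskell sketch of what a bounded generator for Pkt could look like. It is not the code PADS derives: Seed, next, uniformIn, genBody, and genPkt are hypothetical names, the body is modelled as a list of Word8, and a small linear congruential generator stands in for a real PRNG so the example needs only base.

```haskell
import Data.Word (Word8, Word32)

-- Hypothetical in-memory form of the Pkt description above.
data Pkt = Pkt { size :: Word32, body :: [Word8] } deriving (Show, Eq)

-- Toy linear congruential generator (arithmetic wraps modulo 2^32).
-- An assumption for illustration only, not the PRNG PADS uses.
newtype Seed = Seed Word32

next :: Seed -> (Word32, Seed)
next (Seed s) = let s' = s * 1664525 + 1013904223 in (s', Seed s')

-- Draw from the inclusive range (lo, hi), assuming lo <= hi.
-- The modulo introduces a slight bias, which is acceptable for a sketch.
uniformIn :: (Word32, Word32) -> Seed -> (Word32, Seed)
uniformIn (lo, hi) g
  | lo == minBound && hi == maxBound = (w, g')  -- full range: no reduction needed
  | otherwise                        = (lo + w `mod` (hi - lo + 1), g')
  where (w, g') = next g

-- Generate exactly n body bytes, taking the top byte of each draw.
genBody :: Word32 -> Seed -> ([Word8], Seed)
genBody 0 g = ([], g)
genBody n g =
  let (w, g')   = next g
      (ws, g'') = genBody (n - 1) g'
  in (fromIntegral (w `div` 16777216) : ws, g'')

-- The size is drawn from (0,100) first, and the body length depends on it,
-- mirroring the dependency "body :: Bytes size" in the description.
genPkt :: Seed -> (Pkt, Seed)
genPkt g =
  let (n, g')  = uniformIn (0, 100) g
      (b, g'') = genBody n g'
  in (Pkt n b, g'')
```

Evaluating fst (genPkt (Seed 42)) in GHCi gives one packet whose body has at most 100 bytes. Swapping (0, 100) for (minBound, maxBound) reproduces the problem described above, since genBody would then usually be asked for billions of bytes.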

The generator keyword on obtain is necessary, since types related by obtain need not be isomorphic. For example:

[pads|
  type MyDate = obtain Date from StringC using <| (mkPartialConv s2d, mkTotalConv d2s) |>
|]

-- | Not all 'String's are valid 'Date's.
s2d :: String -> Maybe Date
s2d [y1,y2,y3,y4,'-',m1,m2,'-',d1,d2] = Just ...
s2d _ = Nothing

-- | But all 'Date's can be injected into 'String'
d2s :: Date -> String
d2s d = ...

The point is that being able to generate StringC (the PADS type) does not let us generate Date (the injected Haskell type): a random string is almost never a valid date, so the generator has to produce a Date directly.
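The asymmetry can be illustrated with a small self-contained sketch. Everything here is an assumption made for illustration: the thread elides the real Date type and the bodies of s2d and d2s, so the record below, the simplified validity check (months 1 to 12, days 1 to 31, no per-month check), and the genDate helper are hypothetical stand-ins.

```haskell
import Data.Char (isDigit)
import Data.Word (Word32)

-- Hypothetical stand-in for the Date type.
data Date = Date { year :: Int, month :: Int, day :: Int } deriving (Show, Eq)

-- Partial direction: only strings shaped like YYYY-MM-DD are accepted.
s2d :: String -> Maybe Date
s2d [y1,y2,y3,y4,'-',m1,m2,'-',d1,d2]
  | all isDigit [y1,y2,y3,y4,m1,m2,d1,d2]
  , m >= 1, m <= 12, d >= 1, d <= 31 = Just (Date y m d)
  where
    y = read [y1,y2,y3,y4]
    m = read [m1,m2]
    d = read [d1,d2]
s2d _ = Nothing

-- Total direction: every Date has a string form (years 0 to 9999 assumed).
d2s :: Date -> String
d2s (Date y m d) = pad 4 y ++ "-" ++ pad 2 m ++ "-" ++ pad 2 d
  where pad n x = let s = show x in replicate (n - length s) '0' ++ s

-- A user-supplied generator builds a Date directly from a seed instead of
-- generating a random String and hoping that s2d accepts it.
genDate :: Word32 -> Date
genDate s0 = Date (1900 + pick s1 200) (1 + pick s2 12) (1 + pick s3 28)
  where
    step s   = s * 1664525 + 1013904223          -- toy LCG, wraps modulo 2^32
    s1       = step s0
    s2       = step s1
    s3       = step s2
    pick s n = fromIntegral ((s `div` 65536) `mod` n)
```

With these definitions, s2d (d2s (genDate s)) == Just (genDate s) holds for every seed s, while a string such as "hello" maps to Nothing. Generating on the Date side and serializing through d2s therefore always yields valid input, which generating on the StringC side cannot guarantee.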

Many of the early generator commits include code that was eventually scrapped when we switched to a TH-based approach, so don't worry about understanding the non-TH generation code in them.

The TH-based approach to generators starts at 3a4f60f. Some commits before that, related to serialization, are probably still relevant.

The pre-TH generator code is the continuation of the implementation started by Jared and Sam last summer (?). Sam and I concluded that a non-TH approach to generators would not scale to all of PADS, because PADS descriptions can incorporate arbitrary Haskell expressions in <| ... |> brackets; without TH we'd need to write an interpreter for Haskell.

The new TH code can be understood by analogy with the existing TH code: generation is analogous to parsing, and serialization is analogous to pretty printing.
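One way to see why serialization pairs naturally with the existing code is the round-trip property that random testing would check: parsing a serialized value should give the value back. The sketch below hand-writes both directions for the Pkt example. It is not the TH-derived code, and the wire format (a 4-byte big-endian size followed by the body) is an assumption chosen for illustration; serialize, parse, and roundTrips are hypothetical names.

```haskell
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word8, Word32)

-- Hypothetical in-memory form of the Pkt description.
data Pkt = Pkt { size :: Word32, body :: [Word8] } deriving (Show, Eq)

-- Memory to bytes: assumed format is a big-endian size, then the body.
serialize :: Pkt -> [Word8]
serialize (Pkt n b) = [ fromIntegral (n `shiftR` k) | k <- [24, 16, 8, 0] ] ++ b

-- Bytes to memory: read the size, then exactly that many body bytes.
-- Returns the packet and any unconsumed input, or Nothing on short input.
parse :: [Word8] -> Maybe (Pkt, [Word8])
parse (a:b:c:d:rest)
  | length bs == fromIntegral n = Just (Pkt n bs, rest')
  where
    n           = foldl (\acc w -> acc `shiftL` 8 .|. fromIntegral w) 0 [a, b, c, d]
    (bs, rest') = splitAt (fromIntegral n) rest
parse _ = Nothing

-- The property a generator lets us test on many random packets.
-- It holds only for packets whose size field matches the body length.
roundTrips :: Pkt -> Bool
roundTrips p = parse (serialize p) == Just (p, [])
```

For example, roundTrips (Pkt 3 [1, 2, 3]) is True, and parse [0, 0, 0, 2, 9] is Nothing because the input ends before the declared body does. Feeding generated packets through roundTrips is the kind of QuickCheck-style test the generation support is meant to enable.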

cronburg commented 6 years ago

Looks good to me!

padsproj / pads-haskell