Research and optimize `Data` serialization code

Issue by: kozross Original date: 2023-06-14 21:15:43 UTC Originally opened as: mlabs-haskell/trustless-sidechain/issues/484 Original assignees: kozross Status on 2023-06-20: open

Description

Follow-on from input-output-hk/trustless-sidechain#426. Currently, we're seeing significant size blowouts when comparing scripts measured as CompiledCode versus their serialized forms. This could be due to the 'bundling' of Data deserialization code: we frequently use autogenerated instances, which are suboptimal in some cases, many of which we encounter. For example, product types are always encoded as Constr, so we end up carrying around a tag which we never need, but still have to store and match on. Furthermore, instead of re-using fromBuiltinData in UnsafeFromData, the generated code duplicates this functionality, causing much more duplication than necessary.

This somewhat supersedes input-output-hk/trustless-sidechain#477 and encompasses some parts of input-output-hk/trustless-sidechain#480.

Goals

[ ] Test whether changing Data-related instances improves serialized script size
[ ] If it does, see how large a reduction we can obtain
[ ] Document this in a guide!

Tests

As these instances are now manually written, some additional tests should be written. These should verify the following all hold:

[ ] fromBuiltinData . toBuiltinData = Just
[ ] unsafeFromBuiltinData . toBuiltinData = id

It would also be good to include some tests that 'bad' encodings fail to deserialize, but these are type-specific and may not always be practical. QuickCheck is appropriate for such tests.
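The two round-trip laws above, plus one 'bad encoding' check, could be expressed as QuickCheck properties roughly as follows. This is a minimal sketch under loud assumptions: Data, the three classes and Foo are local stand-ins that only mirror the names used by PlutusTx, so the block runs with just QuickCheck and bytestring; checkAll is a hypothetical helper, not something from the repository.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Test.QuickCheck
  (Arbitrary (arbitrary), Property, isSuccess, quickCheckResult, (===))

-- Toy stand-in for Plutus's Data type (assumption, not the real type).
data Data = Constr Integer [Data] | List [Data] | I Integer | B ByteString
  deriving (Eq, Show)

-- Local stand-ins for the PlutusTx classes of the same names.
class ToData a where toBuiltinData :: a -> Data
class FromData a where fromBuiltinData :: Data -> Maybe a
class UnsafeFromData a where unsafeFromBuiltinData :: Data -> a

data Foo = Foo Integer ByteString
  deriving (Eq, Show)

-- Handwritten, tag-free instances of the kind under test.
instance ToData Foo where
  toBuiltinData (Foo a b) = List [I a, B b]

instance FromData Foo where
  fromBuiltinData (List [I a, B b]) = Just (Foo a b)
  fromBuiltinData _ = Nothing

instance UnsafeFromData Foo where
  unsafeFromBuiltinData (List [I a, B b]) = Foo a b
  unsafeFromBuiltinData _ = error "Foo: unexpected encoding"

instance Arbitrary Foo where
  arbitrary = Foo <$> arbitrary <*> (BS.pack <$> arbitrary)

-- fromBuiltinData . toBuiltinData = Just
prop_safeRoundTrip :: Foo -> Property
prop_safeRoundTrip v = fromBuiltinData (toBuiltinData v) === Just v

-- unsafeFromBuiltinData . toBuiltinData = id
prop_unsafeRoundTrip :: Foo -> Property
prop_unsafeRoundTrip v = unsafeFromBuiltinData (toBuiltinData v) === v

-- A type-specific 'bad' encoding: the old tagged shape must be rejected.
prop_rejectsTagged :: Foo -> Property
prop_rejectsTagged (Foo a b) =
  fromBuiltinData (Constr 0 [I a, B b]) === (Nothing :: Maybe Foo)

-- Hypothetical helper: run every property, report whether all passed.
checkAll :: IO Bool
checkAll = do
  results <-
    sequence
      [ quickCheckResult prop_safeRoundTrip
      , quickCheckResult prop_unsafeRoundTrip
      , quickCheckResult prop_rejectsTagged
      ]
  pure (all isSuccess results)
```

Loading this in GHCi and running checkAll exercises all three properties. For the real instances, the same properties would be stated against PlutusTx's own classes and BuiltinData.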

Measurements of just the Data-related methods would probably be good to have also.

Related issues/PRs

input-output-hk/trustless-sidechain#63
input-output-hk/trustless-sidechain#60
input-output-hk/trustless-sidechain#558

Research result

Plutus scripts include data decoders for the data types they use, adding to script sizes. Generally, we can reduce sizes by changing all product type representations:

data Foo = Foo { x :: Integer,
                 y :: BuiltinByteString
               }

The default serialisation (by PlutusTx) is Constr 0 [x, y], while the better way is to just serialise it as the plain list [x, y].
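To make the difference between the two shapes concrete, here is a small sketch. It assumes a toy Data type that only mirrors the constructors of Plutus's Data, and the function names (encodeTagged, encodeUntagged and the two decoders) are hypothetical, chosen for illustration; the real instances would go through BuiltinData instead.

```haskell
import Data.ByteString (ByteString)
import Data.ByteString.Char8 (pack)

-- Toy stand-in for Plutus's Data type (assumption, not the real type).
data Data = Constr Integer [Data] | List [Data] | I Integer | B ByteString
  deriving (Eq, Show)

data Foo = Foo {x :: Integer, y :: ByteString}
  deriving (Eq, Show)

-- Shape of the generated encoding: a constructor tag plus the fields.
encodeTagged :: Foo -> Data
encodeTagged (Foo a b) = Constr 0 [I a, B b]

-- Tag-free shape: Foo has a single constructor, so a list is enough.
encodeUntagged :: Foo -> Data
encodeUntagged (Foo a b) = List [I a, B b]

-- The tagged decoder has to inspect and match the tag it never needs.
decodeTagged :: Data -> Maybe Foo
decodeTagged (Constr 0 [I a, B b]) = Just (Foo a b)
decodeTagged _ = Nothing

-- The untagged decoder only checks the list shape and field types.
decodeUntagged :: Data -> Maybe Foo
decodeUntagged (List [I a, B b]) = Just (Foo a b)
decodeUntagged _ = Nothing

-- Example value for experimenting in GHCi.
exampleFoo :: Foo
exampleFoo = Foo 42 (pack "ab")
```

Note that the two encodings are not interchangeable: each decoder rejects the other's output, so any off-chain code producing these values has to switch at the same time.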

With this generally applicable optimisation only, we can reduce the sizes by a considerable margin:

    fromBuiltinData:                                 OK
      Target: generated; size 828
      Measured: handwritten; size 797
      Remaining headroom: 31

Script size changes (optimised internal data types only):

Size
  Core
    mkMintingPolicy (FUEL):                          OK
      Size: 1039
    mkMintingPolicy (FUEL) serialized:               OK
      Remaining headroom: 30
    mkMintingPolicy (MerkleRoot):                    OK
      Remaining headroom: 42
    mkMintingPolicy (MerkleRoot) serialized:         OK
      Remaining headroom: 84
    mkCommitteeCandidateValidator:                   OK
      Size: 201
    mkCommitteeCandidateValidator (serialized):      OK
      Remaining headroom: 21
    mkCandidatePermissionMintingPolicy:              OK
      Size: 147
    mkCandidatePermissionMintingPolicy (serialized): OK
      Remaining headroom: 49
    mkCommitteeHashPolicy:                           OK
      Size: 400
    mkCommitteeHashPolicy (serialized):              OK
      Size: 2853
    mkUpdateCommitteeHashValidator:                  OK
      Remaining headroom: 31
    mkUpdateCommitteeHashValidator (serialized):     OK (0.01s)
      Remaining headroom: 100
    mkCheckpointValidator:                           OK
      Remaining headroom: 62
    mkCheckpointValidator (serialized):              OK
      Remaining headroom: 128
    mkCheckpointPolicy:                              OK
      Size: 400
    mkCheckpointPolicy (serialized):                 OK
      Size: 2853
  Distributed set
    mkInsertValidator:                               OK
      Remaining headroom: 29
    mkInsertValidator (serialized):                  OK
      Remaining headroom: 40
    mkDsConfPolicy:                                  OK
      Size: 457
    mkDsConfPolicy (serialized):                     OK
      Size: 2884
    mkDsKeyPolicy:                                   OK
      Size: 1228
    mkDsKeyPolicy (serialized):                      OK
      Remaining headroom: 40

Original comment from: @kozross

However, keeping in mind both current and future needs (readability, maintenance, stability), there's a few ways we can roll out these improvements. I'll list them below, along with my thoughts.

Option 1: the painful manual way

This is essentially what is currently on my branch. This involves some pretty repetitive, low-level and frankly unidiomatic (even by Plutus standards) code. While I can certainly explain how to do this kind of work (and it's pretty mechanical), it's definitely not fun, or readable. Pros of this approach: it's about as explicit as it gets (everything's right there). Cons of this approach: it's not great for readability (we'd need a writeup explaining this and the decisions around it), it's a pain to maintain (same reason) and if we ever decide it needs changing or there's more improvements to be had, we have to fix every single instance. I don't recommend this approach.

Option 2: TH that we control

Essentially, this involves writing makeIsDataProduct or something similar, which effectively generates the same code we'd get with Option 1. We'd have control over this derivation, and while writing it is a pain, it's a pain we have to experience once. Furthermore, unless you deeply care about this, it's not something you have to understand if you just want Data instances. Lastly, because we're in control, a Plutus update can't pull the rug out from under our feet. Pros of this approach: no worse than what we do currently, we control it for stability, optimization can be done in one place instead of every instance. Cons of this approach: TH is a royal pain to write and maintain. I'm a cautious fan of this approach.

Option 3: helper functions

Essentially, this would involve writing functions like this for all product arities we have (up to 6 at the moment):

{-# INLINE productToData2 #-}
productToData2 :: forall (a :: Type) (b :: Type) . (ToData a, ToData b) => a -> b -> BuiltinData

{-# INLINE productFromData2 #-}
productFromData2 :: forall (a :: Type) (b :: Type) (c :: Type) . (FromData a, FromData b) => BuiltinData -> (a -> b -> Maybe c) -> Maybe c

{-# INLINE productUnsafeFromData2 #-}
productUnsafeFromData2 :: forall (a :: Type) (b :: Type) (c :: Type) . (UnsafeFromData a, UnsafeFromData b) => BuiltinData -> (a -> b -> c) -> c

Then, we would define instances for our types like so:

data Foo (a :: Type) = Foo Integer a

instance (ToData a) => ToData (Foo a) where
    {-# INLINEABLE toBuiltinData #-}
    toBuiltinData (Foo x y) = productToData2 x y

instance (FromData a) => FromData (Foo a) where
    {-# INLINEABLE fromBuiltinData #-}
    fromBuiltinData dat = productFromData2 dat (\x y -> Just (Foo x y))

instance (UnsafeFromData a) => UnsafeFromData (Foo a) where
    {-# INLINEABLE unsafeFromBuiltinData #-}
    unsafeFromBuiltinData dat = productUnsafeFromData2 dat Foo
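The signatures above leave the helper bodies unspecified. One possible shape for the arity-2 helpers is sketched below, assuming a toy Data type and local stand-in classes that only mirror the PlutusTx names, so the block runs on its own. The on-chain versions would work on BuiltinData through PlutusTx's builtin list operations rather than by pattern matching, and would keep the INLINE pragmas.

```haskell
import Data.ByteString (ByteString)

-- Toy stand-in for Plutus's Data type (assumption, not the real type).
data Data = Constr Integer [Data] | List [Data] | I Integer | B ByteString
  deriving (Eq, Show)

-- Local stand-ins for the PlutusTx classes of the same names.
class ToData a where toBuiltinData :: a -> Data
class FromData a where fromBuiltinData :: Data -> Maybe a
class UnsafeFromData a where unsafeFromBuiltinData :: Data -> a

instance ToData Integer where
  toBuiltinData = I

instance FromData Integer where
  fromBuiltinData (I n) = Just n
  fromBuiltinData _ = Nothing

instance UnsafeFromData Integer where
  unsafeFromBuiltinData (I n) = n
  unsafeFromBuiltinData _ = error "Integer: expected I"

-- Encode two fields as a plain two-element list, with no constructor tag.
productToData2 :: (ToData a, ToData b) => a -> b -> Data
productToData2 a b = List [toBuiltinData a, toBuiltinData b]

-- Decode both fields, then hand them to the continuation; any shape or
-- field mismatch yields Nothing.
productFromData2 ::
  (FromData a, FromData b) => Data -> (a -> b -> Maybe c) -> Maybe c
productFromData2 (List [da, db]) k = do
  a <- fromBuiltinData da
  b <- fromBuiltinData db
  k a b
productFromData2 _ _ = Nothing

-- Same as above, but errors out instead of returning Nothing.
productUnsafeFromData2 ::
  (UnsafeFromData a, UnsafeFromData b) => Data -> (a -> b -> c) -> c
productUnsafeFromData2 (List [da, db]) k =
  k (unsafeFromBuiltinData da) (unsafeFromBuiltinData db)
productUnsafeFromData2 _ _ =
  error "productUnsafeFromData2: expected a two-element list"

data Foo a = Foo Integer a
  deriving (Eq, Show)

instance ToData a => ToData (Foo a) where
  toBuiltinData (Foo x y) = productToData2 x y

instance FromData a => FromData (Foo a) where
  fromBuiltinData dat = productFromData2 dat (\x y -> Just (Foo x y))

instance UnsafeFromData a => UnsafeFromData (Foo a) where
  unsafeFromBuiltinData dat = productUnsafeFromData2 dat Foo
```

The continuation argument is what lets a single helper per arity serve every product type: the instance supplies only the constructor, and in the FromData case the continuation can also reject values that decode but violate an invariant.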

Pros of this approach: no TH as in Option 2, no awful soup as in Option 1, fairly explicit, not too much maintenance (only change a fixed number of functions, not every instance), the easiest to implement out of the three. Cons of this approach: in theory, this should all inline away, but in practice, we can't be sure until we try; it's also still fairly repetitive. I'd be OK with this.
