Test Vectors (TreeMath, Encryption)

I've started implementing test vectors for openmls (https://github.com/openmls/openmls/pull/299) (only tree math and encryption for now). I generally like the approach in test-vectors.md. But my versions differ slightly from the ones defined in test-vectors.md. Let's try to find a common ground that works for all implementations.

@bifurcation let me know what you think.

TreeMath

struct {
  uint32 n_leaves;
  optional<uint32> root;
  optional<uint32> left<0..2^32-1>;
  optional<uint32> right<0..2^32-1>;
  optional<uint32> parent<0..2^32-1>;
  optional<uint32> sibling<0..2^32-1>;
} TreeMathTestVector;

Only include the one root for the tree of n_leaves leaves.
Make all computed values optional because the computation for all of them might fail. Add None/null for the invalid cases.

There's a sample file here.

Encryption

 struct {
   opaque key<0..255>;
   opaque nonce<0..255>;
 } KeyAndNonce;

 struct {
   opaque ciphertext<0..2^32-1>;
   KeyAndNonce secret;
 } SenderDataInfo;

 struct {
   opaque plaintext<0..2^32-1>;
   opaque ciphertext<0..2^32-1>;
 } Message;

 struct {
   uint32 generations;
   KeyAndNonce handshake_keys<0..2^32-1>;
   KeyAndNonce application_keys<0..2^32-1>;
   SenderDataInfo sender_data_info<0..2^32-1>;
   Message messages<0..2^32-1>;
 } LeafSequence;

 struct {
   uint16 cipher_suite;
   uint32 n_leaves;
   opaque encryption_secret<0..255>;
   opaque sender_data_secret<0..255>;

   LeafSequence leafs<0..2^32-1>;
 } EncryptionTestVector;

This has a couple more differences.

Encrypt a message every generation (either with the application or the handshake key).
Add an additional sender_data_info check in order to test the sender data keys separately. This is mostly because I currently can't access them in openmls from the test. But I think because it's key material it's worth testing in any case.
I moved the keys and messages into a separate struct to make the test vector more legible.
I replaced the CryptoValue to make it a little more concise.

There's a sample file here.

Representation

Instead of TLS representation I used JSON. I think having readable test vectors is a good thing because you don't need to run a (potentially faulty) parser on them when debugging issues. If there's a good reason to have TLS encoded test vectors I could add it to openmls in addition to the JSON version.

The reasoning behind using the TLS syntax was that MLS implementations are already required to serialize/deserialize TLS-encoded structs. If you're really keen on JSON, I could probably be persuaded. If we were going to do that, though, I would suggest:

Representing binary data as a hex-encoded string (since that is more readable than an array/base64)
Making the sample files pretty-formatted (with indentation, etc.)

If we do JSON, we can probably also make the syntax spec more informal. There's no need to specify vector headers, for example. (JSON schema exists, but I'm not inclined to use it.) So we could have something like the following:

// TreeMathTestVector
{
  "n_leaves": /* uint32 */,
  "root": [ /* array of uint32 */ ],
  "left": [ /* array of uint32 */ ],
  "right": [ /* array of uint32 */ ],
  "parent": [ /* array of uint32 */ ],
  "sibling": [ /* array of uint32 */ ]
}

// EncryptionTestVector
{
  "cipher_suite": /* uint16 */,
  "n_leaves": /* uint32 */,
  "encryption_secret": /* hex-encoded binary data */,
  "sender_data_secret": /* hex-encoded binary data */,
  "leaves": /* array of LeafSequence objects */
}

On the specific changes:

Only include the one root for the tree of n_leaves leaves.

OK with me, just seems like you have to generate more tests to get the same coverage.

Make all computed values optional because the computation for all of them might fail. Add None/null for the invalid cases.

These methods should all be infallible, and there are no invalid cases. What failure cases do you have in mind?

Encrypt a message every generation (either with the application or the handshake key).

Is your intent that the plaintext and ciphertext fields contain encoded MLSPlaintext and MLSCiphertext objects?

Add an additional sender_data_info check in order to test the sender data keys separately. This is mostly because I currently can't access them in openmls from the test. But I think because it's key material it's worth testing in any case.

I agree that it makes sense to test sender data key generation separately. I would suggest we just do it once, though, at the top level of EncryptionTestVector. By adding sender_data_ciphertext, sender_data_key, and sender_data_nonce values there. (Or one field wrapping those three into a struct)

I moved the keys and messages into a separate struct to make the test vector more legible.

Huh, thought we had this before. Ok.

I replaced the CryptoValue to make it a little more concise.

Sure. If we go with JSON, it can get yet more concise.

The reasoning behind using the TLS syntax was that MLS implementations are already required to serialize/deserialize TLS-encoded structs. If you're really keen on JSON, I could probably be persuaded. If we were going to do that, though, I would suggest:

Yeah so I heard. I agree that TLS syntax would be more concise and it's sort of there already. But especially for large test vectors I find JSON way easier to handle. I'd still use the TLS serialization for MLS primitives (as done in the encryption test vector proposal) but make the test vector itself JSON. This way the MLS code and its serialization doesn't need to change and JSON is a dev-only dependency.

1. Representing binary data as a hex-encoded string (since that is more readable than an array/base64)

👍🏻 I agree that would be more readable.

2. Making the sample files pretty-formatted (with indentation, etc.)

With the change in 1 we can make the JSON files pretty (the byte arrays make formatted JSON ugly). serde_json outputs pretty printed JSON directly so this would be a one line change for me. Pretty printing blows up the file size though.

If we do JSON, we can probably also make the syntax spec more informal. There's no need to specify vector headers, for example. (JSON schema exists, but I'm not inclined to use it.) So we could have something like the following:
// TreeMathTestVector
{
  "n_leaves": /* uint32 */,
  "root": [ /* array of uint32 */ ],
  "left": [ /* array of uint32 */ ],
  "right": [ /* array of uint32 */ ],
  "parent": [ /* array of uint32 */ ],
  "sibling": [ /* array of uint32 */ ]
}

// EncryptionTestVector
{
  "cipher_suite": /* uint16 */,
  "n_leaves": /* uint32 */,
  "encryption_secret": /* hex-encoded binary data */,
  "sender_data_secret": /* hex-encoded binary data */,
  "leaves": /* array of LeafSequence objects */
}

I agree we shouldn't go for a JSON schema. I wouldn't mind a less strict definition of the test vectors.

On the specific changes:

Only include the one root for the tree of n_leaves leaves.

OK with me, just seems like you have to generate more tests to get the same coverage.

I revisited this. There was an oddity in our code that made me thing this would be bad. Let's go with a vector of roots as you proposed.

Make all computed values optional because the computation for all of them might fail. Add None/null for the invalid cases.

These methods should all be infallible, and there are no invalid cases. What failure cases do you have in mind?

Not all of them are infallible. I unified them to be all optional but actually only some of them are:

left and right fail if the index is a child. From the way I read the test vector description right now this actually happens ("left[i] is the node index of the left child of the node with index i in a tree with n_leaves leaves")
parent and siblings fail for the root node. Again a case I think happens with the current description (happens in my test cases)

Encrypt a message every generation (either with the application or the handshake key).

Is your intent that the plaintext and ciphertext fields contain encoded MLSPlaintext and MLSCiphertext objects?

Yes, that's the idea. As I said above. I think every MLS object should be TLS encoded (no additional code needed).

Add an additional sender_data_info check in order to test the sender data keys separately. This is mostly because I currently can't access them in openmls from the test. But I think because it's key material it's worth testing in any case.

I agree that it makes sense to test sender data key generation separately. I would suggest we just do it once, though, at the top level of EncryptionTestVector. By adding sender_data_ciphertext, sender_data_key, and sender_data_nonce values there. (Or one field wrapping those three into a struct)

👍🏻 agreed. I added a wrapping struct.

I updated the test vector files in openmls:

with the following structure

{
    "cipher_suite": /* uint16 */,
    "n_leaves": /* uint32 */,
    "encryption_secret": /* hex-encoded binary data */,
    "sender_data_secret": /* hex-encoded binary data */,
    "sender_data_info": {
        "ciphertext": /* hex-encoded binary data */,
        "secrets": {
            "key": /* hex-encoded binary data */,
            "nonce": /* hex-encoded binary data */,
        },
    },
    "leaves": [
        {
            "generations": /* uint32 */,
            "handshake_keys": [ /* array with `generations` handshake keys and nonces */
                {
                    "key": /* hex-encoded binary data */,
                    "nonce": /* hex-encoded binary data */,
                },
                ...
            ],
            "application_keys": [ /* array with `generations` application keys and nonces */
                {
                    "key": /* hex-encoded binary data */,
                    "nonce": /* hex-encoded binary data */,
                },
                ...
            ],
            "messages": [
                /* array with `generations` TLS encoded MLSPlaintext/MLSCiphertext pairs. */
                {
                    "plaintext": /* hex-encoded binary data */,
                    "ciphertext": /* hex-encoded binary data */,
                },
                ...
            ],
        }
    ]
}

{
    "cipher_suite": /* uint16 */,
    "root": [ /* array of uint32 */ ],
    "left": [ /* array of option<uint32> */ ],
    "right": [ /* array of option<uint32> */ ],
    "parent": [ /* array of option<uint32> */ ],
    "sibling": [ /* array of option<uint32> */ ]
}

@franziskuskiefer sounds like we're pretty much in agreement. Could you write a PR to update the spec? I can get the mlspp vector generation/processing updated then.

One thing that would be good to clarify in that PR: When you say option<uint32>, does that mean that the value can be either a JSON number or null?

mlswg / mls-implementations