Encrypting "contest data" (HMAC KDF)

danwallach commented 2 years ago

Sec. 3.3.3 (p. 20) has a conventional encryption of a contest, encoded as a byte array, and using of HMAC KDF. We need to implement this, and also verify that we're consistent with the KDF spec. (Or, verify that EG-Python and C++ are willing to abandon their non-compliant KDF.)

JohnLCaron commented 2 years ago

We could use a structured serialization (eg protobuf) for contest_data, then encrypt that. That gives us a uniform way to document the fields, as well as forward and backward compatibility as these fields evolve.

JohnLCaron commented 2 years ago

strawman proto for contest data

message ContestData {
  repeated uint32 overvotes = 1;  // list of selection sequence_number for this contest
  bool under_vote = 2;
  bool null_vote = 3;
  repeated uint32 write_ins = 4; // list of write_in ids
}

because protos are store efficiently, theres a good chance that only one 32 byte block is ever needed, as long as the variable write_in info is stored seperately.

So: encode proto into byte array. right pad to 32 bytes. HMAC encode that, and store into EncryptedBallotContest.

should do some test to see when 32 bytes is exceeded.

JohnLCaron commented 2 years ago

write variable-length write-in candidate information to a separate file (that's also part of the election record). The encrypted ballot stores a reference to that.

problems:

ballots not self contained
parallel processing hard(er)

danwallach commented 2 years ago

My initial idea is that the contest data encodes the original intent of the voter, before it got interpreted into a well-formed ciphertext. The fact that it was treated as an overvote is something that you wouldn't know until you looked, and then it would hopefully be obvious, so I don't think we need a flag for it.

So, my idea, if we were doing it as a protobuf, would be something like:

message ContestData {
    repeated uint8 selections = 1; // list of selection choices, in sequence order. 0 = no vote, 1 = vote, 2+ only with range/cumulative voting
    repeated string writeins = 2; // list of strings for all write-in positions, all fixed length (= 32)
}

Separately, we've got a problem with whether protobufs fixed length. What if those uint8 values encode to different numbers of bits as part of the protobuf compression? Certainly, a string of all whitespace will compress more than a string with real data in it.

We might need to define our own "encoding" rather than protobufs, specifically to ensure the output is a fixed number of bytes.

JohnLCaron commented 2 years ago

protobufs are (almost) always variable length. my thought is to zero pad them to 32 bytes.

danwallach commented 2 years ago

I'm worried about boundary conditions. If 0 or 1 is enough to nudge across a 32-byte boundary, then we leak information.

JohnLCaron commented 2 years ago

Yeah, need to study that some more. But the bottom line is we probably need a fixed length for any encoding. So what do you do if it gets exceeded in the various encodings?

danwallach commented 2 years ago

For everything but the write-in parts, we can define something that's clearly fixed length.

For the write-ins, I think we need to:

Define a maximum number of characters (after that, we truncate)
Define how we convert strings to bytes (UTF-8 is variable length; UTF-32 is always exactly 4 bytes per character)

JohnLCaron commented 2 years ago

Heres my current favorite variation. The idea is that the write_ins are the hashCode of the write-in string, and the strings are stored separately. the size is approx 2*n_overvotes + 4*n_writeins bytes

if the number of blocks really must be fixed, 32 or 64 bytes is reasonable, and you need an algorithm for what to do when thats exceeded.

message ContestData {
  enum Vote {
    normal = 0;
    null_vote = 1;
    under_vote = 2;
  }
  Vote vote = 1;
  repeated uint32 over_votes = 2;  // list of selection sequence_number for this contest
  repeated fixed32 write_ins = 3; // list of write_in string hashcodes
}

bytes  ContestData
0 = ContestDataPojo(overvotes=[], writeIns=[], voteEnum=ContestData.Vote.normal(value=0))
2 = ContestDataPojo(overvotes=[], writeIns=[], voteEnum=ContestData.Vote.null_vote(value=1))
2 = ContestDataPojo(overvotes=[], writeIns=[], voteEnum=ContestData.Vote.under_vote(value=2))
5 = ContestDataPojo(overvotes=[1, 2, 3], writeIns=[], voteEnum=ContestData.Vote.normal(value=0))
6 = ContestDataPojo(overvotes=[1, 2, 3, 4], writeIns=[], voteEnum=ContestData.Vote.normal(value=0))
7 = ContestDataPojo(overvotes=[111, 211, 311], writeIns=[], voteEnum=ContestData.Vote.normal(value=0))
9 = ContestDataPojo(overvotes=[111, 211, 311, 411], writeIns=[], voteEnum=ContestData.Vote.normal(value=0))
11 = ContestDataPojo(overvotes=[111, 211, 311, 411, 511], writeIns=[], voteEnum=ContestData.Vote.normal(value=0))
24 = ContestDataPojo(overvotes=[1, 2, 3, 4], writeIns=[1, 2, 3, 4], voteEnum=ContestData.Vote.normal(value=0))
24 = ContestDataPojo(overvotes=[1, 2, 3, 4], writeIns=[100, 200, 300, 400], voteEnum=ContestData.Vote.normal(value=0))
24 = ContestDataPojo(overvotes=[1, 2, 3, 4], writeIns=[1000, 2000, 3000, 4000], voteEnum=ContestData.Vote.normal(value=0))
24 = ContestDataPojo(overvotes=[1, 2, 3, 4], writeIns=[10000, 20000, 30000, 40000], voteEnum=ContestData.Vote.normal(value=0))
24 = ContestDataPojo(overvotes=[1, 2, 3, 4], writeIns=[100000, 200000, 300000, 400000], voteEnum=ContestData.Vote.normal(value=0))
24 = ContestDataPojo(overvotes=[1, 2, 3, 4], writeIns=[1000000, 2000000, 3000000, 4000000], voteEnum=ContestData.Vote.normal(value=0))
28 = ContestDataPojo(overvotes=[1, 2, 3, 4], writeIns=[1000000, 2000000, 3000000, 4000000, 5000000], voteEnum=ContestData.Vote.normal(value=0))
28 = ContestDataPojo(overvotes=[1, 2, 3, 4], writeIns=[1000000, -2000000, 3000000, 4000000, 5000000], voteEnum=ContestData.Vote.normal(value=0))
28 = ContestDataPojo(overvotes=[1, 2, 3, 4], writeIns=[1958013297, -1007761232, 1587084778, 513679497, -865692041], voteEnum=ContestData.Vote.normal(value=0))

JohnLCaron commented 2 years ago

making it variable by contest would allow it to almost always be 1 block. For complicated contests can use more.

by that reasoning, say if there can only be 1 valid write-in, you could make the default say 2 blocks, and store the write-in internally, truncating it if needed.

message ContestData {
  enum Vote {
    normal = 0;
    null_vote = 1;
    under_vote = 2;
  }
  Vote vote = 1;
  repeated uint32 over_votes = 2;  // list of selection sequence_number for this contest
  repeated string write_ins = 3; // list of write_in strings
}

bytes  ContestData
0 = ContestDataStrings(overvotes=[], writeIns=[], voteEnum=ContestData.Vote.normal(value=0))
2 = ContestDataStrings(overvotes=[], writeIns=[], voteEnum=ContestData.Vote.null_vote(value=1))
2 = ContestDataStrings(overvotes=[], writeIns=[], voteEnum=ContestData.Vote.under_vote(value=2))
5 = ContestDataStrings(overvotes=[1, 2, 3], writeIns=[], voteEnum=ContestData.Vote.normal(value=0))
6 = ContestDataStrings(overvotes=[1, 2, 3, 4], writeIns=[], voteEnum=ContestData.Vote.normal(value=0))
7 = ContestDataStrings(overvotes=[111, 211, 311], writeIns=[], voteEnum=ContestData.Vote.normal(value=0))
9 = ContestDataStrings(overvotes=[111, 211, 311, 411], writeIns=[], voteEnum=ContestData.Vote.normal(value=0))
11 = ContestDataStrings(overvotes=[111, 211, 311, 411, 511], writeIns=[], voteEnum=ContestData.Vote.normal(value=0))
16 = ContestDataStrings(overvotes=[1, 2, 3, 4], writeIns=[a string], voteEnum=ContestData.Vote.normal(value=0))
22 = ContestDataStrings(overvotes=[1, 2, 3, 4], writeIns=[a long string ], voteEnum=ContestData.Vote.normal(value=0))
37 = ContestDataStrings(overvotes=[1, 2, 3, 4], writeIns=[a longer longer longer string], voteEnum=ContestData.Vote.normal(value=0))
184 = ContestDataStrings(overvotes=[1, 2, 3, 4], writeIns=[1000000, a string, a long string , a longer longer longer string, 01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789], voteEnum=ContestData.Vote.normal(value=0))

JohnLCaron commented 2 years ago

In case its not clear, the message is right padded to a multiple of 32, then run through HashedElGamalCiphertext.

danwallach commented 2 years ago

I would prefer the ContestData to resemble a PlaintextContest as much as possible. Really, it should be a fixed-length serialization of a PlaintextContest.

Fixed length is harder than just padding, because if the length is right on a padding boundary, then changes to the contest will reflect in the length.

For contests with no write-ins, this is simple. We can use an array of uint8 or something similar. For contests with write-ins, we need one or multiple write-in slots where the slots are fixed size and truncate any overages.

If we define the length as being fixed, then the existing HashedElGamalCiphertext class will do everything else that we need, including padding.

JohnLCaron commented 2 years ago

Ive added the current HashedElGamalCiphertext serialize and deserialize to understand its effect. it adds an additional overhead of ~560 bytes to each encrypted contest, even if there is no contest_data. So this will increase the encrypted ballot size by ncontests * (560 + length of contest data).

JohnLCaron commented 2 years ago

Interesting, so you have an encrypted version of the original plaintext ballot in the encrypted ballot? That would certainly make testing easier. Is it for RLA?

danwallach commented 2 years ago

For RLAs, definitely. It's useful to commit to the entire PlaintextContest.

If we're in the relatively rare circumstance when a write-in candidate has lots of votes, then the fancy crypto only tells us the total number of write-ins, rather than who the write-ins are actually for. At that point, we need to decrypt... something.

This suggests two use cases: RLAs, where we want to decrypt everything, and write-in victories, where we want to decrypt just the write-ins. Normally, doing this in a secure way requires a mixnet, which isn't part of ElectionGuard.

I'd really like to get Josh to weigh in here. This has a lot of subtly that we want to get right. Hmm.

benaloh commented 2 years ago

The only absolute requirement is that, for any given contest, the write-in field is fixed length. This length could be an ElectionGuard default -- which could be overwritten in the manifest of a particular election -- which could in turn be overridden for a particular contest within a particular election (again using the manifest).

I think that the idea of the plaintext that is encrypted being a hash of the write-in -- with a list of things that have been hashed maintained by the encrypting device -- is interesting and worthy of further consideration. We'd have to be certain that the list of hash pre-images does not reveal any information (beyond "the hash of this value appears at least once among the plaintexts that have been encrypted). At a minimum, the list of pre-images would have to be sorted to avoid revealing ordering info.

If we don't hash, we need clear rules for what happens if the write-in exceeds the buffer (e.g., truncation to the maximum length that doesn't exceed the buffer). Some devices may have their own limits and will never pass ElectionGuard a write-in beyond a certain length, but we should not count on that and must always know what we will do to ensure that the plaintext buffer length is never exceeded.

danwallach commented 2 years ago

So if we store the encrypt(hash(serialize(PlaintextContest))), that's a commitment, it's fixed length, and its unique (because there are sequence numbers or object-ids in the PlaintextContest).

This assumes an external path to recovery for the PlaintextContest, but otherwise pushes all the processing outside of the cryptographic mechanisms of ElectionGuard.

This certainly seems like a really minimal solution to the problem. Is it sufficient?

JohnLCaron commented 2 years ago

If we can bound the number of possible write-ins per contest (to the allowed number of votes?), then Im leaning back to the original idea of just embedding the write-in strings (truncating to some MAX_WRITE_IN), for simplicity. Then the length of the encrypted message is fixed for each contest, number of blocks is small and is dominated by the 560 byte overhead.

An obvious choice is b_D = allowed number of votes, usually 1. MAX_WRITE_IN = 30.

danwallach commented 2 years ago

There's a certain elegance to just hashing the PlaintextContest structure. The question is whether that works for a larger election. You'd like to be able to do a tally purely from the ciphertexts.

JohnLCaron commented 2 years ago

Seems kind of strange that one could tally from the contest-data ciphertext. We have all this mechanism where we individually encrypt each selection. Isnt that all redundant if we can just decrypt the contest-data ciphertext?

benaloh commented 2 years ago

Please remember that the encryption is randomized. So the encryptions of identical plaintexts will be different. Therefore we can't just tally the ciphertexts. We could, however, just tally the hashes.

danwallach commented 2 years ago

Tallying hashes is tricky because there will be different spellings.

benaloh commented 2 years ago

Not so much. We'll get 12 John Smiths and 7 Jon Smyths and it will be up to local rules and officials to determine whether or not these tallies should be combined.

danwallach commented 2 years ago

So, I'm worried we're not really clear on exactly the purpose being served by this field. Is it meant to:

Contain the entire PlaintextContest
Contain just the write-in candidate names
Commit to the entire PlaintextContest without actually containing it
Commit to something else

benaloh commented 2 years ago

There is a text field associated with each write-in option. (Note that a "select up to three" contest can have three separate write-in options with three separate write-in text fields.) Each write-in text field should be able to capture a single free-form write-in (or suitable representation thereof). This could be text, a hash of text, or even "registered write-in candidate #1".

danwallach commented 2 years ago

We need to decide which of those it actually is.

JohnLCaron commented 2 years ago

hashedElGamalEncrypt takes about 160 msecs / contest to encrypt, 7 msecs to decrypt. Not very sensitive to number of blocks.

As reference, on same architecture (single threaded), each contest selection takes ~ 5 msecs. So a contest with 4 selections takes 20 msecs, now that weve eliminated placeholders.

These are estimates based on quick testing with the current kotlin library.

JohnLCaron commented 2 years ago

Its surprising that hashedElGamalEncrypt time doesnt go up with the number of blocks (anyone think thats wrong?)

Assuming thats true, then one could put all contests into a single structure, and only do one hashedElGamalEncrypt per ballot, making sure that the number of blocks is always the same. That makes the extra time ~160 msecs / ballot instead of 160 msecs / contest.

I made a quick test and it seems to work.

danwallach commented 2 years ago

160 msec seems higher than I'd expect. The most expensive operation is clearly the modular exponentiation, which is a fixed cost. After that, we're just running HMAC-SHA2-256 a bunch of times, and that's cheap, which explains why the computation time is largely independent of the length of the message being encrypted.

JohnLCaron commented 2 years ago

Theres also something funny where the first time you run it, its ~600 msecs, then after that its ~160. Maybe initializing tables??

I will add an issue to examine this code closer to make sure we arent doing anything stupid.

JohnLCaron commented 2 years ago

@dwallach when you say above

"So if we store the encrypt(hash(serialize(PlaintextContest))), that's a commitment, ... "

what is hash()? what is encrypt()?

@benaloh , how would you "tally the hashes"? Is that the same hash as Dan is talking about?

danwallach commented 2 years ago

Theres also something funny where the first time you run it, its ~600 msecs, then after that its ~160. Maybe initializing tables??

That sounds like initializing the modular exponentiation acceleration tables. It's a one-time computation. To keep it out of any benchmark computation, just throw away your first result and do timing on everything else.

benaloh commented 2 years ago

As Dan has mentioned, the cost of hashed ElGamal is dominated by the cost of the modular exponentiation to set it up. Once that's done, additional blocks are cheap. I believe that for this reason, we actually just do one hashed ElGamal per contest. The plaintext for this one hashed ElGamal is the concatenation of the (fixed-sized) extended data text field and the (fixed-sized) write-in text field(s).

Tallying of hashes would be just like tallying of plaintext data. We decrypt and find 12 instances of the hash value "3A7D3C4F ...". Our table says that "John Smith" hashes to this value, so that's 12 votes for "John Smith".

danwallach commented 2 years ago

Story time: there was once a powerful politician named Tom DeLay, a member of Congress from the Houston suburbs. "The Hammer" they called him.

Yet scandal finally caught up with him and he chose to resign his seat rather than face near-certain defeat at the polls. DeLay made one last exercise of power, which was delaying his resignation until after the deadline for which a candidate might have registered for the general election. He wanted to hand-pick his successor.

This created a wild and crazy election, wherein voters on the second Tuesday of November, were asked to vote for a candidate to serve the remainder of DeLay's term (the "special election"), as well as a separate contest for who would serve the subsequent term (the "general election").

And this brings us to Shelley Sekula-Gibbs, who was the "chosen one" to succeed DeLay, but after a lawsuit, it was determined that she could not run as a listed candidate in the general election contest because of deadlines. Instead, she was forced to run as a write-in candidate in the general election, but she was allowed to be a listed candidate for the special election for the one month remaining in DeLay's term.

The Hart InterCivic machines used in Houston at the time did not allow for a dash for a write-in candidate.

The entirety of Sekula-Gibbs's advertising consisted of instructions on how to vote for her.

She won the special election, for which she was a listed candidate. She lost the general election, for which she was a write-in candidate. She served in Congress for one month.

The sum total of all write-in votes in that general election were not enough for her to have possibly won, so that was the end of that. Otherwise, there would have been an enormous battle over whether each and every misspelling of her hyphenated last name did or did not constitute a vote for her. I had grand plans to build a system that would look at edit distance and other such metrics relative to Texas's somewhat ambiguous "intent of the voter" standard.

The moral of this crazy story is that misspellings matter, since, and I'm not making this up, there was no way on the Hart InterCivic voting machine to spell Sekula-Gibbs's name correctly!

What this means: Hashing is useful as a way to commit to a full plaintext that's available through some other means, but it's not useful for tabulation, because you'd need to decide on all the valid misspellings first.

P.S. Our county clerk, in what could have been interpreted as a partisan act, tried to given an assist to Sekula-Gibbs by printing her name on a piece of paper that was placed inside every voting booth.

P.P.S. In Texas, you can only win a write-in contest if you're "registered" as a write-in. This suggests a user experience where "write-ins" are actually a drop-down menu of "registered write-ins", with an option for an "unregistered" free text field. Whenever I've brought this up as a possible design, I've been told it would never be allowed.

benaloh commented 2 years ago

None of this concerns us. The host device gives us a name for each write-in. The name could be normalized before we get it -- or not. Our job is just to tally the names we receive. If we receive different but similar names, they are still different names. We simply report (verifiably) how many times we saw each name. Tallies for similar names may be combined after we report them -- or not. We just tally what we get.

JohnLCaron commented 2 years ago

Funny story about not allowing dashes. Remember those IBM EBCDIC punched card writers that only used upper case letters? LIKE GRANDPA SHOUTING AT YOU.

I agree with Josh that we can just tally whatever is given us, and the election officials decide which should be merged.

Using a fixed length hash means our system doesnt have to truncate write-in names, and makes it easier to create fixed length contest data structures. At the cost of being a bit more complicated. But I dont see any big difference in functionality.

I found the problem with hashedElGamalEncrypt taking ~160 msecs, it was in the test code. Things look normal now.

JohnLCaron commented 2 years ago

Can one use the nonce to decrypt the hashedElGamalEncrypt ?

danwallach commented 2 years ago

Put that in as a feature issue. I'll do it after I finish the range proofs (currently up as a draft PR, but currently untested so unlikely to be correct).

JohnLCaron commented 2 years ago

Also, it looks to me that HashedElGamalCiphertext.numBytes == c1.size always, so we dont need to store it or serialize it. Do you agree?

danwallach commented 2 years ago

Agreed

JohnLCaron commented 2 years ago

Im about to send a PR with the following:

nblocks = votes_allowed + 1

The ContestDataProto is padded with a filler to make sure it is always the same length when serialized. The serialization is the hashedElGamalEncrypt message byte array.

When the ContestDataProto is too large, we trim it down by: 1) truncating any write-ins to 30 chars 2) throwing away write-ins if there are more than votes_allowed + 1 3) throw away extra over_votes if we still need more room.

I think this is a good first pass that almost never will be needed, and protects against overflowing when being sent lots of write-ins / over_votes.

This adds 560 + 64 bytes to each contest for the common case of votes_allowed = 1.

It would be easy enough to change it to be a single proto for all contests in the ballot, which would reduce size to 560 + ncontests * 64 per ballot, for the common case.

Could also experiment with storing just the hashes of the write-ins in the contests, and putting the full text write-ins into a separate file.

danwallach commented 2 years ago

First thought: protobufs aren't a good way to store something that fundamentally needs to be fixed length, because we don't want the size of the output to indicate anything at all. I think, for this specific case, that we need to hand-roll the data structure encoding.

JohnLCaron commented 2 years ago

I agree that rolling our own format to be sure its fixed length is worth considering. Protobuf inherently produces variable length records. Its also desirable to use Unicode, and UTF8 is the best way to do that, which is also variable length.

OTOH, I think I have something working, and Ive just added property tests to be sure we always produce fixed length records. It truncates only as much as is needed to fit the message size, and adds a filler field to even out the size. In the common case (a small number of write-ins and overvotes), no truncation will be needed.

Making everything into fixed length records risks unnecessarily truncating strings and limiting max number of fields. UTF-16 is less common to work with, and doubles the amount of space most strings will use. Documenting a bespoke format that probably wants to evolve in the future is a lot of work.

Im not convinced using protobufs is the right thing to do, but we can use it for now and explore a hand-rolled alternative if desired. The nice thing is that we can switch it out without changing any of the crypto.

danwallach commented 2 years ago

FYI: Just added code to support HashedElGamalCiphertext.decryptWithNonce(), which I saw mentioned earlier here.

Ultimately, we need to know whether Microsoft wants to leave the interpretation of the this contest data field up to the implementation (seems unlikely) or whether they want to specify it in more detail.

votingworks / electionguard-kotlin-multiplatform

Encrypting "contest data" (HMAC KDF) #155