hkpori / popvote4-spec

Specification of POPVote 4
Apache License 2.0
98 stars 7 forks source link

The "Voter ID" is susceptable to Birthday Paradox, causing 0.01% of valid votes to be mistreated as duplicates #3

Open kennytm opened 4 years ago

kennytm commented 4 years ago

As explained in #2, the Voter ID only has 31.0 bits of entropy. If the Voter IDs are random and uniform, when there are more than 231.0 / 2 = 45145 voters, there would be ≥50% chance of two distinct voters sharing the same ID.

In fact, given 600,000 unique voters, there will be 83 ± 10 votes mistreated as duplicates like this.

suesser commented 4 years ago

I agree with this conclusion.

The fact that the date of changing HKID for some people is heavily distributed within a small range of date makes the case even worse.

ronadihkg commented 4 years ago

For the date part, we can further exclude all public holidays (17 + all Sunday) when the immigration department is closed. So this part provides only log_2((365.25 - 17) x (6 / 7) × (2020 − 2003 + 1)) = 12.39 bits of entropy.

Plugging it into the equations, the entropy is only 30.63 bits. When there are more than 40764 voters, there would be ≥50% chance of two distinct voters sharing the same ID.

kcchu commented 4 years ago

I am the author of the cryptography specification. Thanks for your comment.

The collision problem was known during the design and we have done careful simulation and analysis on it. Your calculation align with what we did, and indeed we used a more pessimistic model that all HKID were replaced within one year.

The problem here is a trade-off between maximizing privacy and minimizing collision. Because we considered voter privacy as one of the highest priority of the system, we chose to accept a very low error rate caused by this issue as statistical error, so that we can minimize the amount of personal information that we collect from voter.

I am happy to keep this thread for the discussion on improvement of the design for the future.

ronadihkg commented 4 years ago

I understand that if you mark too many personal information in a paper-based vote (to check for duplicated votes) would be quite suspicious to the voters. So it is a hard problem to have a balance...

Using 31.0 bit of entropy will cause an "error rate" ~83 votes for 600,000 voters. However, using 30.63 bit of entropy will cause an "error rate" ~216 votes for 600,000 voters. Currently, the smallest margin in the result is 120 in Kowloon west. Luckily we only have 81,000 voters there... It will be problematic if this case happens in 區議會(第二)功能界別.

(Also if you use the pessimistic model that all HKID were replaced within one year, the error rate is 3900 for 600,000 voters)