using shuffled cards loses entropy

dooglus commented 7 years ago

A recent reddit post caused me to notice how you're generating entropy from cards, and how it is throwing away some entropy.

The last 12 or so cards in a full 52 card shuffled deck have no effect on the generated seed because of this code:

// Calculate the new value by scaling the original value down
var withoutReplace = entropyInt.multiply(maxWithoutReplace).divide(maxWithReplace);

The entropyInt is made by spacing all the individual cards out equally (log(52) bits per card), but the 'scaling' will effectively chop off the end of the string, allowing only the first N cards to have any effect on the outcome. Rather than scaling like this, where only the leading bits affect the outcome, I would use an SHA-2 hash of all the bits and take the first N bits of the result, so that all the input bits affect the outcome.

iancoleman commented 7 years ago

Thanks for this.

To better understand how the raw entropy is expected to work, I made a table of how the bits should be distributed as cards are consumed. It shows with 0-2 remaining cards there should be a 21 word mnemonic, with 3-13 remaining cards there should be an 18 word mnemonic. Initially I thought this would account for the behavior because of the small amount of total entropy added by the last 12 cards, but it's clear there's still an error with the 'scaling' process as currently implemented.

I'm going to fix the scaling process, but using cards + raw entropy will still have a 'thinning' of bits at the end of the deck.

I concede that using the raw entropy to generate mnemonics often gives unexpected results for users, and possibly shouldn't be the default selection. But having it available is good for compatibility with bip32jp.github.io, and I think having alternative interoperable implementations is important for the ecosystem.

Changes to make:

Fix scaling of entropy from cards
Change default Mnemonic Length value to 15 words

Cards Used	Cards Remaining	Bits Contributed	Bits Accumulated	Bits Remaining	Words
0	52	5.70	0.00	225.58	0
1	51	5.67	5.70	219.88	0
2	50	5.64	11.37	214.21	0
3	49	5.61	17.02	208.56	0
4	48	5.58	22.63	202.95	0
5	47	5.55	28.22	197.36	0
6	46	5.52	33.77	191.81	3
7	45	5.49	39.29	186.29	3
8	44	5.46	44.79	180.79	3
9	43	5.43	50.25	175.34	3
10	42	5.39	55.67	169.91	3
11	41	5.36	61.06	164.52	3
12	40	5.32	66.42	159.16	6
13	39	5.29	71.74	153.84	6
14	38	5.25	77.03	148.55	6
15	37	5.21	82.28	143.30	6
16	36	5.17	87.49	138.09	6
17	35	5.13	92.66	132.92	6
18	34	5.09	97.79	127.80	9
19	33	5.04	102.87	122.71	9
20	32	5.00	107.92	117.66	9
21	31	4.95	112.92	112.66	9
22	30	4.91	117.87	107.71	9
23	29	4.86	122.78	102.80	9
24	28	4.81	127.64	97.94	9
25	27	4.75	132.44	93.14	12
26	26	4.70	137.20	88.38	12
27	25	4.64	141.90	83.68	12
28	24	4.58	146.54	79.04	12
29	23	4.52	151.13	74.45	12
30	22	4.46	155.65	69.93	12
31	21	4.39	160.11	65.47	15
32	20	4.32	164.50	61.08	15
33	19	4.25	168.83	56.76	15
34	18	4.17	173.07	52.51	15
35	17	4.09	177.24	48.34	15
36	16	4.00	181.33	44.25	15
37	15	3.91	185.33	40.25	15
38	14	3.81	189.24	36.34	15
39	13	3.70	193.05	32.54	18
40	12	3.58	196.75	28.84	18
41	11	3.46	200.33	25.25	18
42	10	3.32	203.79	21.79	18
43	9	3.17	207.11	18.47	18
44	8	3.00	210.28	15.30	18
45	7	2.81	213.28	12.30	18
46	6	2.58	216.09	9.49	18
47	5	2.32	218.67	6.91	18
48	4	2.00	221.00	4.58	18
49	3	1.58	223.00	2.58	18
50	2	1.00	224.58	1.00	21
51	1	0.00	225.58	0.00	21
52	0	0.00	225.58	0.00	21

iancoleman commented 7 years ago

See

https://github.com/iancoleman/bip39/commit/886f06ee6bbca2e6e05109dbd7e264bdea35238f - Card entropy calculation bugfix
https://github.com/iancoleman/bip39/commit/55b592b498a57465b91bd8402d0d91324cc34cf9 - Entropy defaults to generating a 15 word mnemonic

The fix was to scale each event individually, rather than concatenate them and then scale. So each event contributes the correct number of bits as indicated in the table above, rather than all contributing 5.7 bits and then being scaled uniformly.

dooglus commented 7 years ago

I've not had a chance to fully understand your changes yet, but I'm hoping you can clarify a couple of things:

Are you assuming that the total number of decks I have shuffled is ceiling(cards/52)? Because I could shuffle 8 decks together and only pick 52 cards, but you would set totalDecks to 1 in that case:

var totalDecks = Math.ceil(base.parts.length / 52);
What is remainingDecks for? It seems to me like it will always be zero:

var remainingDecks = Math.floor(totalRemainingCards / 52);

iancoleman commented 7 years ago

Total decks is calculated incorrectly, and would be better calculated by looking for max(duplicates). There's currently a small discrepancy between calculating total number of bits and checking how much entropy each individual event adds.

A good test would be to use AS AS AS AS; by the first method it would calculate log2(52!/48!) = 22.63 bits of entropy; by the second method it would calculate 4*log2(52) = 22.8 bits of entropy. This discrepancy is rarely noticed because of flooring, but may create issues so I will fix it. Use of a full deck (or using any multiple full decks) will not have this discrepancy.

I'll fix this it to correctly tally the number of bits on a per event basis.

I'm still unclear about one aspect: shuffling two decks together should have slightly more entropy than shuffling one deck twice in a row. But my code doesn't account for that. Any ideas how this might work? My probability theory isn't especially strong.
remainingDecks is used to calculate remainingCombos, which ultimately determines the numberOfBits. If there are 2 decks in use (104 cards) and only 5 cards have been removed, there is still one full deck and 47 cards remaining, hence the remainingDecks variable.

But there is an error here, since if the first two cards are AS and AS there are only 51 cards remaining from each deck. 51!51! combos remain (439.76 bits), not 52!50! (439.78 bits). Once again it's a tiny error which is almost never visible due to flooring, but should be corrected.

I need to change the calculation of numberOfBits to work better with multiple decks.

Note that this test for 2 decks passes - it has 451 bits, ie 2*log2(52!).

As an aside, a single deck provides more than enough entropy, and just using a single deck would greatly simplify the code. However, I plan to make the entropy library standalone so would like it to be as generic as possible for applications that may require more entropy than a single deck.

dooglus commented 7 years ago

It doesn't feel right to see you looking at the cards in turn to decide how many bits of entropy it provides. It seems to me that picking 10 cards from 4 decks shuffled together should give a certain amount of entropy, regardless of what those 10 cards happen to be, but you're looking at the cards and somehow deciding that the first 6 came from the first deck and the 2nd 6 came from the 2nd deck. I don't think that's a valid assumption to make.

I can't find a good way of calculating the amount of entropy when drawing N cards from M decks. I don't think you even know how many decks I'm drawing from. Just because max(dups) is 2 doesn't mean I didn't use 4 decks. I just didn't get more than 2 dups from those 4 decks. In fact I might even be drawing with replacement from a single deck, shuffling it after every draw. This would be the same as using an infinite number of decks shuffled together.

In general I think you're trying to solve a hard problem that doesn't really need solving. All this appears to be motivated by the requirement of compatibility with bip32jp.github.io, but it looks to me like that site doesn't even allow you to enter cards. So let's assume max(dups) decks, and conservatively assume 52! permutations for each of the floor(length/52) decks, and (length%52)! permutations for the leftovers. The binary representation can be the first log(permutations,2) bits of the sha-2 hash of the normalized deck string. Then you have a 1-1 correspondence between the binary representation and the mnemonic, compatible with bip32jp.

Am I missing something?

dooglus commented 7 years ago

To be clearer:

You are using linear scaling to convert an M bit number which you believe to have only N bits of entropy into an N bit number, but the entropy isn't all at the left hand edge of the M bit number, so you are losing some of the entropy. Your solution is to try to understand more clearly how the entropy is distributed, and to use that understand to improve your linear scaling algorithm.

I propose that using first_N_bits(SHA-2(M bit number)) instead of linear scaling will do what you're looking to do without loss of entropy, and without you having to know how the entropy is distributed throughout the M bit number.

You can then use that N bit number in just the same way as bip32jp uses it, to maintain compatibility.

iancoleman commented 7 years ago

It doesn't feel right to see you looking at the cards in turn to decide how many bits of entropy it provides.

Agreed. To rephrase it, a card sitting on the top of the deck provides a specific number of bits of entropy before being turned over. I agree with this, but don't know how to calculate that value. With a single deck it's easy to calculate, but with multiple decks, I'm unsure.

I think you're trying to solve a hard problem that doesn't really need solving. All this appears to be motivated by the requirement of compatibility with bip32jp.github.io

Yeah, I think this has gone too far down the rabbit-hole for what it's worth. Better to have a naive-but-clear solution than a complex-but-still-incorrect solution.

If anyone in the future with a better understanding of stats / probability can illustrate how to make this work as desired, then I'll implement it. But for now I'll make it good-enough with clearly separated concerns in case of future improvements.

Changes to make:

Amount of entropy added should not depend on the specific card drawn
totalDecks calculation should use max(dupes)
Hash the supplied card values and take the first N bits according to log2(permutations)
Structural changes to facilitate simpler upgrades in the future

Appreciate the clarity you've bought to this topic.

iancoleman commented 7 years ago

I've asked a question at https://crypto.stackexchange.com/questions/41886/entropy-measurement-from-shuffled-cards

iancoleman commented 7 years ago

I can't see any way to improve this feature. The current implementation using 'average entropy' is suitable and is based on the best information currently available.

I'll close this for now since it a) works and b) has no clear direction for potential improvement. Always happy to have it reopened if more information comes to light.

Great feature too, @dooglus; I'm very glad to have it available in the tool.

dooglus commented 7 years ago

I didn't receive notification of your commit (87ad2c6) which uses hashing instead of scaling to convert the cards to binary, and only just noticed the commit now. I'm happy to see you implemented my suggestion.

This comment and corresponding code concerns me:

    // If the number of bits is more than 256, multiple rounds of hashing
    // are used until the required number of bits is reached.

        for (var j=0; j<iterations; j++) {
            hashedCards = sjcl.hash.sha256.hash(hashedCards);
        }

You don't add any extra entropy by using multiple rounds of hashing. There are 2^256 possible outputs from sha256. If I need 512 bits of entropy, I can't take a text string, hash it once, create a 256 binary digit string, then hash it twice, append another 256 binary digit string, and expect that the output has 512 bits of entropy. While it's true that I'll have a 512 digit binary string, there are at most 2^256 different 512 digit binary strings I could end up with, and so the 512 digit binary string has at most 256 bits of entropy.

In other words, suppose normalizedCards has 512 bits of entropy.

Then sha256(normalizedCards) has 256 bits of entropy, and sha256(sha256(normalizedCards)) has the same 256 bits of entropy.

Concatenating those two outputs gives a string with 256 bits of entropy.

A solution is to use sha256(normalizedCards + ':0') and sha256(normalizedCards + ':1') (where + is string concatenation), since concatenating those two gives you an output with 512 bits of entropy.

Consider that once I know just sha256(normalizedCards) and don't know normalizedCards, I can trivially compute sha256(sha256(normalizedCards)). But if I know just sha256(normalizedCards + ':0') and don't know normalizedCards, then I have no idea of the value of sha256(normalizedCards + ':1').

dooglus commented 7 years ago

Maybe it's clearer in code:

diff --git a/src/js/entropy.js b/src/js/entropy.js
index 24fc21a..918f567 100644
--- a/src/js/entropy.js
+++ b/src/js/entropy.js
@@ -293,15 +293,12 @@ window.Entropy = new (function() {
         // Create a normalized string of the selected cards
         var normalizedCards = cards.join("").toUpperCase();
         // Convert to binary using the SHA256 hash of the normalized cards.
-        // If the number of bits is more than 256, multiple rounds of hashing
+        // If the number of bits is more than 256, multiple hashes
         // are used until the required number of bits is reached.
         var entropyBin = "";
         var iterations = 0;
         while (entropyBin.length < numberOfBits) {
-            var hashedCards = sjcl.hash.sha256.hash(normalizedCards);
-            for (var j=0; j<iterations; j++) {
-                hashedCards = sjcl.hash.sha256.hash(hashedCards);
-            }
+            var hashedCards = sjcl.hash.sha256.hash(normalizedCards + ':' + iterations);
             var hashHex = sjcl.codec.hex.fromBits(hashedCards);
             for (var i=0; i<hashHex.length; i++) {
                 var decimal = parseInt(hashHex[i], 16);

dooglus commented 7 years ago

Also, I wonder if it's worth pointing out on the page that the algorithm used to create the seed from the supplied entropy isn't yet set in stone.

I can imagine someone using a deck of cards as their backup, "knowing" that your tool will always be able to regenerate their BIP39 seed words from the order of the deck if they need it. Changing how the supplied entropy is converted to a BIP39 seed will break his assumption.

iancoleman commented 7 years ago

Great suggestions.

I can imagine someone using a deck of cards as their backup

I see your comment in How can I encode my BIP39 seed into the order of cards of a deck and agree it's worth noting that the entropy feature is under development and may change.

Fix hashing iterations
Add warning to the ui about possible changes to the entropy feature

iancoleman commented 7 years ago

See

https://github.com/iancoleman/bip39/commit/9d33c8925dc07783e2cd819c91eee1144598fcf2 - Card entropy uses secure hashing when > 256 bits
https://github.com/iancoleman/bip39/commit/bf9bdfd20ff1997dc5c53ba5a25f6441b9edd3b7 - Note that entropy must not be stored

I figured there is already a strong warning above the entropy field, and adding more text there was not as helpful as a detailed explanation with the rest of the entropy details.

Since there are many reasons not to store entropy (algorithmic changes during development, lack of standardization, interoperability, security etc) I don't explicitly warn of the reasons.

iancoleman commented 7 years ago

This is an interesting android app using cards to generate addresses - deckwallet

Source code for MainActivity.java

Doesn't seem to be BIP32 compatible so this tool will probably never be compatible with deckwallet.

See getAddress which uses getSeed which uses generateBrainWifKey

Also see How It Works

iancoleman commented 7 years ago

@dooglus, are you content for this issue to be closed now? I think the feature is well designed and robust thanks to the conversation that came from these issues.

iancoleman / bip39

using shuffled cards loses entropy #33