iancoleman / bip39

A web tool for converting BIP39 mnemonic codes
https://iancoleman.io/bip39/
MIT License

Proposal to Fix Advanced Entropy Section, w/ Resulting Mnemonic to avoid Bit Loss (i.e. bits 129-159) #289

Open hatgit opened 5 years ago

hatgit commented 5 years ago

Using the advanced section with "Use Raw Entropy (3 words per 32 bits)" selected, I am noticing that there is either a loss of entropy bits or those bits are not being displayed (I am guessing it is the latter), which in either case must be an error.

Problem/Bug in Advanced Entropy Section:

For example, if pasting 128 bits, we know the tool will calculate the 4-bit checksum correctly to arrive at a total of 132 bits, and the resulting mnemonic represents all 132 of those bits. However, if pasting the same bits plus 1 extra bit (129 total), a different mnemonic is displayed even though raw binary bits 1 through 128 remain the same: the 12 words shown no longer match, and their index values no longer represent the raw binary that was entered. I believe both the index values and the corresponding words are incorrect whenever there are fewer than the required number of bits for a full word. If so, I would recommend displaying an error such as "not enough bits for additional words" and reverting to the original mnemonic and its entropy, or something to that effect.

Subtle Risks That Could be Unobvious:

The problem with this error is that anyone using the tool to back up some data, such as a password or key, into mnemonic format will lose bits 129 through 159, or any other range where the total bits entered modulo 32 does not equal zero. Those extra bits are lost, unless I am mistaken here. Below are screenshots to help investigate; this can be easily replicated with any entropy by comparing the raw binary with the words and index values as an extra bit is added, as described above.

Correct working with 128 bits:

[screenshot: example1]

Example of the bug/error with the same 128 bits plus 1 extra bit (notice the highlighted first 11-bit group is the same as in the prior screenshot but incorrectly maps to a different word):

[screenshot: example2]

Note: After some observation, I noticed that the extra bits added have a waterfall effect that causes the existing bits to shift, such that the correct word "fetch" (index 682, binary 01010101010) becomes "primary" (index 1365, binary 10101010101). This is not visible in the raw binary section, only by comparing the word changes, as in the above screenshot.

Suggestion:

Those bits appear to be lost and not mapped anywhere? I have not analyzed the part of the code where this happens. I think this error arises mainly because at least 32 bits are needed per 3 words, so bits 129 through 159 are not enough. It would be better if it were one word per 11 bits, so a new word could be shown even if only 1 bit was entered, by treating it as an 11-bit number (i.e. 1 would be 00000000001), as in the sketch below. That way, if someone had extra bits, rather than those being lost, the resulting mnemonic would correctly represent all bits, as it should.
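A minimal sketch of that idea (purely illustrative, not the tool's code; looking up each index in the English BIP39 wordlist is assumed to happen elsewhere):

```javascript
// Illustrative only: split the raw bits into 11-bit groups and left-pad the
// final partial group with zeros, so that no entered bit is discarded.
function bitsToWordIndexes(rawBits) {
  const indexes = [];
  for (let i = 0; i < rawBits.length; i += 11) {
    const group = rawBits.slice(i, i + 11).padStart(11, "0");
    indexes.push(parseInt(group, 2));
  }
  return indexes;
}

// 12 bits: one full group plus a single leftover bit treated as 00000000001
console.log(bitsToWordIndexes("01010101010" + "1")); // [682, 1] -> "fetch", "ability"
```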

With respect to @iancoleman, who has already put in multiple warnings advising users that these advanced features should not be used unless they know what they are doing, I think having the tool work with such a bug could harm unsuspecting users. Adding an error message or another workaround to resolve this would also help make the tool safer and more useful.

hatgit commented 5 years ago

p.s. @iancoleman It would be great to set up a grant for this repo on Gitcoin, as any funding donated by the community could be used to pay devs (yourself included) to speed up work here. In addition to what you have donated to EFF, I thought this grant idea could help support the app, plus Gitcoin is a great resource. Here is the link: https://gitcoin.co/grants/

iancoleman commented 5 years ago

The entropy is discarded from the start (i.e. the newest 32N bits of entropy are used).

See https://github.com/iancoleman/bip39/blob/717a3ffcb285233e2d029ac5f544f70410d22d3d/src/js/index.js#L1296-L1299; note the comment on L1296 says "trailing entropy" but the implementation is actually discarding the leading entropy, so the comment is incorrect in this case.

So by adding a 129th bit of entropy to the end, the tool will use the newest 128 bits (bits 2-129), not the first 128 (bits 1-128).
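For illustration, a rough paraphrase of what the linked lines do (not a verbatim copy of index.js), together with the word-index shift it produces if the entropy alternates 0/1 as in the screenshots above:

```javascript
// Paraphrase of the truncation described above (not the tool's exact code):
// keep only the newest multiple-of-32 bits, discarding the leading ones.
function truncateEntropy(bits) {
  const bitsToUse = Math.floor(bits.length / 32) * 32;
  return bits.slice(bits.length - bitsToUse); // leading bits are dropped
}

const bits128 = "01".repeat(64);   // 128 alternating bits
const bits129 = bits128 + "1";     // same bits plus one extra bit

// The first 11-bit group shifts from 01010101010 (682, "fetch")
// to 10101010101 (1365, "primary"), matching the observation above.
console.log(truncateEntropy(bits128).slice(0, 11)); // "01010101010"
console.log(truncateEntropy(bits129).slice(0, 11)); // "10101010101"
```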

I don't think adding magic padding to the user entropy is wise. Better to discard, assuming the newest entropy is 'better' than the older entropy (as per the current implementation). I'm open to being convinced otherwise though.

hatgit commented 5 years ago

Hi Ian, thanks for the explanation here and the willingness to reconsider; it makes sense now. I have since renamed this issue as a proposal to avoid the bit loss. I strongly think it is worth ensuring that zero bits are discarded in this case, and that the best options would be to either:

A) treat any non-zero remainder from "total bits % 32" as not permissible and throw an error, or ideally:

B) treat any non-zero remainder from "total bits % 11" as an 11-bit integer (left-padded with zeroes).

Question: Would this affect the checksum process? I think it does for 13-word and 14-word mnemonics, as you can't have a checksum with a fractional number of bits: word count / 3 should equal the checksum bits as a whole number, just as initial entropy bits / 32 should. Perhaps the best approach for those scenarios is to not include a checksum and treat those non-standard word lengths as checksum-free?
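For reference, the standard BIP39 sizing makes the fractional-checksum problem concrete (a quick arithmetic check, not code from the tool):

```javascript
// BIP39 sizing: CS = ENT / 32 checksum bits, MS = (ENT + CS) / 11 words.
// Only ENT values that are multiples of 32 give whole numbers for both.
function bip39Sizes(entBits) {
  const cs = entBits / 32;
  const ms = (entBits + cs) / 11;
  return { cs, ms };
}

console.log(bip39Sizes(128)); // { cs: 4, ms: 12 }             -> valid
console.log(bip39Sizes(143)); // { cs: 4.46875, ms: 13.40625 } -> no whole-number checksum
```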

Reasoning:

I think the strong reasons to consider this change are: 1) I don't think it is obvious to users that entropy is otherwise discarded, as all bits remain pasted and visible, and are even reflected in the raw entropy section.

2) If a user has pasted bits into the tool, then all such bits should be considered "good bits" and there should be no bad bits left (any bad bits should be removed manually, at the user's discretion, rather than discarded by the tool automatically).

3) Using all bits would make the process more clear, as a mnemonic should reflect all underlying bits used from the initial entropy, whether pasted or generated by the tool, and would allow users to also use the app to make a mnemonic for something non-wallet related (i.e. a password or other data, plaintext or ciphertext).

iancoleman commented 5 years ago

Regarding bit loss, this can be resolved by selecting a fixed number of words, which will then use all bits of entropy to generate a fixed length mnemonic. See https://github.com/iancoleman/bip39/issues/33 and surrounding discussion (main point is quoted below)

Rather than scaling like this, where only the leading bits affect the outcome, I would use an SHA-2 hash of all the bits and take the first N bits of the result, so that all the input bits affect the outcome [when generating a fixed length mnemonic].
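A sketch of the quoted approach (an illustration of the idea only, not the tool's implementation; it uses Node's built-in crypto module and assumes the raw entropy is a string of '0'/'1' characters):

```javascript
const crypto = require("crypto");

// Hash all of the user's raw bits so every input bit affects the result,
// then keep the first N entropy bits required by the chosen mnemonic length.
function fixedLengthEntropy(rawBits, wordCount) {
  const entropyBits = (wordCount * 11 * 32) / 33; // 12 words -> 128, 24 words -> 256
  const digest = crypto.createHash("sha256").update(rawBits).digest();
  const digestBits = [...digest]
    .map((byte) => byte.toString(2).padStart(8, "0"))
    .join("");
  return digestBits.slice(0, entropyBits); // assumes entropyBits <= 256
}

// 129 pasted bits still influence every one of the 128 entropy bits
console.log(fixedLengthEntropy("1".repeat(129), 12));
```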

To respond directly to the questions:

I don't think it is obvious to users that entropy is otherwise discarded, as all bits remain pasted and visible, and are even reflected in the raw entropy section.

Agreed, the raw entropy reflects what is filtered, but not what is actually used for the entropy after truncating. A second field called 'Truncated entropy' should be added so there's no confusion.

If a user has pasted bits into the tool, then all such bits should be considered "good bits" and there should be no bad bits left

I agree that in general all bits should be good bits. If the user wants to use all bits they should select a fixed-length mnemonic, or enter entropy of length 32N. Rather than show a warning, I think having a second field for 'Truncated entropy' will be a clear indicator when some entropy has been discarded. A warning seems like it would create unwarranted fear, and there would be some cases where it would not be possible to avoid one (e.g. using cards, which is 4.8 bits per card).

Using all bits would make the process more clear, as a mnemonic should reflect all underlying bits used from the initial entropy

Users should select a fixed length mnemonic instead of Use raw entropy. Does the extra 'Truncated entropy' field add enough clarity to satisfy this requirement?

My proposed changes:

- Add a second 'Truncated entropy' field so it is clear when some of the entered entropy has been discarded.
- Reword the fixed-length mnemonic options to state how many bits of entropy each one uses.

Do these changes address the issue? I want to avoid adding more warnings; there are already so many, and this doesn't seem like a situation of danger so much as one of ignorance (which should be fixed with more information rather than a warning).

hatgit commented 5 years ago

Hi Ian, thanks for the feedback. I like the direction you are going with the solution, and I think that, at a minimum, showing the entropy that has been truncated will indeed make the situation clearer to users, so they can make better sense of the data the tool is showing them and more easily deduce what is happening. This way, if someone pastes more bits than are needed but not enough to reach the next fixed-length threshold, they will at least see (more easily) which bits are being used and which are not. Great idea.

P.S. I think clearer option text (for non-technical users) would be: "12 words - only first 128 bits are used", or something to that effect.

Potential future considerations: I still think it would be best to use all entropy bits and simply not include a checksum in those cases, instead informing users that their resulting mnemonic is not BIP39 compliant in terms of checksum (since the checksum would be absent, e.g. for a 13-word mnemonic) but can otherwise still be used to back up their data and all of the pasted bits (i.e. if 143 bits were pasted). Any remaining bits in the final word group after "total bits % 11" would be padded so the group is treated as an 11-bit number for the last word. (The only issue with this is that a user trying to recover the original binary data from the mnemonic would potentially have to discard those padded zeroes, of which there could be at most ten, as in 00000000001.)

Although I realize there could also be confusion for those pasting, say, 129 bits, as the tool would end up providing a 12-word mnemonic that is not BIP39 compliant in terms of checksum, unless it was made very clear that the checksum is missing (i.e. some warning, which I know is not ideal). Perhaps users could be given the option in that case, presented in a more educational manner. For example, if a user pasted 129 bits, the tool could ask whether to discard the last bit and use the first 128 bits to create a checksummed mnemonic, or whether to use all 129 bits and zero-pad the final word group to reach 132 bits for a checksum-free one, as sketched below. That seems to me like a logical next-gen solution for the advanced section, unless it could cause other issues that are not obvious at the moment. I look forward to discussing in due time and hearing any feedback. Thanks.
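A hypothetical illustration of that choice (the function and behaviour are my own sketch, not a feature of the tool):

```javascript
// Given bits whose length is not a multiple of 32, offer two paths:
// a BIP39-compliant mnemonic from the first 32N bits, or a checksum-free
// mnemonic that keeps every bit and left-pads the final 11-bit word group.
function offerChoices(rawBits) {
  const usable = Math.floor(rawBits.length / 32) * 32;
  const leftover = rawBits.length % 11;
  const checksumFreeBits = leftover === 0
    ? rawBits
    : rawBits.slice(0, rawBits.length - leftover) +
      rawBits.slice(rawBits.length - leftover).padStart(11, "0");
  return {
    bip39Compliant: rawBits.slice(0, usable), // e.g. 129 bits -> first 128 (checksum added later)
    checksumFree: checksumFreeBits,           // e.g. 129 bits -> 132 bits, no checksum
  };
}

const choices = offerChoices("1".repeat(129));
console.log(choices.bip39Compliant.length, choices.checksumFree.length); // 128 132
```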