vbuterin / pybitcointools

SImple, common-sense Bitcoin-themed Python ECC library
1.28k stars 856 forks source link

mnemonic bisect_left will fail (ungracefully) on other bip39 wordlists #132

Open wizardofozzie opened 8 years ago

wizardofozzie commented 8 years ago

Bisect only works for sorted wordlists; not an issue if only english.txt is used, however, forks using standard bip39 wordlists will fail most ungracefully.

eg (Japanese wordlist, forum post)

cryptid11 commented 8 years ago

Yes, this line is wrong

wordlist_english=list(open(os.path.join(os.path.dirname(os.path.realpath(file)),'english.txt'),'r'))

Steve132 commented 8 years ago

however, forks using standard bip39 wordlists will fail most ungracefully.

english.txt is the standard bip39 english wordlist, and ALL the wordlists being lexographically sorted is a part of the bip39 spec

  • the wordlist is sorted which allows for more efficient lookup of the code words (i.e. implementations can use binary search instead of linear search)
    • this also allows trie (a prefix tree) to be used, e.g. for better compression

However, you aren't wrong about the bisection algorithm being incorrect, but for several other completely different reasons. I'm going to submit a patch soon.

Furthermore, english.txt is only provided for convenience. If you want to use another wordlist, you can pass it as a sorted list to the 'wordlist' parameter in most of the mnemonic functions

vbuterin commented 8 years ago

Perhaps it might be best for convenience reasons to include an assert loop in the code that makes sure a list is sorted before using it? This seems to have come up many times now.

On Wed, Jan 20, 2016 at 4:58 PM, Steven Braeger notifications@github.com wrote:

however, forks using standard bip39 wordlists will fail most ungracefully.

english.txt is the standard bip39 english wordlist, and the wordlists being lexographically sorted are a part of the bip39 spec https://github.com/bitcoin/bips/blob/master/bip-0039.mediawiki#Wordlist

  • the wordlist is sorted which allows for more efficient lookup of the code words (i.e. implementations can use binary search instead of linear search)
    • this also allows trie (a prefix tree) to be used, e.g. for better compression

However, you aren't wrong about the bisection algorithm being incorrect, but for several other completely different reasons. I'm going to submit a patch soon.

Furthermore, english.txt is only provided for convenience. If you want to use another wordlist, you can pass it as a sorted list to the 'wordlist' parameter in most of the mnemonic functions

— Reply to this email directly or view it on GitHub https://github.com/vbuterin/pybitcointools/issues/132#issuecomment-173374575 .

wizardofozzie commented 8 years ago

@vbuterin In my fork there's an assert for bip39_check. I believe the checksum should flag mis-sorted word lists. (fwiw my code diverges for bip39 because it was written prior to this branch's PR)

Steve132 commented 8 years ago

Actually, I've been doing some research with this and what should really happen is that the search needs to degrade to a linear search in the event that sorted(wordlist) != wordlist...OR a hashtable should be used to compute the index based on the position in the file..

This is because in order for the wallet to match the spec, the search index needs to match the index of the word AS SEEN IN THE FILE...whereas implementing binary search or resorting the input list will use the CURRENT LOCALE for the comparisons, thus possibly permuting the file order and thus making the detected indices incorrect.

The correct behavior will be one of the following options:

1) Always implement linear search (this is what my current version does in my branch) Upside: this is simple and always works. Downside, it's O(n) for every lookup 2) Implement a hash table that maps words->indices. Upside: This is simple-ish. Downside: it's O(n) for every lookup unless we implement it as a pre-processing stage (like have an internal wordlist type that indicates a correct hash table, and if the wordlist argument is a list convert it and cache the result or something) 3) Detect if sorted(list) == list when a non-standard list is discovered. If it does, use the binary search path. Otherwise, use a linear list. This result will also have to be cached.

On Thu, Jan 21, 2016 at 6:44 AM, WizardOfOzzie notifications@github.com wrote:

@vbuterin https://github.com/vbuterin In my fork https://github.com/wizardofozzie/pybitcointools/blob/master/bitcoin/mnemonic.py#L101 there's an assert for bip39_check. I believe the checksum should flag mis-sorted word lists. (fwiw my code diverges for bip39 because it was written prior to this branch's PR) NB.

— Reply to this email directly or view it on GitHub https://github.com/vbuterin/pybitcointools/issues/132#issuecomment-173546537 .

vbuterin commented 8 years ago

3) Detect if sorted(list) == list when a non-standard list is discovered. If it does, use the binary search path. Otherwise, use a linear list. This result will also have to be cached.

If you're computing the sorted list anyway, would it not be simpler to just cache the sorted list for each list?

On Thu, Jan 21, 2016 at 1:47 PM, Steven Braeger notifications@github.com wrote:

Actually, I've been doing some research with this and what should really happen is that the search needs to degrade to a linear search in the event that sorted(wordlist) != wordlist...OR a hashtable should be used to compute the index based on the position in the file..

This is because in order for the wallet to match the spec, the search index needs to match the index of the word AS SEEN IN THE FILE...whereas implementing binary search or resorting the input list will use the CURRENT LOCALE for the comparisons, thus possibly permuting the file order and thus making the detected indices incorrect.

The correct behavior will be one of the following options:

1) Always implement linear search (this is what my current version does in my branch) Upside: this is simple and always works. Downside, it's O(n) for every lookup 2) Implement a hash table that maps words->indices. Upside: This is simple-ish. Downside: it's O(n) for every lookup unless we implement it as a pre-processing stage (like have an internal wordlist type that indicates a correct hash table, and if the wordlist argument is a list convert it and cache the result or something) 3) Detect if sorted(list) == list when a non-standard list is discovered. If it does, use the binary search path. Otherwise, use a linear list. This result will also have to be cached.

On Thu, Jan 21, 2016 at 6:44 AM, WizardOfOzzie notifications@github.com wrote:

@vbuterin https://github.com/vbuterin In my fork < https://github.com/wizardofozzie/pybitcointools/blob/master/bitcoin/mnemonic.py#L101

there's an assert for bip39_check. I believe the checksum should flag mis-sorted word lists. (fwiw my code diverges for bip39 because it was written prior to this branch's PR) NB.

— Reply to this email directly or view it on GitHub < https://github.com/vbuterin/pybitcointools/issues/132#issuecomment-173546537

.

— Reply to this email directly or view it on GitHub https://github.com/vbuterin/pybitcointools/issues/132#issuecomment-173671152 .

Steve132 commented 8 years ago

No, because then the result would be incorrect. The whole point is that the spec says that the 11 bits correspond to the index of the word in the given wordlist file, in the order given in the file. This is important, because the spec doesn't specify how the list is supposed to be searched, only that the result must be consistent with the order given in the standard file for compatibility.

They are supposed to be sorted in the file, but the bug appears because the sort order for comparing two strings is locale-dependant...that is, if we compute cached_lists[id(wordlist)]=sorted(wordlist), and the code runs in a locale different than the one that generated the original standard file, the sort order will be WRONG and produce the incorrect 11 bits on a lookup, because the words will appear at a different index than the one in the file...which is unacceptable for obvious reasons

You could compute cached_hashtables[id(wordlist)]=dict([(value,index) for index,value enumerate(wordlist)]) and that would be correct AND be O(1), but that's some significant complexity when linear search through 2048 strings is probably not THAT expensive relative to how often this is likely to be needed, ESPECIALLY in python where it's unlikely to be a high-performance embedded application anyway.

Steve132 commented 8 years ago

I fixed this with a pull request https://github.com/vbuterin/pybitcointools/pull/141

It uses a hash table for the index lookup and word check operations which is O(1) and doesn't require locale-dependant string comparisons to do either operation. All functions which expect a wordlist can use this type as the wordlist for improved performance, but they all degrade gracefully to O(n) cases if the user provides a list as the wordlist instead of an instance of the type.

It also adds support for ALL the languages in bip39, which are now selectable. English remains the default, but you can use any of the others without needing to load them yourself (e.g. bitcoin.words_verify(spanish_phrase,wordlist=bitcoin.wordlists['spanish'])

wizardofozzie commented 8 years ago

@Steve132 I've just taken a cursory glance, but it looks fantastic so far, great work! Take a look at this discussion. Unfortunately wordlists will always have ongoing issues, both because of different standards (electrum 2.0 / bip39) and locale

Steve132 commented 8 years ago

Unfortunately wordlists will always have ongoing issues, both because of different standards (electrum 2.0 / bip39) and locale

This patch fixed the locale-based search issues because it uses only the original index of the standard wordlist, NOT the index calculated from a sorted wordlist or a binary search with string comparisons. It remains fast by using a hash table.

I think someone should patch the BIP39 specification where it says the lists must be sorted so as to enable binary search, as we've seen binary-search is not possible to do in a locale-independant way. Instead perhaps the specification should have a recommendation that a hash function be used, perhaps even including C code and look up tables for a minimum-perfect-hash implementation for all languages.

reiven commented 8 years ago

@wizardofozzie hi! i'm trying to use your fork, but i found the tests are failing. can you add issues to your proyect, so users can report this kind of errors? thanks!

wizardofozzie commented 8 years ago

@reiven Absolutely, mate. I've been dabbling in iOS Pythonista 2.0, Pythonista 1.6 and another branch, so I'll be consolidating and testing the code ASAP (~1-2 days maximum)