Change encoding used to read wordlist files to recognize BOM

bgillesp commented 3 years ago

The Unicode encoding UTF-8 allows an optional character at the beginning of a file called the "byte order mark", or BOM. In UTF-16 or UTF-32, this character indicates whether the byte order is big-endian or little-endian. The UTF-8 standard does not require or recommend the use of a BOM; however, a BOM may still appear in UTF-8 files, for example because some text editors add one when saving.

The BIP-39 wordlist file french.txt currently includes the byte order mark U+FEFF at the beginning of the file. Mnemonic.__init__ reads this file with the 'utf-8' encoding, which does not strip a leading BOM, and thus produces a Python list with first entry '\ufeffabaisser' instead of the correct string 'abaisser'. In particular, this causes valid French mnemonic seed phrases starting with 'abaisser' to be incorrectly rejected by the Mnemonic.check validation function.
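To illustrate the behavior described above, here is a minimal sketch. The two-word file is a hypothetical stand-in for french.txt, and the reading loop is an assumption for illustration, not a copy of the code in Mnemonic.__init__.

```python
import os
import tempfile

# Hypothetical stand-in for french.txt: two words preceded by the UTF-8 BOM bytes.
data = b"\xef\xbb\xbfabaisser\nabandon\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(data)
    path = tmp.name

try:
    # Plain 'utf-8' decodes the BOM bytes as the character U+FEFF and keeps it.
    with open(path, "r", encoding="utf-8") as f:
        wordlist = [line.strip() for line in f]
finally:
    os.remove(path)

# str.strip() does not remove U+FEFF, so the first entry keeps the BOM prefix.
print(repr(wordlist[0]))       # '\ufeffabaisser'
print("abaisser" in wordlist)  # False
```

Because the lookup for 'abaisser' fails, any validation that tests membership in this list would reject the word.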

The commit in this pull request changes the encoding used to read the wordlist from 'utf-8' to 'utf-8-sig', which causes Python to strip a leading BOM from UTF-8 files if one is present, and fixes the incorrect first entry in the French language Mnemonic object's wordlist.
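The difference between the two codecs can be sketched as follows. The helper read_words and the sample byte strings are hypothetical and exist only for this illustration.

```python
from typing import List

# Hypothetical sample data: the same two words with and without a UTF-8 BOM.
with_bom = b"\xef\xbb\xbfabaisser\nabandon\n"
without_bom = b"abaisser\nabandon\n"


def read_words(raw: bytes, encoding: str) -> List[str]:
    """Decode raw bytes with the given codec and split into words."""
    return raw.decode(encoding).split()


# 'utf-8' keeps the BOM as part of the first word.
print(read_words(with_bom, "utf-8"))

# 'utf-8-sig' drops a leading BOM and leaves BOM-free input unchanged.
print(read_words(with_bom, "utf-8-sig"))
print(read_words(without_bom, "utf-8-sig"))
```

Since 'utf-8-sig' treats the BOM as optional when decoding, switching to it should not affect wordlist files that have no BOM.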

prusnak commented 3 years ago

I removed the BOM from the French wordlist in ae726b39ad74323d7128f8991feb7e36f5b8a16c

I think that's a much better fix, because it also prevents issues with alternative implementations using the same wordlist file.
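For anyone who wants to check other wordlist files for the same problem, a rough sketch of detecting and removing a leading UTF-8 BOM at the byte level follows. The function name strip_utf8_bom is hypothetical, and this is not necessarily how the referenced commit was produced.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 BOM, if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# Hypothetical sample content standing in for a wordlist file's bytes.
sample = codecs.BOM_UTF8 + b"abaisser\nabandon\n"
cleaned = strip_utf8_bom(sample)

print(sample.startswith(codecs.BOM_UTF8))   # True
print(cleaned.startswith(codecs.BOM_UTF8))  # False
```

Applying the function to content that has no BOM returns it unchanged, so it is safe to run over every wordlist.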

Thanks for the report!

bgillesp commented 3 years ago

Great, that will do it -- I was hesitant to mess around with the wordlist file itself since it's so standardized, but your fix actually brings the file in line with the standard wordlist at the BIP-0039 repository. Thanks for the quick turnaround!

trezor / python-mnemonic

Change encoding used to read wordlist files to recognize BOM #83