pytries / DAWG

DAFSA-based dictionary-like read-only objects for Python. Based on `dawgdic` C++ library.
http://dawg.readthedocs.org
MIT License
299 stars 46 forks source link

Feature Request: Ability to have byte string keys #3

Open dan-blanchard opened 11 years ago

dan-blanchard commented 11 years ago

I'm trying to store a large number of 4-grams in memory to speed up computation as part of a larger program I'm working on, and I think I found a reasonable way to do it using your DAWG library, but I ran into an issue because you currently don't allow byte string keys (I'm not sure if this is a limitation of your wrapper or the underlying library).

The idea I had was to have one DAWG that maps from words to unique numbers, and then store the 4-gram counts in another DAWG by converting the 4-grams to struct-packed byte strings of the format 'LLLL' where each packed item in the byte string is one of the unique numbers from the word DAWG. However, I can't currently do this because the keys are required to be unicode.

Is this something that would be feasible to add in the future?

kmike commented 11 years ago

Underlying dawgdic C++ library doesn't allow zero bytes in keys (because it uses null-terminated strings) - that's why the wrapper disallows arbitrary bytestring keys. It stores utf8-encoded unicode and utf8 is guaranteed not to have zero bytes. Struct-packed 'LLLL' numbers will have zero bytes for sure.

But the wrapper supports binary values (in BytesDAWG and RecordDAWG). This is implemented using a hack: binary data is encoded/decoded to/from base64.

I don't know the best solution for N-Gram storage.

You may try https://github.com/kmike/marisa-trie - its base storage is not as efficient and fast as DAWG but it supports perfect hashing (each key is assigned an unique number "for free" - storing unique numbers in DAWG may be costly) and it also supports binary keys (including keys containing zero bytes).

Another option is to just store N-grams in DAWG unpacked (e.g. whitespace-separated); I'd not be surprised if this will be quite efficient.

kmike commented 11 years ago

See also: https://code.google.com/p/dawgdic/issues/detail?id=4&can=1