Open dan-blanchard opened 11 years ago
Underlying dawgdic C++ library doesn't allow zero bytes in keys (because it uses null-terminated strings) - that's why the wrapper disallows arbitrary bytestring keys. It stores utf8-encoded unicode and utf8 is guaranteed not to have zero bytes. Struct-packed 'LLLL' numbers will have zero bytes for sure.
But the wrapper supports binary values (in BytesDAWG and RecordDAWG). This is implemented using a hack: binary data is encoded/decoded to/from base64.
I don't know the best solution for N-Gram storage.
You may try https://github.com/kmike/marisa-trie - its base storage is not as efficient and fast as DAWG but it supports perfect hashing (each key is assigned an unique number "for free" - storing unique numbers in DAWG may be costly) and it also supports binary keys (including keys containing zero bytes).
Another option is to just store N-grams in DAWG unpacked (e.g. whitespace-separated); I'd not be surprised if this will be quite efficient.
I'm trying to store a large number of 4-grams in memory to speed up computation as part of a larger program I'm working on, and I think I found a reasonable way to do it using your DAWG library, but I ran into an issue because you currently don't allow byte string keys (I'm not sure if this is a limitation of your wrapper or the underlying library).
The idea I had was to have one DAWG that maps from words to unique numbers, and then store the 4-gram counts in another DAWG by converting the 4-grams to struct-packed byte strings of the format 'LLLL' where each packed item in the byte string is one of the unique numbers from the word DAWG. However, I can't currently do this because the keys are required to be unicode.
Is this something that would be feasible to add in the future?