lhmei / dawgdic

Automatically exported from code.google.com/p/dawgdic
BSD 3-Clause "New" or "Revised" License
0 stars 0 forks source link

Binary keys #4

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hello,

It seems there is no way to store binary data containing 0 bytes in DAWG, at 
least it seems that Completer doesn't support it (completer.key() and 
completer.length() returns values truncated at first 0 byte).

My use case is to store binary payload as a part of keys. The keys are 

<utf8-encoded unicode key> + chr(255) + <binary_value>

this will allow implementing mapping from unicode to a list of possible binary 
values using DAWG where values will be compressed just like keys; in order to 
get all values Completer class may be used. 

Zero bytes are not an issue for the utf8 keys because utf8 text is guaranteed 
to not include them. chr(255) also can't be a part of utf8-encoded string so it 
is fine as a separator. But zeroes in binary data renders this scheme 
non-working. 

Is there a workaround? Thanks.

Original issue reported on code.google.com by kmik...@gmail.com on 21 Aug 2012 at 10:19

GoogleCodeExporter commented 9 years ago
Small clarification: I was not correct about utf8: utf8 may include zero byte 
because NUL unicode codepoint is represented as zero byte in utf8. There are no 
other possible zero bytes in UTF8 (multibyte codepoints have some bits set in 
all bytes) so this is usually not an issue.

Original comment by kmik...@gmail.com on 25 Aug 2012 at 8:31

GoogleCodeExporter commented 9 years ago
In a Python wrapper I've resolved this by encoding <binary_value> to base64. 
Encoding/decoding speed doesn't become bottleneck (when using 
http://libb64.sourceforge.net/ decoding is orders of magnitude faster than 
Completer iteration speed).

Original comment by kmik...@gmail.com on 4 Sep 2012 at 9:38

GoogleCodeExporter commented 9 years ago
I'm sorry that I couldn't help you.

As you mentioned, Completer uses '\0' as the sentinel and it seems to be 
impossible to fix this without changing the structure of its auxiliary data. 
Also, the change will increase the size of the auxiliary data.

So, now I'm planning to implement a new Completer for binary keys.

Original comment by susumu.y...@gmail.com on 7 Sep 2012 at 12:10

GoogleCodeExporter commented 9 years ago
Thanks! This would be a very useful feature.

Original comment by kmik...@gmail.com on 7 Sep 2012 at 12:42

GoogleCodeExporter commented 9 years ago
I'm really sorry.
It has turned out that it is impossible to support binary keys without changing 
its core data structure (major update).

When I posted the last comment, I expected that dawgdic can support binary keys 
without changing Dictionary.
However, I've found that not only Completer but also Dictionary require 
modification.
In addition, the modification degrades the space efficiency and the time 
efficiency of dawgdic.
This drawback may be unacceptable.

In short, base64-encoding may support binary values better than natively 
supporting binary keys.

Original comment by susumu.y...@gmail.com on 14 Sep 2012 at 12:16

GoogleCodeExporter commented 9 years ago
Thanks for all your work on this. Binary keys would be good but base64 works 
good enough in practice. So maybe the limitation should be documented somewhere?

Original comment by kmik...@gmail.com on 14 Sep 2012 at 12:20

GoogleCodeExporter commented 9 years ago
Thanks for your comments.
I'll update dawgdic to reject binary keys and then update documentation.

Original comment by susumu.y...@gmail.com on 14 Sep 2012 at 12:40

GoogleCodeExporter commented 9 years ago
Looks like a good plan.

Original comment by kmik...@gmail.com on 14 Sep 2012 at 12:52

GoogleCodeExporter commented 9 years ago
I've finished the update of dawgdic.
Also, the documentation has been updated.

Original comment by susumu.y...@gmail.com on 15 Sep 2012 at 6:51