pytries / DAWG

DAFSA-based dictionary-like read-only objects for Python. Based on `dawgdic` C++ library.
http://dawg.readthedocs.org
MIT License

struct.error: bad char in struct format #27

Open farscape2012 opened 8 years ago

farscape2012 commented 8 years ago

Hi,

I am trying to create a RecordDAWG object whose values are tuples containing different data types, but I get an error.

Does RecordDAWG only accept numeric tuples?

    data = [(u'key1', (1, b'a')), (u'key2', (2, b'b')), (u'key3', (3, b'c'))]

    dawg.RecordDAWG(data)
    Traceback (most recent call last):
      File "", line 1, in
      File "dawg.pyx", line 830, in dawg.RecordDAWG.__init__ (src/dawg.cpp:13810)
    struct.error: bad char in struct format

Br, Eric

superbobry commented 8 years ago

Hi Eric, you must specify a struct format to use RecordDAWG. Something like "=i1s" should work for your data.
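To see what a format such as "=i1s" describes, here is a minimal sketch that uses only the standard `struct` module. The variable names are illustrative, and the `dawg.RecordDAWG(fmt, data)` call in the final comment follows the DAWG documentation but is not executed here.

```python
import struct

# "=i1s": standard sizes, no padding, a 4-byte int followed by a 1-byte string.
fmt = "=i1s"
record = struct.Struct(fmt)

# Pack one value tuple from the question and unpack it again.
packed = record.pack(1, b"a")
unpacked = record.unpack(packed)

# Every record packed with this format occupies the same number of bytes.
record_size = record.size

# With the dawg package installed, the same format string would be passed
# ahead of the data, e.g. dawg.RecordDAWG(fmt, data).
```

The format string comes first and the data second, so each value tuple has to match the format field by field.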

farscape2012 commented 8 years ago

Thanks, I've figured it out. But my application is a little more complicated. If I understand correctly, RecordDAWG always stores fixed-size binary data. My strings (the second element in the tuple) vary in length; an example is shown below. When I look up a key, the returned value has a lot of "\x00" at the end. Is it possible for dawg to support variable-size binary data? Is there any plan to move in that direction?

Example: 'Hello\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'.
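The trailing "\x00" bytes come from how `struct` handles fixed-width string fields: a value shorter than the field is padded with NUL bytes. A small sketch, assuming a hypothetical format "=i40s" (the format actually used is not shown in this thread):

```python
import struct

# Hypothetical format: a 4-byte int followed by a 40-byte string field.
fixed = struct.Struct("=i40s")

# b"Hello" is shorter than 40 bytes, so pack() pads it with NUL bytes.
packed = fixed.pack(1, b"Hello")
number, text = fixed.unpack(packed)

# Stripping the padding after lookup recovers the original string.
# Caveat: this also removes NUL bytes that were genuinely part of the value.
stripped = text.rstrip(b"\x00")
```

Stripping only hides the padding on the way out. The padded bytes are still what gets stored, so every record takes the full field width.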

kmike commented 8 years ago

@farscape2012 you can use BytesDAWG and encode/decode the data yourself.

farscape2012 commented 8 years ago

Thanks kmike for the suggestion. I think I had tried BytesDAWG. It seems the value must be bytes; it does not accept an int.

    data = [(u'key1', b'value1'), (u'key2', b'value2'), (u'key1', 213)]
    bytes_dawg = dawg.BytesDAWG(data)
    Traceback (most recent call last):
      File "", line 1, in
      File "dawg.pyx", line 480, in dawg.BytesDAWG.__init__ (src/dawg.cpp:8932)
      File "dawg.pyx", line 296, in dawg.CompletionDAWG.__init__ (src/dawg.cpp:6625)
      File "dawg.pyx", line 42, in dawg.DAWG.__init__ (src/dawg.cpp:2050)
      File "dawg.pyx", line 479, in genexpr (src/dawg.cpp:8735)
    TypeError: Expected bytes, got int

kmike commented 8 years ago

@farscape2012 yes, values should be bytes. The only thing RecordDAWG does differently from BytesDAWG is that it converts data to/from bytes using a predefined record format (it uses https://docs.python.org/3/library/struct.html from the standard library). With BytesDAWG you need to convert data to/from bytes yourself.
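One possible way to do that conversion by hand, as a sketch rather than anything the library prescribes: pack the integer into a fixed 4-byte header and append the string unpadded, so each value is only as long as it needs to be. The helper names `encode_value` and `decode_value` are hypothetical.

```python
import struct

# Fixed-size header holding the integer; little-endian, 4 bytes.
HEADER = struct.Struct("<i")

def encode_value(number, text):
    """Return the int header followed by the raw, unpadded bytes."""
    return HEADER.pack(number) + text

def decode_value(blob):
    """Split a blob produced by encode_value back into (int, bytes)."""
    (number,) = HEADER.unpack_from(blob)
    return number, blob[HEADER.size:]

# Values are now plain bytes of varying length.
data = [
    (u"key1", encode_value(1, b"a")),
    (u"key2", encode_value(2, b"Hello")),
]

# With the dawg package installed, this list could be passed to
# dawg.BytesDAWG(data), and each looked-up value run through decode_value.
```

Because the string is the last field, no length prefix is needed. Records with several variable-length fields would need one.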

farscape2012 commented 8 years ago

Thanks again, kmike. How about the order of values? Is the order in which values are added preserved?

Does dawg have a subclass that supports dictionary values, not only bytes and int? That would make programming far easier.

kmike commented 8 years ago

@farscape2012 key/value pairs are sorted by their binary value. Internally there are no values: values are just appended to the corresponding keys after a separator, and the resulting strings are stored in the DAFSA. Storing them in the DAFSA makes sense when you consider that values can be compressed in a similar way to keys. So, for example, adding a unique integer as a value will make the DAFSA "explode" almost into a trie, which is inefficient.
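A rough pure-Python model of that layout, to show what "sorted by their binary value" implies for ordering. The separator byte and the exact encoding are assumptions for illustration only; the library's real internals may differ.

```python
# Hypothetical separator; the byte the library actually uses may differ.
SEP = b"\x01"

# Pairs in insertion order: "zebra" is added to key1 before "apple".
pairs = [(u"key1", b"zebra"), (u"key1", b"apple"), (u"key2", b"b")]

# Model: each pair becomes one byte string, key + separator + value,
# and the set of strings is kept in sorted (binary) order.
combined = sorted(key.encode("utf8") + SEP + value for key, value in pairs)

# Reading back the values for key1 follows binary order, not insertion order.
prefix = b"key1" + SEP
values_for_key1 = [c[len(prefix):] for c in combined if c.startswith(prefix)]
```

In this model the values for `key1` come back as `b"apple"` then `b"zebra"`, so insertion order is not something to rely on.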

If you want to attach arbitrary data to keys then a DAFSA is likely the wrong data structure. You may try e.g. https://github.com/pytries/marisa-trie or https://github.com/pytries/hat-trie. With marisa-trie you have a unique ID per key, 0 <= key_id < len(trie); to store arbitrary data, just create a Python list of the same length as the trie and put each value at its key_id index. HAT-Trie supports Python objects as values natively.
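The side-list idea can be sketched as follows. To keep the example self-contained, a plain dict stands in for the trie's key-to-ID mapping; with marisa-trie the IDs would come from the trie itself. The function name `build_side_table` is hypothetical.

```python
def build_side_table(key_ids, pairs):
    """Place each value at the index given by its key's ID."""
    values = [None] * len(key_ids)
    for key, value in pairs:
        values[key_ids[key]] = value
    return values

# Stand-in for the trie: maps each key to a unique ID in range(len(key_ids)).
key_ids = {u"key1": 0, u"key2": 1, u"key3": 2}

# Values can be arbitrary Python objects, including dicts.
pairs = [
    (u"key2", {"count": 2, "text": "b"}),
    (u"key1", {"count": 1, "text": "a"}),
    (u"key3", {"count": 3, "text": "c"}),
]

values = build_side_table(key_ids, pairs)

# Lookup: resolve the key to its ID, then index the list.
looked_up = values[key_ids[u"key2"]]
```

The trie then stores only keys, and the values live in an ordinary list that can be persisted separately, for example with `pickle`.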

farscape2012 commented 8 years ago

@kmike Thanks again. Good to know that the values are appended, which means order remains. In my case, the integers are not unique; they are just arbitrary. In my case I need to save values for a key. For now, for each key there are three types of values, an integer (arbitrary) and a string. Later the number of elements in a value may increase.

Summarizing what you @kmike and @superbobry have suggested, I could proceed in two ways:

  1. Use BytesDAWG and append values to the key (the order remains). The values should be converted into bytes manually.
  2. Continue using RecordDAWG like I did, in which case the memory usage is not that efficient.

Considering scalability, speed, and memory efficiency, which method do you suggest?

Thanks. BTW, I had checked marisa-trie before I started to use dawg. I felt that DAWG is much easier to use and has better functionality.