pytries / marisa-trie

Static memory-efficient Trie-like structures for Python based on marisa-trie C++ library.
https://marisa-trie.readthedocs.io/en/latest/
MIT License
1.03k stars 91 forks source link

Weight Ordering Doesn't Seem to Work? #23

Closed EliFinkelshteyn closed 9 years ago

EliFinkelshteyn commented 9 years ago

I've tried this with both RecordTrie and BytesTrie and a number of different formats, but can't get things to work no matter what I do. keys = ['foo', 'foo1', 'foobar', 'bar'] values = [(1,), (2,), (3,), (4,)] fmt = str("<H") trie = marisa_trie.RecordTrie(fmt, zip(keys, values), order=marisa_trie.WEIGHT_ORDER) trie.items(u'') >>> [(u'foo1', (4,)), (u'foobar', (3,)), (u'foo', (1,)), (u'bar', (2,))]

I tried adding a weights parameter as well, which isn't documented, but does seem supported in the code. It looks like it does something, because if an iterable with items that can't be converted to floats is passed in, it breaks. Nevertheless, values are still not returned in weight order:

trie = marisa_trie.RecordTrie(fmt, zip(keys, values), order=marisa_trie.WEIGHT_ORDER, weights=[1,2,3,4]) trie.items() >>> [(u'foo1', (4,)), (u'foobar', (3,)), (u'foo', (1,)), (u'bar', (2,))]

Am I missing something here? Is there some way other way to set weight?

kmike commented 9 years ago

Hi @EliFinkelshteyn,

This feature of marisa-trie C++ library is confusing - see http://marisa-trie.googlecode.com/svn/trunk/docs/readme.en.html, "Node Order" section. It seems 'order' sets an order in which nodes are stored, not necessarily the order in which values are returned.

If you set order to LABEL_ORDER it is guaranteed values are returned alphabetically. But if you set order to WEIGHT_ORDER and pass weights then values with higher weights will be looked up faster; there is nothing in marisa-trie C++ library docs about returned values order in this case. Also, MARISA doesn't store these weights, they are used at build time to arrange nodes.

kmike commented 9 years ago

I'm not sure it is possible to do better than having a separate array with weights and sorting the results on-demand. The best way to map values to weights is to use trie.key_id and trie.restore_key methods; they are tricky with BytesTrie and RecordTrie though.. Maybe instead of array of weights you can use an array of ranks.

EliFinkelshteyn commented 9 years ago

Oh, odd. OK, thanks!