pytries / marisa-trie

Static memory-efficient Trie-like structures for Python based on marisa-trie C++ library.
https://marisa-trie.readthedocs.io/en/latest/
MIT License
1.04k stars 92 forks source link

``BinaryTrie`` #29

Closed liori closed 8 years ago

liori commented 8 years ago

BinaryTrie: a Trie that does not attempt to encode its keys as UTF8.

I am storing lots of SHA hashes in many large Tries (hundreds of megabytes each, potentially hundreds of such tries at any time). I used to store them as a hex digest, but then noticed that my Tries are often significantly smaller when I use a binary encoding. Savings of tens of gigabytes in my use-case. Except… marisa_trie forces all keys to be unicode, so it fails with '\0'.encode('utf8').

This pull request is an attempt to implement a Trie which is binary-safe in keys. BinaryTrie basically a stripped-down Trie class, where I removed every .encode('utf8') call and fixed some small bugs related to handling \0. Tests are just a slightly edited copy of Trie tests.

I haven't tested the code under any python other than 2.7 and 3.4, as only these two are available on my system. I am also not sure yet if the implementation is correct. I will test it more; now I'm just requesting comments to the implementation.

EliFinkelshteyn commented 8 years ago

For what it's worth, this would be useful to me as well. Any updates on whether this might be merged?

superbobry commented 8 years ago

I'm okay with this change. Will review the code later this week.

@kmike what do you think?

kmike commented 8 years ago

Yeah, I think that's a nice feature. It'd be good to check that there are no serious performance regressions (a few % is fine).

liori commented 8 years ago

Hi, I've changed NoPrefixGiven into None, added the necessary .cpp files and added myself to the authors file. I hope I haven't broken anything in the process.

superbobry commented 8 years ago

Everything looks good. Thank you very much, @liori!

@kmike I'll do the benchmarks after the merge.

kmike commented 8 years ago

Thanks @liori and @superbobry!

liori commented 8 years ago

One more thing missing—documentation update. I'll try to prepare something over the weekend.

superbobry commented 8 years ago

I've wanted to port the current docs to Sphinx for quite a while, so if you have the time, don't hesitate :)