michaelsproul / rust_radix_trie

Fast generic radix trie implemented in Rust
https://docs.rs/radix_trie/
MIT License
187 stars 33 forks source link

Let any vector be the key of the trie. #28

Open Albibek opened 8 years ago

Albibek commented 8 years ago

The idea: The most natural thing for the prefix implementation is Vec<T>. The Trie is currently implemented through serialization of the key value into Vec<u8>. Could be great if Vec<T> was usable as 'plain' key, without making it into Vec<u8>, probably with some limitations to T. Such limitations could probably be some kind of TrieKeyItem trait, meaning that a type can be a single item of a key making TrieKey into TrieKey<Item=Something>. This brings alot of convenience and univerality to the datastructure. Strings, for example could be stored as bytes or unicode symbols depending of user needs. The Vec<u8> in this case could be made into some external thing and only be used for types that could not be easily converted into corresponding vector implementation.

It seems to me that first step could be to "just" convert u8 to some internal generic type, probably not seen to user, but being replaced to Item type when the Vec(Iterable?, Indexable?) has been provided as a key.

The fallback: In the case the idea below is impossible(in the context of this library of course), we could probably try implementing TrieKey for Vec<T: TrieKey> at least. That will probably require some unneeded computation when converting from/into such vectors but will still be very useful.

michaelsproul commented 8 years ago

This is a good idea. We just need to think of a way to do it simply and efficiently. One option would be to encode each portion of the key as bytes and store something like a Vec<Vec<u8>> on each "edge" of the trie (the TrieNode.key field at the moment). We would have to deal with keys getting cut in half under this scheme, as its quite natural to want a trie keyed by something like Vec<String>. For example, for URL routing, you could encode a path like "/download/file1.txt" as vec!["download", "file1.txt"], which would split a key if "/download/file2.txt" were already in the trie.

Another way I've thought about implementing this is to define a separator byte that can be used to delineate the different elements of the vector. For filesystem and URL paths, the byte encoding of / could be used, as it's guaranteed not to appear in paths.

Albibek commented 8 years ago

Well, my thought is that is more of a user's concern to decide the split point of his/her data. Wouldn't it be more natural and useful to ask a library user to provide data already split as Vec or, if possible as an Iterable object(it is possible btw?). In the example of Vec I think we should consider user wanting his strings to be the atomic part of the tree. If user wanted his strings to be split by some byte, or any other criteria, it's his/her decision to give us a Vec or Vec where strings are split properly already.

michaelsproul commented 8 years ago

Maybe something like the sequence trie API? https://docs.rs/sequence_trie/0.2.1/sequence_trie/

Albibek commented 8 years ago

That depends, if the radix trie itself is compatible with iterable structures. The other question is effectiveness of iteration-only structure and it's compatibility with different optimizations done by llvm, like vectorizations etc.