Open manthanchauhan opened 5 years ago
I'd like to contribute code for this, if required.
I’m always happy to review pull requests though adding a ‘compressed’ representation to the library is rather involved change so this might benefit from discussion of the design before code is written. There are non-obvious cases to consider such as how would the data structure look with trie['foo'] = trie['foobar'] = trie['foobarbaz'] = 42
.
Lastly, you should be looking at https://github.com/mina86/pygtrie repository since that’s where the latest code resides.
One solution for the example you provided is, node decomposition, as shown, trie['foobarbaz'] = 42
---foobarbaz: 42
trie['foobar'] = 42 no we will perform sub-string matching of 'foobar' starting from the keys at root node. From the first character in 'foobar' we'll find a key in the node such that it has the same corresponding characters.
in this example, 'foobar' matched with 'foobarbaz' upto 5th position, therefore, the root node breaks at the 5th position creating,
---foobar: 42
|-- baz: 42
after trie['foo'] = 42,
---foo: 42
|-- bar: 42
|-- baz: 42
It is to be noted that, at a node, no two keys start form same character and hence time complexity of addition is unaffected after compression.
In the above trie, if I perform, trie['foobie'] = 9
the resulting structure would be,
---foo: 42
|-- b
|--ie: 9
|--ar: 42
|-- baz: 42
ie, node with key 'bar' decomposed at 0th position ('b').
What if I then did trie['bar'] = 42
? If I understand you correctly, you suggest that the root node would keep ‘foo’ as the key in children
dictionary but that won’t quite work and goes opposed to how trie is meant to behave.
If you do trie['bar'] = 42, the root node will now have two children, one with key 'foo' and another with key 'bar'. I don't understand your point, how does it violates any behaviour of a trie.
When a lookup for ‘foo’ is done, trie’s behaviour is to look for a child ‘f’ of the root node. If root’s node children wolud hold ‘foo’ and ‘bar’, you’d never find ‘foo’ in there.
Look-ups would be implemented similar to additions, at each node we find the key which has the same first character as that of the query,
If I were to look-up for 'foobar' in the following trie,
---foo: 42
|-- b
|--ie: 9
|--ar: 42
|-- baz: 42
at root node 'foo' starts with an 'f' just like 'foobar', this indicates that 'foobar' must belong to the subtrie at 'foo'.
in the 'foo' subtrie now we lookup for 'bar', this time 'b' matches with 'bar',
now in 'b' subtrie we look-up for 'ar' where it matches with 'ar', since the query has been found, we stop here.
Yes, at every node, we need to traverse to all the children until we find the matching first character, therefore time complexity increases. But, lots of storage is saved which is often helpful for storage concerned applications.
The _Node
object has two fields: children
and value
. In your diagram, what does:
---foo: 42
|-- b
correspond to? What’s the value of children
and value
fields in the root node?
Because if the root node has children
field equal to a {'foo': <some-node-object>}
than there is no efficient way to find the foo
key when looking for foobar
since the way trie works is that you will look for a child under f
.
When I say that design discussion may be in order, I mean discussion of how each kind of node is represented. What are children
and value
fields values and how are key lookup, addition and deletion performed.
Your interpretation of children field is correct.
after compression, now trie keys are strings and hence now at each node, rather than looking of a character like 'f' we look for strings beginning with 'f' like 'foo',
at each node, log(N) complexity is required to find a key, where N is the total number of children of that node. Yes the operations are not time efficient, but this is the cost of the saved memory space.
If I'm right, combining each pair of nodes saves 56B.
at each node, log(N) complexity is required to find a key, where N is the total number of children of that node.
I don’t see how that’s correct. The actual complexity is linear to the length of the prefix. Recall that Python uses hash maps for dictionaries, not binary search trees. As a result, when looking for child ‘foobar’ you have to first check ‘f’, than ‘fo’, then ‘foo’ etc.
At that point it’s simpler to just have a flat dictionary and do key lookups by doing longer and longer prefix matches.
Actually, at one node there will be only one key starting from 'f' or any other character. In the following trie,
---foo: 42
|-- bar: 42
if I do trie['foob'] = 9
this happens,
---foo: 42
|-- b: 9
|--ar: 42
the node with key 'bar' decomposes at 'b' to avoid having both 'bar' and 'b' as keys.
Once we match the first character of query (in O(logN) ), we can simply traverse the matched key to find out upto which position it matches with the query ie, 'foo' with 'foob' upto 2nd position, then in the subtrie we search for remaining part of the query ie. the remaining 'b' of 'foob'.
So then also if I add trie['bar'] = 42
the root node gets decomposed?
no, in this case the root node will have one more key as 'bar', due to the fact that no existing key of root node matches with 'b'.
So again, how do you look up a key in such a node?
If sorting is enabled, as done by the function enable_sorting()
, we can apply binary search on all the keys of a node to search that key which starts with a particular character c, where c is the first character of query string.
otherwise, if sorting is not enabled we have to perform a linear search.
Do note that, query string will change for each node, depending upto which character the match occurs in previous node.
If sorting is enabled, as done by the function
enable_sorting()
, we can apply binary search on all the keys of a node to search that key which starts with a particular character c, where c is the first character of query string.otherwise, if sorting is not enabled we have to perform a linear search.
I don’t think either of those is acceptable. The first option makes lookup O(n log n) operation; the second makes it O(n) where n is number of children. Meanwhile, the idea behind a trie is to have O(k) lookup where k is the length of the key.
I don’t see getting away from children being keyed by a single character (in case of CharTrie
). The path could be compressed but that would need to be for the remaining of the prefix, something like:
trie['foo'] = 1
trie['foobar'] = 2
trie['foobarbaz'] = 3
trie['bar'] = 4
\+- f -> (oo) -> 1
| \-- b -> (ar) -> 2
| \-- b -> (az) -> 3
\- b -> (ar) -> 4
The question then is how to encode and handle the compressed path.
Once sorting is enabled, look-ups should have a worst-case time complexity of O( k log n ), and that will be the case when the trie remains unaffected after compression. For an average case, the time complexity should be O( log n ) + O( k ).
You suggest that we use only the first character of a path as a key, right? which should save the search time of a key.
On Tue, May 28, 2019, 09:03 Manthan Chauhan notifications@github.com wrote:
Once sorting is enabled, look-ups should have a worst-case time complexity of O( k log n ), and that will be the case when the trie remains unaffected after compression. For an average case, the time complexity should be O( log n ) + O( k ).
When sorting is enabled children are not stored sorted. A hash table is still used. With sorting enabled children are sorted each time items are iterated over.
As such, there is no way to take advantage of sorting being enabled during key lookup. The chdren would need to be sorted each time which means lookup takes O(n log n).
You suggest that we use only the first character of a path as a key, right?
which should save the search time of a key.
Yes. That's how tries work.
Thinking about your idea for character keys,
If sorting is enabled, as done by the function
enable_sorting()
, we can apply binary search on all the keys of a node to search that key which starts with a particular character c, where c is the first character of query string. otherwise, if sorting is not enabled we have to perform a linear search.I don’t think either of those is acceptable. The first option makes lookup O(n log n) operation; the second makes it O(n) where n is number of children. Meanwhile, the idea behind a trie is to have O(k) lookup where k is the length of the key.
I don’t see getting away from children being keyed by a single character (in case of
CharTrie
). The path could be compressed but that would need to be for the remaining of the prefix, something like:trie['foo'] = 1 trie['foobar'] = 2 trie['foobarbaz'] = 3 trie['bar'] = 4 \+- f -> (oo) -> 1 | \-- b -> (ar) -> 2 | \-- b -> (az) -> 3 \- b -> (ar) -> 4
The question then is how to encode and handle the compressed path.
at each node, we can use the first character of the query string to locate a key-value pair, here, each value is a collection of a string and a node pointer.
once we locate a key-value pair, we traverse the associated string upto the position it matches with the query string, which would be equivalent to traversing that much nodes of an uncompressed CharTrie. The remaining part of the query string will be passed to the pointed _Node.
This sounds workable. I only wonder how would it interact with interfaces
such as walk_towards
.
sorry but can you please provide some references for this walk_towards
interface?
This sounds workable. I only wonder how would it interact with interfaces such as
walk_towards
.
For this to work, we'll need to modify the _path_from_key() functions, because in this proposed model of compact tries, the key
of a node is not essentially it's path
from the root.
trie['foo'] = 1 trie['foobar'] = 2 trie['foobarbaz'] = 3 trie['bar'] = 4
+- f -> (oo) -> 1 | -- b -> (ar) -> 2 | -- b -> (az) -> 3 - b -> (ar) -> 4
in this trie structure, if we consider the key foobarbaz
, the path for this node should be fbb
and not foobarbaz
.
with the above described path definition walk_towards
function will work unaffected.
I don't think that'll work. Whether compressed representation is used or not, it should be transparent to the user so walk_towards
should go through the whole foobarbaz
regardless of how the internal node structure looks like.
I don't think that'll work. Whether compressed representation is used or not, it should be transparent to the user so
walk_towards
should go through the wholefoobarbaz
regardless of how the internal node structure looks like.
Then we can have to do a little modification in the walk towards
function.
If we consider that the value at each children key is a tuple(string, Node),
then in the walk_towards
function we have to increment pos
in each iteration by,
pos += 1 + len( node.children.get( path [pos ] ) [0] )
Sorry about delay.
If we consider that the value at each children key is a tuple(string, Node), then in the
walk_towards
function we have to incrementpos
in each iteration by,
pos += 1 + len( node.children.get( path [pos ] ) [0] )
That doesn't sound viable for two reasons.
First of all, the result of walk_towards
should not depend on the internal structure of the trie. Whether compressed paths are used or not, iterating over the steps should yield all keys on the path to given key. In other words, the following test needs to pass:
import pygtrie
t = pygtrie.CharTrie(foobar=42)
steps = t.walk_towards('foobar')
res = ' '.join('<%s>' % step.key for step in steps)
assert res == '<> <f> <fo> <foo> <foob> <fooba> <foobar>'
Second of all, if the motivation is to reduce amount of memory used, I think only children with compressed paths should be a (sub-path, node)
tuple. This doesn't fundamentally change anything other than code needing to handle both cases.
There's also another aspect that ideally, the following test should pass as well:
from itertools import islice
import pygtrie
t = pygtrie.CharTrie(foobar=42)
res = ''
for step in t.walk_towards('foobar'):
res += '<%s>' % step.key
if 'fred' in t:
del t['fred']
del t['foobaz']
else:
t['fred'] = t['foobaz'] = 24
assert res == '<><f><fo><foo><foob><fooba><foobar>'
In other words, converting nodes between compressed and not-compressed should not break walk_towards
method.
Okay, but for compressed tries to pass the walk_through()
test, some of the returned path will not point to any Node
.
That is, if the trie contains,
foobar=5 foob=6
then while walk_through()
the path foo
will be returned without any Node
, if that is accepted.
Correct. This is the desired behaviour because it's the current behaviour (currently the method goes through all nodes regardless of whether they have values assigned to them or how many children they have) and methods should not expose details of the implementation.
I don't think that'll work. Whether compressed representation is used or not, it should be transparent to the user so
walk_towards
should go through the wholefoobarbaz
regardless of how the internal node structure looks like.Then we can have to do a little modification in the
walk towards
function.If we consider that the value at each children key is a tuple(string, Node), then in the
walk_towards
function we have to incrementpos
in each iteration by,
pos += 1 + len( node.children.get( path [pos ] ) [0] )
For that, we can use this trick and inbetween every two valid Node
we can pass,
len(node.children.get( path [pos] ) [0] )
None
nodes to make up for the invalid keys inbetween.
I mean without actually traversing the path character by character.
char_trie['he'] = 3 char_trie['him'] = 4
---h
|e: 3
|i
|__m: 4
can be compressed as, ---h
|e: 3
|i m:4
since removing each redundant node saves atleast 56 bytes, this can provide high memmory optimisation to a data structure that is used for memmory optimized storage.