mmihaltz / pysettrie

Python 3 package supporting efficient storage and querying of sets of sets using the trie data structure. Supports finding all the supersets/subsets of a given set within a collection of sets. Also includes a trie-based mapping container where the keys are sets.
GNU Lesser General Public License v3.0

SortedLists instead of hash tables cause higher-complexity super/subset lookups #14

Open GregoryMorse opened 4 years ago

GregoryMorse commented 4 years ago

The paper on this topic was weak in its complexity analysis: it tried to claim a constant bound "in practice" when it should have given specific worst-case and average-case complexities. It did well in attracting a few citations for the novelty of an idea that had not been published before, but the mathematical analysis was poor.

The paper treated sorting as an optional, not a necessary, part of an implementation. But with sorting, given an m-length query set, c the number of children of a given trie node, and d the number of nodes in the subtree searched (where c and d differ at each step), the worst case for superset queries with linear search is `O(m*c*d)`: m depth, c comparisons per node, d nodes searched. With binary search it is `O(m*d*log c)`, and with hash tables `O(m*d)`. For subset queries, linear search gives `O(m^2*c*d)`: m depth, m*c comparisons per node, d nodes searched. With binary search it is `O(c*d*m*log m)`, and with hash tables it is `O(m*d*min(m, c))`, where `O(min(m, c))` is the complexity of a set intersection, at least for the Python `set()` object. Binary search could theoretically be worse in the average case depending on the data distribution, but in general it will outperform linear search.
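To make the per-node cost concrete, here is a minimal sketch (my own illustration, not pysettrie's actual API) of a set-trie node and a superset query; the container used for `children` is exactly where the `c`, `log c`, or `O(1)` factor above comes from:

```python
class Node:
    """Illustrative set-trie node; not pysettrie's actual class."""
    def __init__(self):
        self.children = {}      # label -> Node; a dict gives O(1) expected lookup
        self.flag_last = False  # True if a stored set ends at this node

def contains_superset(node, items, i=0):
    """Return True if some stored set contains all of items[i:].
    `items` must be sorted, matching the order used at insertion."""
    if i == len(items):
        return True  # all query elements matched; any path below ends in a stored set
    # Exact-match step: O(1) expected with a dict, O(log c) with a sorted
    # container, O(c) with a linear scan -- the per-node factor in the
    # worst-case bounds above.
    child = node.children.get(items[i])
    if child is not None and contains_superset(child, items, i + 1):
        return True
    # A stored superset may interleave extra elements smaller than
    # items[i]; descend into those children, still looking for items[i].
    return any(contains_superset(c, items, i)
               for label, c in node.children.items() if label < items[i])
```

Note that the scan over children with smaller labels is needed regardless of container; the container choice only changes the cost of the exact-match lookup, though a sorted container can at least restrict that scan via bisection.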

The average cases, which matter more, are harder to calculate because one has to reason about both the distribution of the trie and the distribution of the query set. Being exact about this is quite difficult, since even metrics like total trie depth or average children per trie node may not correctly represent a uniformly random distribution, or something like a Gaussian distribution, which is also a useful random average case.

I think switching to a built-in Python `set`/`dict`, dropping SortedList, and simply passing the output of the iteration functions through `sorted()`, which is O(n log n), would yield significantly faster results. Inserting in sorted order should be kept, as it will help minimize the size of the trie in memory.
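A sketch of that change, reusing the hypothetical `Node` above: insertion stays in sorted order to maximize prefix sharing, children live in a plain `dict`, and ordering is recovered once at output time:

```python
def insert(root, s):
    """Insert set s; elements go in sorted order to maximize prefix sharing."""
    node = root
    for x in sorted(s):
        node = node.children.setdefault(x, Node())
    node.flag_last = True

def iter_sets(node, path=()):
    """Yield stored sets as tuples, in arbitrary (hash) order."""
    if node.flag_last:
        yield path
    for label, child in node.children.items():
        yield from iter_sets(child, path + (label,))

# Ordering is recovered once, in O(n log n), instead of on every operation:
# results = sorted(iter_sets(root))
```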

GregoryMorse commented 4 years ago

After having thought about this for a while, the correct solution is simply to expose this as an `__init__` option. A hash table/dictionary (`std::unordered_set` in C++) or sorted lists (`std::set` in C++) would allow different modes of operation, with better or worse complexity depending on the data in the set-trie and the subset/superset queries being invoked.
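A hypothetical shape for such an option (pysettrie's real constructor differs; `ordered`, `add`, and the container factory are illustrative names), with `Node` here being a variant of the earlier sketch that takes the chosen factory:

```python
from sortedcontainers import SortedDict  # the package pysettrie's SortedList comes from

class Node:
    __slots__ = ("children", "flag_last")
    def __init__(self, factory):
        self.children = factory()  # dict or SortedDict, chosen per trie
        self.flag_last = False

class SetTrie:
    def __init__(self, iterable=None, *, ordered=True):
        # ordered=True : SortedDict children (like std::set/std::map in C++):
        #   O(log c) per-child lookup, in-order iteration for free.
        # ordered=False: dict children (like std::unordered_set/unordered_map):
        #   O(1) expected lookup; sort results on output instead.
        self._factory = SortedDict if ordered else dict
        self._root = Node(self._factory)
        for s in iterable or ():
            self.add(s)

    def add(self, s):
        node = self._root
        for x in sorted(s):  # sorted insertion kept in both modes
            if x not in node.children:
                node.children[x] = Node(self._factory)
            node = node.children[x]
        node.flag_last = True
```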