mmihaltz / pysettrie

Python 3 package supporting efficient storage and querying of sets of sets using the trie data structure. Supports finding all the supersets/subsets of a given set within a collection of sets. Also includes a trie-based mapping container where the keys are sets.
GNU Lesser General Public License v3.0

SortedLists instead of hash tables cause higher-complexity super/subset lookups #14

Open GregoryMorse opened 4 years ago

GregoryMorse commented 4 years ago

The paper on this topic was weak in its complexity analysis: it tried to claim a constant bound "in practice" when it should have given specific worst-case and average-case complexities. It did well in attracting a few citations for the novelty of an idea that had not been published before, but the mathematical analysis was poor.

The paper treated sorting as an optional, not a necessary, part of an implementation. But with sorting, given an m-length query set, c the number of children of a given trie node, and d the number of nodes in the subtree searched (where c and d differ at each step), the worst case for superset queries with linear search is `O(m*c*d)`: m depth, c comparisons per node, d nodes searched. With binary search it is `O(m*d*log c)`, and with hash tables `O(m*d)`. For subset queries, linear search gives `O(m^2*c*d)`: m depth, m*c comparisons per node, d nodes searched. With binary search it is `O(c*d*m*log m)`, and with hash tables it is `O(m*d*min(m, c))`, where `O(min(m, c))` is the complexity of a set intersection, at least for the Python `set()` object. Binary search could theoretically be worse in the average case depending on the data distribution, but in general it will outperform linear search.
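To make the per-node cost concrete, here is a minimal sketch (my own illustration, not pysettrie's actual API) of a set-trie node and a superset query; the container used for `children` is exactly where the `c`, `log c`, or `O(1)` factor above comes from:

```python
class Node:
    """Illustrative set-trie node; not pysettrie's actual class."""
    def __init__(self):
        self.children = {}      # label -> Node; a dict gives O(1) expected lookup
        self.flag_last = False  # True if a stored set ends at this node

def contains_superset(node, items, i=0):
    """Return True if some stored set contains all of items[i:].
    `items` must be sorted, matching the order used at insertion."""
    if i == len(items):
        return True  # all query elements matched; any path below ends in a stored set
    # Exact-match step: O(1) expected with a dict, O(log c) with a sorted
    # container, O(c) with a linear scan -- the per-node factor in the
    # worst-case bounds above.
    child = node.children.get(items[i])
    if child is not None and contains_superset(child, items, i + 1):
        return True
    # A stored superset may interleave extra elements smaller than
    # items[i]; descend into those children, still looking for items[i].
    return any(contains_superset(c, items, i)
               for label, c in node.children.items() if label < items[i])
```

Note that the scan over children with smaller labels is needed regardless of container; the container choice only changes the cost of the exact-match lookup, though a sorted container can at least restrict that scan via bisection.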

The average cases, which matter more, are harder to calculate because one has to reason about both the distribution of the trie and the distribution of the query set. Being exact about this is quite difficult, since even metrics like total trie depth or average children per trie node may not correctly represent a uniformly random distribution, or something like a Gaussian distribution, which is also a useful random average case.

I think switching to a built-in Python `set`/`dict`, dropping SortedList, and simply passing the output of the iteration functions through `sorted()`, which is O(n log n), would yield significantly faster results. Inserting in sorted order should be kept, as it will help minimize the size of the trie in memory.
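A sketch of that change, reusing the hypothetical `Node` above: insertion stays in sorted order to maximize prefix sharing, children live in a plain `dict`, and ordering is recovered once at output time:

```python
def insert(root, s):
    """Insert set s; elements go in sorted order to maximize prefix sharing."""
    node = root
    for x in sorted(s):
        node = node.children.setdefault(x, Node())
    node.flag_last = True

def iter_sets(node, path=()):
    """Yield stored sets as tuples, in arbitrary (hash) order."""
    if node.flag_last:
        yield path
    for label, child in node.children.items():
        yield from iter_sets(child, path + (label,))

# Ordering is recovered once, in O(n log n), instead of on every operation:
# results = sorted(iter_sets(root))
```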

GregoryMorse commented 4 years ago

After having thought about this for a while, the correct solution is simply to expose this as an `__init__` option. A hash table/dictionary (`std::unordered_set` in C++) or sorted lists (`std::set` in C++) would allow different modes of operation, with better or worse complexity depending on the data in the set-trie and the subset/superset queries being invoked.
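A hypothetical shape for such an option (pysettrie's real constructor differs; `ordered`, `add`, and the container factory are illustrative names), with `Node` here being a variant of the earlier sketch that takes the chosen factory:

```python
from sortedcontainers import SortedDict  # the package pysettrie's SortedList comes from

class Node:
    __slots__ = ("children", "flag_last")
    def __init__(self, factory):
        self.children = factory()  # dict or SortedDict, chosen per trie
        self.flag_last = False

class SetTrie:
    def __init__(self, iterable=None, *, ordered=True):
        # ordered=True : SortedDict children (like std::set/std::map in C++):
        #   O(log c) per-child lookup, in-order iteration for free.
        # ordered=False: dict children (like std::unordered_set/unordered_map):
        #   O(1) expected lookup; sort results on output instead.
        self._factory = SortedDict if ordered else dict
        self._root = Node(self._factory)
        for s in iterable or ():
            self.add(s)

    def add(self, s):
        node = self._root
        for x in sorted(s):  # sorted insertion kept in both modes
            if x not in node.children:
                node.children[x] = Node(self._factory)
            node = node.children[x]
        node.flag_last = True
```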