takemaru / graphillion

Fast, lightweight graphset operation library
Other
466 stars 40 forks source link

Store Path Counts in Graphillion #40

Closed seanlaw closed 5 years ago

seanlaw commented 5 years ago

Let's say that I have 8 paths/sequences where each number represents a node:

x = [[1, 2, 3, 4, 5], 
     [8, 9, 10], 
     [4, 6, 8, 10], 
     [1, 2, 3, 4, 5], 
     [8, 9, 10], 
     [11, 13, 14, 16],
     [1, 2, 3],
     [9, 10]
    ]

Is there a way to push these into a graph using Graphillion and then be able to keep track of the fact that paths [1, 2, 3, 4, 5] and [8, 9, 10] each have a count of 2 (i.e., these two paths are observed twice) while the other 6 paths have a count of 1?

In my case, I don't know ahead of time that, say, [1, 2, 3, 4, 5] is counted twice since the data is large so I would need a way to update the count as new data is being read. Is this possible with Graphillion?

takemaru commented 5 years ago

Let me know your conditions in detail.

seanlaw commented 5 years ago

Do you have too many paths to be stored as dict? (i.e. billions of paths)

Yes, the paths are in the millions to likely billions and I want to reduce the memory consumption using ZDD

Approximated counts are acceptable? Or exact counts are required for all paths?

Approximated counts are acceptable if there is a guarantee in terms of error bounds. I would like to know like to know relative percentages or be able to ask "which paths make up the top 10-20% of paths". Of course, exact counts is preferred if possible.

Is the base/universe graph is fixed or dynamic?

Universe is fixed. At least, I can determine the universe of nodes beforehand.

Before I forget to ask, are negative integers allowed as node numbers? That is, can I use [-1, 3, 4, -8] as a sequence input? I don't think I've ever gotten strings to work as node names either (i.e., ['a', 'b', 1, 'c']

takemaru commented 5 years ago

I think that counting sketch is an appropriate technique for your requirements. Counting sketch is a data structure used to approximately maintain many counters in small memory. This technique has been developed in computer networking community for flow monitoring. There are some open-sourced implementation, as follows:

A related research paper is found at https://dlnext.acm.org/doi/abs/10.1145/3341302.3342076 .

takemaru commented 5 years ago

Negative numbers can be used as node names in Graphillion. Actually, any hashable objects can be used.

seanlaw commented 5 years ago

I don't really understand the counting sketch work but it sounds like you would not recommend Graphillion for this?

takemaru commented 5 years ago

No. ZDD, the internal structure of Graphillion, represents a set of sets. From the mathematical definition, elements in a set cannot be duplicated, i.e., every element accounts for one or zero in a set. This implies that ZDD cannot be used to count elements, though ZDD is good at computing the size of a set (the number of elements in a set).

In other words, if you want to count the number of subsets (or that of subgraphs) satisfying given constraints, ZDD is a good tool. However, if you want to associate a number with each element (with each path), ZDD doesn't work.

takemaru commented 5 years ago

There is a solution to utilize ZDD as counters, that is, ZDD vector or ZDD base (an extension of ZDD, which is not implemented in Graphillion). However, counting sketch is more efficient if approximation is allowed.

seanlaw commented 5 years ago

That makes sense. Thank you!