Improve lookahead in IntegerListsLex

nthiery commented 9 years ago

IntegerListsLex uses a depth first search in a prefix tree, with lookahead to prune branches. The purpose of this ticket is to improve the lookahead algorithm in two directions:

Detect more (all?) branches that can be cut.
Better time complexity; this probably will involve identifying important invariants and caching them.

Example:: even just for n=1000 this is taking a long time:

    sage: n = 100
    sage: IntegerListsLex(binomial(n,2)-1, length=n, min_slope=1).list()

The following also take a long time, which should be optimized:

    sage: IntegerListsLex(10^100, max_length=1).list()
    sage: IntegerListsLex(499499, length=1000, min_slope=1).list()

One could hope that one could reach an average complexity roughly linear in O(log(max_n), max_length) per list generated.

Concrete steps:

If we want the floor function to satisfy the max_slope constraint, we need to do "reverse smoothing", and the floor function then depends on the interval [0,...,k) over which we want to consider it. The same holds for the ceiling function if we want it to satisfy the min_slope constraint. Let thus f_k and c_k be respectively the floor and ceiling functions for the interval [0,...k)

Let's ignore the constraints on the length and on n for now. Then, f_k and c_k are valid lists. Prove the following remark:

For any n between sum(f_k) and sum(c_k) there exists a valid list of length k and sum n.

Putting back the length and sum constraints, there exists a valid list if and only if the interval I_k = [sum(f_k), sum(c_k)] intersects the interval [min_n, max_n] for some k between min_length and max_length.
Extend Envelope with a method .sum(k) which computes sum(f_k).

It should have appropriate algorithmic and caching so that the overall time complexity of computing sum(f_k) for all k<=l is linear in l.

Trick: store, for each j, the smallest i<=j such that between i and j the slope is exactly max_slope. The point is that computing the sum of the parts of f_k between i and j is constant time. Use that incrementally to derive quickly sum(f_k) from sum(f_j) for some j<=k.

Suggestion for a faster development cycle: experiment with this by extracting the current code for Envelope in a separate file.
Use the above characterization and the .sum(k) methods to implement a method IntegerListsLex.is_empty which terminates with guaranteed result on non degenerate cases. This is essentially what the current lookahead method does, but starting from 0 and with guaranteed results thanks to the reverse smoothing.

Determine precisely the degenerate cases where is_empty may take an arbitrarily long time (something like: the ceiling can be 0 for an arbitrary long time, without a known limit=0 and limit_start).

Decide on appropriate metrics to bound tightly the complexity in the other cases (something around max_length, max_n, limit_start).

As a side product, we probably want to store for later usage the information of which n admit a valid list, as a collection of non overlapping intervals. Then, deciding whether there exists a valid list of sum n for a given n can be achieved efficiently by dichotomy.
Assuming that l is a prefix of a valid list, find out how to determine for which i l+[i] is still a prefix of a valid list.

Potential building block: the choice of a prefix imposes more constraints on the floor and ceiling later on (our former adapt method), and we want to keep track of that. Given a prefix l, write f_k(l) be the list obtained by extending l to a list of length k using the floor f_k, with appropriate forward smoothing. Same for c_k(l). Write I_k(l) = [sum(f_k(l)), sum(c_k(l))].

Now, l is a prefix of a valid list if one of the I_k(l) intersect [min_sum, max_sum]. So, while increasing the length of l we want to do something like incrementally adapting the sequence of intervals I_k(l) to always make sure we stay within one of them.
Use this to reimplement lookahead(self) implementing the above, with a guaranteed result and good complexity. And if at all possible a constant time complexity when the constraints are simple (e.g. floor=0, ...).

This method should support _current_list=[] as well.
Update next accordingly (the logic should be slightly simpler thanks to the handling of the empty prefix).
If not implemented elsewhere: let the constructor accept a prefix as input, to start the iteration from there. Derive a next method as a side effect.
Remove integer_list_old at the final stage.

Depends on #18109

CC: @bgillesp

Component: combinatorics

Issue created by migration from https://trac.sagemath.org/ticket/18055

anneschilling commented 9 years ago

Description changed:

--- 
+++ 
@@ -10,6 +10,11 @@
     sage: n = 100
     sage: IntegerListsLex(binomial(n,2)-1, length=n, min_slope=1).list()

+The following also takes a long time, which should be optimized: + +```

sage: IntegerListsLex(10^100, max_length=1).list() +```

One could hope that, in non degenerate cases, one could reach an average complexity roughly linear in O(log(max_n), max_length) per

anneschilling commented 9 years ago

Description changed:

--- 
+++ 
@@ -10,10 +10,11 @@
     sage: n = 100
     sage: IntegerListsLex(binomial(n,2)-1, length=n, min_slope=1).list()

-The following also takes a long time, which should be optimized: +The following also take a long time, which should be optimized:

     sage: IntegerListsLex(10^100, max_length=1).list()
+    sage: IntegerListsLex(499499, length=1000, min_slope=1).list()

One could hope that, in non degenerate cases, one could reach an

nthiery commented 9 years ago

Description changed:

--- 
+++ 
@@ -21,3 +21,43 @@
 average complexity roughly linear in `O(log(max_n), max_length)` per
 list generated.

+Concrete steps:
+
+- Extend `Envelope` with a function:
+
+  ```
+      def area(k):
+          """Return the area under the envelope between `0` and `k`"""
+  ```
+
+  This functions should handle the "reverse smoothing" from position
+  `k` (so that the envelope satisfies both `min_slope` and `max_slope`
+  conditions between `1` and `k`), and have appropriate algorithmic
+  and caching so that the overall complexity of computing all
+  `area(k)` for `k<=l` is linear in `l`.
+
+- Replace:
+
+  ```
+      def _possible_m(self, m, ...):
+  ```
+  by:
+
+  ```
+      def lookahead(self):
+  ```
+
+  which shall detect whether `_current_list` has a chance to be a
+  prefix of some valid list. Using the above `area` method of the
+  envelopes, it should be possible to detect most dead branches.
+
+  This method should support `_current_list=[]` as well.
+
+  This method should store as much information as relevant to avoid
+  later recomputation for lower nodes in the tree. A minima, this
+  could possibly return/store a range for the next value, removing the
+  need for a separate `_m_interval` method.
+
+- Update `next` accordingly (the logic should be slightly simpler
+  thanks to the handling of the empty prefix).
+

nthiery commented 9 years ago

Description changed:

--- 
+++ 
@@ -50,6 +50,8 @@
   which shall detect whether `_current_list` has a chance to be a
   prefix of some valid list. Using the above `area` method of the
   envelopes, it should be possible to detect most dead branches.
+  Using dichotomy on the area / parts, it should be possible to
+  achieve good complexity in most cases.

   This method should support `_current_list=[]` as well.

jdemeyer commented 9 years ago

Description changed:

--- 
+++ 
@@ -17,7 +17,7 @@
     sage: IntegerListsLex(499499, length=1000, min_slope=1).list()

-One could hope that, in non degenerate cases, one could reach an +One could hope that one could reach an average complexity roughly linear in O(log(max_n), max_length) per list generated.

jdemeyer commented 9 years ago

comment:5

Can we please stop talking about "degenerate cases" without specifying what that means?

nthiery commented 9 years ago

comment:6

Replying to @jdemeyer:

Can we please stop talking about "degenerate cases" without specifying what that means?

This ticket involves some research, which includes in particular defining precisely what "degenerate" is. Until this research is done, we can't do anything but remain vague.

nthiery commented 9 years ago

Description changed:

--- 
+++ 
@@ -63,3 +63,4 @@
 - Update `next` accordingly (the logic should be slightly simpler
   thanks to the handling of the empty prefix).

+- Remove `integer_list_old` at the final stage.

jdemeyer commented 9 years ago

Dependencies: #18181

jdemeyer commented 9 years ago

Changed dependencies from #18181 to #18109

nthiery commented 9 years ago

Description changed:

--- 
+++ 
@@ -1,7 +1,7 @@
 `IntegerListsLex` uses a depth first search in a prefix tree, with lookahead to prune branches.
 The purpose of this ticket is to improve the lookahead algorithm in two directions:

-- Detect more branches that can be cut.
+- Detect more (all?) branches that can be cut.
 - Better time complexity; this probably will involve identifying important invariants and caching them.

 Example:: even just for n=1000 this is taking a long time:
@@ -17,50 +17,93 @@
     sage: IntegerListsLex(499499, length=1000, min_slope=1).list()

-One could hope that one could reach an -average complexity roughly linear in O(log(max_n), max_length) per -list generated. +One could hope that one could reach an average complexity roughly +linear in O(log(max_n), max_length) per list generated.

Concrete steps:

-- Extend Envelope with a function: +- If we want the floor function to satisfy the max_slope constraint,

we need to do "reverse smoothing", and the floor function then
depends on the interval [0,...,k) over which we want to consider
it. The same holds for the ceiling function if we want it to satisfy
the min_slope constraint. Let thus f_k and c_k be respectively
the floor and ceiling functions for the interval [0,...k)
def area(k):
"""Return the area under the envelope between 0 and k"""
Let's ignore the constraints on the length and on n for now.
Then, f_k and c_k are valid lists. Prove the following remark:
This functions should handle the "reverse smoothing" from position
k (so that the envelope satisfies both min_slope and max_slope
conditions between 1 and k), and have appropriate algorithmic
and caching so that the overall complexity of computing all
area(k) for k<=l is linear in l.
For any n between sum(f_k) and sum(c_k) there exists a valid
list of length k and sum n.

-- Replace:

Putting back the length and sum constraints, there exists a valid
list if and only if the interval I_k = [sum(f_k), sum(c_k)]
intersects the interval [min_n, max_n] for some k between
min_length and max_length.
def _possible_m(self, m, ...):
by: +- Extend Envelope with a method .sum(k) which computes sum(f_k).
def lookahead(self):
It should have appropriate algorithmic and caching so that the
overall time complexity of computing sum(f_k) for all k<=l is
linear in l.
which shall detect whether _current_list has a chance to be a
prefix of some valid list. Using the above area method of the
envelopes, it should be possible to detect most dead branches.
Using dichotomy on the area / parts, it should be possible to
achieve good complexity in most cases.
Trick: store, for each j, the smallest i<=j such that between
i and j the slope is exactly max_slope. The point is that
computing the sum of the parts of f_k between i and j is
constant time. Use that incrementally to derive quickly sum(f_k)
from sum(f_j) for some j<=k.
Suggestion for a faster development cycle: experiment with this by
extracting the current code for Envelope in a separate file.
+- Use the above characterization and the .sum(k) methods to
implement a method IntegerListsLex.is_empty which terminates with
guaranteed result on non degenerate cases. This is essentially what
the current lookahead method does, but starting from 0 and with
guaranteed results thanks to the reverse smoothing.
Determine precisely the degenerate cases where is_empty may take
an arbitrarily long time (something like: the ceiling can be 0 for
an arbitrary long time, without a known limit=0 and limit_start).
Decide on appropriate metrics to bound tightly the complexity in the
other cases (something around max_length, max_n, limit_start).
As a side product, we probably want to store for later usage the
information of which n admit a valid list, as a collection of non
overlapping intervals. Then, deciding whether there exists a valid
list of sum n for a given n can be achieved efficiently by
dichotomy.
+- Assuming that l is a prefix of a valid list, find out how to
determine for which i l+[i] is still a prefix of a valid list.
Potential building block: the choice of a prefix imposes more
constraints on the floor and ceiling later on (our former adapt
method), and we want to keep track of that. Given a prefix l,
write f_k(l) be the list obtained by extending l to a list of
length k using the floor f_k, with appropriate forward
smoothing. Same for c_k(l). Write `I_k(l) = [sum(f_k(l)),
sum(c_k(l))]`.
Now, l is a prefix of a valid list if one of the I_k(l)
intersect [min_sum, max_sum]. So, while increasing the length of
l we want to do something like incrementally adapting the sequence
of intervals I_k(l) to always make sure we stay within one of
them.
+- Use this to reimplement lookahead(self) implementing the above,
with a guaranteed result and good complexity. And if at all possible
a constant time complexity when the constraints are simple
(e.g. floor=0, ...).

This method should support _current_list=[] as well.
This method should store as much information as relevant to avoid
later recomputation for lower nodes in the tree. A minima, this
could possibly return/store a range for the next value, removing the
need for a separate _m_interval method.
- Update next accordingly (the logic should be slightly simpler thanks to the handling of the empty prefix).

+- If not implemented elsewhere: let the constructor accept a prefix as

input, to start the iteration from there. Derive a next method as
a side effect.
- Remove integer_list_old at the final stage.

sagemath / sage

Improve lookahead in IntegerListsLex #18055