sagemath / sage

Main repository of SageMath
https://www.sagemath.org

implement van der Hoeven's algorithm for relaxed multiplication of power series #34616

Open mantepse opened 2 years ago

mantepse commented 2 years ago

This is an experimental implementation of the algorithm presented in section 4.2. of van der Hoeven's "Relax but don't be too lazy".

CC: @tscrim @fchapoton

Component: combinatorics

Keywords: LazyPowerSeries

Author: Martin Rubey

Branch/Commit: u/mantepse/implement_van_hoeven_s_algorithm_for_relaxed_multiplication_of_power_series @ 1de1e6d

Issue created by migration from https://trac.sagemath.org/ticket/34616
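
For orientation, here is a toy sketch (plain Python, hypothetical names, not the ticket's code) of the plain lazy Cauchy product that a relaxed algorithm is meant to speed up: computing h_0, ..., h_n coefficient by coefficient in this way costs O(n^2) coefficient multiplications, whereas van der Hoeven's relaxed divide-and-conquer product brings this down to roughly O(M(n) log n).

# Toy baseline (not Sage code): the naive lazy Cauchy product, computing
# coefficient n of h = f*g on demand as h_n = sum_{k=0}^{n} f_k * g_{n-k}.
class LazyProduct:
    def __init__(self, f, g):
        self._f = f              # f and g map k to the k-th coefficient
        self._g = g
        self._cache = []

    def __getitem__(self, n):
        while len(self._cache) <= n:
            m = len(self._cache)
            self._cache.append(sum(self._f(k) * self._g(m - k)
                                   for k in range(m + 1)))
        return self._cache[n]

h = LazyProduct(lambda k: k, lambda k: 1)   # same f and g as in the timings below
print([h[i] for i in range(6)])             # [0, 1, 3, 6, 10, 15]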

mantepse commented 2 years ago

Branch: u/mantepse/implement_van_hoeven_s_algorithm_for_relaxed_multiplication_of_power_series

mantepse commented 2 years ago
comment:2

Unfortunately, the code does not work yet, because I am having trouble with some details in the article.


Last 10 new commits:

52ac7af  pyflakes observations
06ec3d0  Merge branch 'develop' of trac.sagemath.org:sage into t/32367/replace_lazy_power_series-32367
4172688  Merge branch 'u/mantepse/implement_arithmetic_product_of_lazy_symmetric_functions' of trac.sagemath.org:sage into t/32367/replace_lazy_power_series-32367
41ca99b  Merge branch 'u/mantepse/replace_lazy_power_series-32367' of trac.sagemath.org:sage into t/34470/categories_lazy_series-34470
fb3a5cd  Merge branch 'u/mantepse/categories_lazy_series-34470' of trac.sagemath.org:sage into t/34552/lazy_series_test_suite-34552
855d2bf  remove unused variable
75c275c  add documentation and doctests for _approximate_order
5393242  fixes for pycodestyle and pyflakes
081d0e5  Merge branch 'u/mantepse/lazy_series_test_suite-34552' of trac.sagemath.org:sage into t/34616/implement_van_hoeven_s_algorithm_for_relaxed_multiplication_of_power_series
87d9db1  non-working first attempt
mantepse commented 2 years ago

Author: Martin Rubey

mantepse commented 2 years ago

Description changed:

--- 
+++ 
@@ -1 +1 @@
-
+This is an experimental implementation of the algorithm presented in section 4.2. of van der Hoeven's "Relax but don't be too lazy".
mantepse commented 2 years ago

Changed keywords from none to LazyPowerSeries

mantepse commented 2 years ago

Commit: 87d9db1

mantepse commented 2 years ago
comment:3

I think the mistake is that the caching is not implemented correctly. (References below are to van der Hoeven's paper)

Sec. 2.2, pg. 484 says that Series_Rep has an attribute n, which records up to which degree the values in the cache are already correct. However, the order of φ is allowed to exceed n in order to anticipate future computations. Moreover, the coefficients φ_0, …, φ_{k−1} must be computed before computing φ_k.

In Sec. 4.2.1, pg. 502, the definition of DAC_Rep does not specify n initially, so it is probably meant to be 0. However, in Sec. 4.2.3 the definition of DAC_Rep does specify n := N/2, which the current branch does not take into account.

The implementation of van der Hoeven's algorithm should really use the dense version of streams. It probably makes sense to adapt the framework slightly.
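
To pin down the invariant from Sec. 2.2 in code, here is a minimal toy sketch (hypothetical names, not Sage's or the paper's Series_Rep): n records how many cache entries are final, the cache may be longer than n in order to anticipate future computations, and coefficients are always produced in order.

class ToySeriesRep:
    def __init__(self, compute_coefficient, anticipate=0):
        self._compute = compute_coefficient  # function k -> k-th coefficient
        self._phi = [0] * anticipate         # cache; entries at index >= n are scratch
        self._n = 0                          # phi[0], ..., phi[n-1] are final

    def __getitem__(self, k):
        # produce phi_0, ..., phi_{k-1} before phi_k, as required in Sec. 2.2
        while self._n <= k:
            if self._n >= len(self._phi):
                self._phi.append(0)
            self._phi[self._n] = self._compute(self._n)
            self._n += 1
        return self._phi[k]

s = ToySeriesRep(lambda k: k * k, anticipate=8)
print(s[5], s._n, len(s._phi))  # 25 6 8: the cache is longer than n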

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 2 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

1de1e6d  ugly, but working
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 2 years ago

Changed commit from 87d9db1 to 1de1e6d

mantepse commented 2 years ago
comment:5

Although the implementation is quite ugly in some details, I think we can learn enough to make some decisions.

sage: from sage.data_structures.stream import (Stream_cauchy_mul, Stream_cauchy_mul_fast, Stream_function)
sage: f = Stream_function(lambda n: n, True, 0)
sage: g = Stream_function(lambda n: 1, True, 0)

sage: %time h1 = Stream_cauchy_mul_fast(f, g, threshold=2^5); l1 = [h1[i] for i in range(270)]
CPU times: user 40.7 ms, sys: 0 ns, total: 40.7 ms
Wall time: 40.7 ms
sage: %time h2 = Stream_cauchy_mul(f, g); l2 = [h2[i] for i in range(270)]
CPU times: user 43.1 ms, sys: 0 ns, total: 43.1 ms
Wall time: 43.1 ms

sage: %time h1 = Stream_cauchy_mul_fast(f, g, threshold=2^5); l1 = [h1[i] for i in range(2000)]
CPU times: user 712 ms, sys: 0 ns, total: 712 ms
Wall time: 712 ms
sage: %time h2 = Stream_cauchy_mul(f, g); l2 = [h2[i] for i in range(2000)]
CPU times: user 1.85 s, sys: 0 ns, total: 1.85 s
Wall time: 1.85 s
sage: %time h1 = Stream_cauchy_mul_fast(f, g, threshold=2^5); l1 = h1[1000]
CPU times: user 233 ms, sys: 0 ns, total: 233 ms
Wall time: 233 ms
sage: %time h2 = Stream_cauchy_mul(f, g); l2 = h2[1000]
CPU times: user 2.43 ms, sys: 0 ns, total: 2.43 ms
Wall time: 2.44 ms
sage: from sage.data_structures.stream import Stream_function, Stream_cauchy_compose
sage: f = Stream_function(lambda n: n, True, 1)
sage: g = Stream_function(lambda n: n^2, True, 1)
sage: h = Stream_cauchy_compose(f, g)
sage: h[20]
289074264180
sage: [(len(x._cache), min(x._cache), max(x._cache)) for x in h._pos_powers[1:]]
[(20, 1, 20),
 (19, 2, 20),
 (18, 3, 20),
 (17, 4, 20),
 (16, 5, 20),
 (15, 6, 20),
 (14, 7, 20),
 (13, 8, 20),
 (12, 9, 20),
 (11, 10, 20),
 (10, 11, 20),
 (9, 12, 20),
 (8, 13, 20),
 (7, 14, 20),
 (6, 15, 20),
 (5, 16, 20),
 (4, 17, 20),
 (3, 18, 20),
 (2, 19, 20),
 (1, 20, 20)]
sage: from sage.data_structures.stream import (Stream_cauchy_mul, Stream_cauchy_mul_fast, Stream_function)
sage: f = Stream_function(lambda n: n, True, 0)
sage: g = Stream_function(lambda n: 1, True, 0)
sage: h1 = Stream_cauchy_mul_fast(f, g, threshold=2^5)
sage: l1 = [h1[i] for i in range(10)]
sage: len(h1._h._phi)
64
sage: h1._h._phi[:15]
[0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 0, 0, 0, 0, 0]
sage: h1._cache
[0, 1, 3, 6, 10, 15, 21, 28, 36, 45]
mantepse commented 2 years ago
comment:6

Somewhat orthogonal to the ticket: I think the decision whether a Stream_XXX class uses a dense or a sparse cache should not depend on the input streams.

For example, for Stream_add it is actually irrelevant whether the input streams left and right are dense or sparse.

For Stream_zero and Stream_exact, a dense cache never makes sense.
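
To make the dense/sparse distinction concrete, here is a toy contrast (plain Python, hypothetical names, nothing to do with the actual Stream classes): a list-backed cache must fill every slot up to the requested index, while a dict-backed cache stores only what was actually asked for. Presumably this is also why a dense cache never helps Stream_zero or Stream_exact, whose coefficients are available immediately, and why the choice should reflect a stream's own access pattern rather than that of its inputs.

class DenseCached:
    """Asking for coefficient k fills every slot 0, ..., k."""
    def __init__(self, f):
        self._f = f
        self._cache = []

    def __getitem__(self, k):
        while len(self._cache) <= k:
            self._cache.append(self._f(len(self._cache)))
        return self._cache[k]

class SparseCached:
    """Only the coefficients that were actually requested are stored."""
    def __init__(self, f):
        self._f = f
        self._cache = {}

    def __getitem__(self, k):
        if k not in self._cache:
            self._cache[k] = self._f(k)
        return self._cache[k]

d, s = DenseCached(lambda k: k ** 2), SparseCached(lambda k: k ** 2)
d[1000]; s[1000]
print(len(d._cache), len(s._cache))  # 1001 vs. 1: only the dense cache computed everything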

mantepse commented 2 years ago
comment:7

See #34636

tscrim commented 2 years ago
comment:8

Thank you for doing this.

+1 on using van der Hoeven's algorithm for the dense case. We might want to add some more documentation explaining when one version is preferable to the other.

mantepse commented 2 years ago
comment:9

Thank you for looking at it :-) - and more generally, for all the reviews!

The thing I am somewhat stuck with here is the class hierarchy. Consider:

from sage.rings.integer_ring import ZZ

class Stream_cauchy_mul_DAC():
    def __init__(self, left, right, phi, N, threshold):
        self._left = [left[k] for k in range(N)]
        self._right = [right[k] for k in range(N)]
        if phi is None:
            self._phi = [0]*(2*N)
            self._lo = None
            self._n = ZZ.zero()
        else:
            # TODO: the first N/2 entries of self._phi are already
            # computed, the computation of the next N/2 is initiated
            # by the next line.  Could / should this be done lazily?
            self._phi = [phi[k] for k in range(N)] + [0]*N
            self._lo = phi
            self._n = ZZ(N / 2)

Currently, this class does not inherit from anything. However, it duplicates functionality of Stream_inexact by providing its own cache. Moreover, self._phi[:n] is essentially the same as that cache, except that we produce (recursively) many copies of it via get_coefficient below.

class Stream_cauchy_mul_fast(Stream_binary):
    def __init__(self, left, right, threshold=2 ** 1):
        super().__init__(left, right, False)
        self._threshold = threshold
        self._h = Stream_cauchy_mul_DAC(left, right, None,
                                        self._threshold, self._threshold)

    def get_coefficient(self, n):
        if n >= self._threshold and is_power_of_two(n):
            self._h = Stream_cauchy_mul_DAC(self._left, self._right, self._h,
                                            2*n, self._threshold)
        return self._h[n]

In summary: to my eyes, this is a complete mess. I think it would be more beautiful if the caching mechanism provided by Stream_inexact were reused, possibly slightly generalized.

I do not understand the algorithm well enough to see how much of self._phi could actually be shared. Ideally, I'd like a single cache (provided by Stream_inexact or Stream_cauchy_mul) which is manipulated by Stream_cauchy_mul_DAC.

tscrim commented 2 years ago
comment:10

Is there a reason why these need to be two separate classes? It seems like you really just want to have one class that inherits from Stream_binary. I don't quite understand the obstruction to this from a quick look.

mantepse commented 2 years ago
comment:11

Stream_cauchy_mul_fast._h has its own state (i.e., _phi). In fact, this is precisely the question: can the _phi attributes of the various Stream_cauchy_mul_DAC instances share memory?

(I should have documented this: DAC stands for divide-and-conquer.)

tscrim commented 2 years ago
comment:12

It is certainly possible by passing the same list around and mutating it. However, each instance would then have to know which block is its responsibility, which makes the code structure a bit more complicated and likely harder to debug. How often do the lists need to be (re)constructed, or are they short-lived transient objects?
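
A toy illustration of that bookkeeping (plain Python, hypothetical names, not the branch's code): one list is allocated by the front end, and each helper only ever touches the block [offset, offset + size) it is responsible for.

class BlockWriter:
    """A helper that owns one block of a shared coefficient buffer."""
    def __init__(self, buf, offset, size):
        self._buf = buf          # the single shared list
        self._offset = offset    # first index this helper owns
        self._size = size        # length of the owned block

    def add_contribution(self, k, value):
        # k is an index relative to this helper's block
        assert 0 <= k < self._size
        self._buf[self._offset + k] += value

class FrontEnd:
    """Owns the cache and hands out non-overlapping blocks to helpers."""
    def __init__(self, n, block_size=4):
        self._cache = [0] * n
        self._helpers = [BlockWriter(self._cache, off, block_size)
                         for off in range(0, n, block_size)]

fe = FrontEnd(16)
fe._helpers[2].add_contribution(1, 7)  # writes into the shared fe._cache[9]
print(fe._cache[9])                    # 7

Whether this is worthwhile probably depends on the question above, i.e., how long the _phi lists actually live.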