jonathf / chaospy

Chaospy - Toolbox for performing uncertainty quantification.
https://chaospy.readthedocs.io/
MIT License

How to check if two polynomials are the same #433

Closed: oarcelus closed this issue 1 month ago

oarcelus commented 1 month ago

When generating polynomial bases through the generate_expansion function, I want to check whether two polynomials are the same for a given set of conditions, for example when building polynomials for a given expansion order and hyperbolic truncation. I often get polynomials that are the same, but I want to check this programmatically.

jonathf commented 1 month ago

Expansions can be checked for equality the same way you would in Numpy:

(expansion1 == expansion2).all()

If you are comparing two polynomials within an expansion, they can be compared directly:

expansion[idx] == expansion[idy]

Hope this helps.
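For reference, a minimal sketch of both checks (assuming a small two-dimensional distribution built with cp.Iid; any distribution works the same way):

import chaospy as cp

distribution = cp.Iid(cp.Uniform(0, 1), 2)
expansion1 = cp.generate_expansion(3, distribution)
expansion2 = cp.generate_expansion(3, distribution)

# element-wise comparison reduced with .all(), as with numpy arrays
assert (expansion1 == expansion2).all()

# individual polynomials in an expansion can be compared directly
assert expansion1[0] == expansion2[0]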

oarcelus commented 1 month ago

Thanks, it does help! One more question: this will of course depend on computing power, but I am seeing that generating expansions with more than 10k elements already takes a long time. Is this normal? Is there a practical limit to how large these polynomials can realistically be?

jonathf commented 1 month ago

Not unusual at all.

The underlying data structure is a plain numpy.ndarray. There is no compression, so the memory footprint is poly.size * len(poly.coefficients), where the former is the size of each coefficient and the latter is the number of coefficients in the polynomial.

The limit on how big that can get varies from system to system, but on my arm64 system the numpy allocation limit is between 1e9 and 1e10, which corresponds to a double-digit number of gigabytes for floats.
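As a rough back-of-the-envelope check along those lines (a sketch, assuming 8-byte float coefficients):

import numpy as np
import chaospy as cp

distribution = cp.Iid(cp.Uniform(0, 1), 4)
expansion = cp.generate_expansion(8, distribution)

# footprint ~ size of each coefficient array times the number of coefficients
n_bytes = expansion.size * len(expansion.coefficients) * np.dtype(float).itemsize
print(len(expansion), "terms,", f"{n_bytes / 1e6:.1f} MB")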

oarcelus commented 1 month ago

OK, so if the underlying data structures are numpy.ndarrays, can I just exploit numpy's built-in multithreading for them? For example, when generating polynomials I currently just launch my script as python myscript.py; could I set an OpenMP flag, e.g. OMP_NUM_THREADS=16 python myscript.py, and get some parallelization?

jonathf commented 1 month ago

I think so, yes.
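A hedged note: how much this helps depends on numpy's backend, and the variables have to be set before numpy is first imported. One way to do that from inside the script, equivalent to prefixing the command line, is:

import os

# set thread counts before the first numpy import; which variable matters is
# backend-specific (OpenMP, OpenBLAS, MKL), so setting all three is a safe bet
os.environ.setdefault("OMP_NUM_THREADS", "16")
os.environ.setdefault("OPENBLAS_NUM_THREADS", "16")
os.environ.setdefault("MKL_NUM_THREADS", "16")

import numpy
import chaospy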

oarcelus commented 1 month ago

Thank you again! Everything is clear.

oarcelus commented 1 month ago

Sorry for reopening this issue, but I have one related question about the performance of the method for generating the polynomial expansions. In a previous post I asked whether it was possible to generate a polynomial expansion directly from the results of the glexindex function. The answer was yes.

# multi-indices of the truncated expansion; after the transpose, alpha[idx]
# holds the 1-D orders for dimension idx across all terms
alpha = cp.glexindex(
    start=0,
    stop=stop,
    dimensions=dimension,
    cross_truncation=trunc,
    graded=True,
    reverse=True,
).T

qs = cp.variable(len(config.distribution))
# build each 1-D normed expansion (in variable q0), pick the orders given by
# alpha, substitute q0 -> qs[idx], and take the product over dimensions
polyno = cp.prod(
    [
        cp.generate_expansion(config.order, config.distribution[idx], normed=True)[
            alpha[idx]
        ](**{"q0": qs[idx]})
        for idx in range(len(config.distribution))
    ],
    axis=0
)

However, I am wondering whether my bottleneck is exactly here. Indeed, generating the alphas takes a small fraction of the time, while the majority of the time is spent in the second part of the operation. I think the reason is that this approach first generates a non-truncated full-order expansion in generate_expansion and then masks the generated array with the corresponding alphas. Is this the most effective way, though? The generate_expansion function itself accepts cross-truncation values, so I guess it handles the truncation internally in a more effective way?
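As a point of comparison, here is a minimal sketch of letting generate_expansion apply the truncation itself, reusing config.order, config.distribution and trunc from the snippet above:

polyno = cp.generate_expansion(
    config.order,
    config.distribution,
    cross_truncation=trunc,
    graded=True,
    reverse=True,
    normed=True,
)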

jonathf commented 1 month ago

Yes, that sounds about right, but not by much. I am looking through the code, and the biggest difference is that I create all the full expansions in one sweep, instead of iteratively like you do. The code:

    _, polynomials, _ = chaospy.stieltjes(numpy.max(order), dist)
    polynomials = polynomials.reshape((len(dist), numpy.max(order) + 1))

    order = numpy.array(order)
    indices = numpoly.glexindex(
        start=0,
        stop=order + 1,
        dimensions=len(dist),
        graded=graded,
        reverse=reverse,
        cross_truncation=cross_truncation,
    )
    polynomials = numpoly.prod(
        chaospy.polynomial(
            [poly[idx] for poly, idx in zip(polynomials, indices.T)]
        ),
        0,
    )
oarcelus commented 1 month ago

Hi again. I have been investigating the codebase a bit more and I have located the timing bottleneck in this whole process. In particular, the really time-consuming part of the generate_expansion call is the numpoly.prod part; it takes almost all of the computing time. Inside that function, the particularly time-consuming part is the numpoly.multiply call, in this section:

seen = set()
for expon1, coeff1 in zip(x1.exponents, x1.coefficients):
    for expon2, coeff2 in zip(x2.exponents, x2.coefficients):
        key = (expon1 + expon2 + x1.KEY_OFFSET).ravel()
        key = key.view(f"U{len(expon1)}").item()
        if key in seen:
            out_.values[key] += numpy.multiply(
                coeff1, coeff2, where=where, **kwargs
            )
        else:
            numpy.multiply(
                coeff1, coeff2, out=out_.values[key], where=where, **kwargs
            )
        seen.add(key)

I have made my best effort to optimize this. I started with Numba, but string manipulation for the structured-array keys was poorly optimized there. I have managed a very efficient implementation of the above loops using Cython (https://github.com/oarcelus/numpoly), and I tested it with this:

import chaospy as cp
import time

distribution = cp.J(
    cp.Uniform(1e-17, 1e-13),
    cp.Uniform(1e-17, 1e-13),
    cp.Uniform(1e-12, 1e-9),
    cp.Uniform(1e-12, 1e-9),
    cp.Uniform(0.001, 0.1),
    cp.Uniform(0.001, 0.1),
    cp.Uniform(0.1, 300.0),
    cp.Uniform(0.1, 300.0),
    cp.Uniform(5.0, 15.0),
    cp.Uniform(5.0, 15.0),
    cp.Uniform(1.0, 3.0),
    cp.Uniform(1.0, 3.0),
    cp.Uniform(1.0, 3.0),
)

start = time.time()
for _ in range(10):
    polynomial1 = cp.generate_expansion(3, distribution, normed=True)
end = time.time()

print(end - start)

The overall solution times are similar, but look at the new implementation with Cython:

names = out_.values.dtype.names
values = numpy.asarray([out_.values[name] for name in names])

output = numpoly.cmultiply(
    x1.exponents,
    x2.exponents,
    x1.coefficients,
    x2.coefficients,
    x1.KEY_OFFSET,
    names,
    values,
)

out_ = numpoly.polynomial(
    numpy.array([tuple(value) for value in output.T], dtype=out_.values.dtype)
)

The multiplication process itself in cmultiply is almost 100x faster than the former implementation, but there is now significant overhead in building the names and values inputs and the new out_. Below is a representative case for the timings of one of the loops; the Cython implementation is segmented into the names/values generation ("Names-Values"), the cmultiply call itself ("Loop"), and the construction of the new out_ ("Polynomial"):

PYTHON 1.0414330959320068
Names-Values 1.251389503479004
Loop 0.01744389533996582
Polynomial 0.23286747932434082
CYTHON 1.5018203258514404

Note that the loop itself is super fast. I believe this is promising; if there is a good way of implementing this, it could unlock much greater capabilities for chaospy, because it would be possible to offload costly operations directly into C code. For now I tried to pass the out_.values variable directly and change the structured-array values inside cmultiply, but then, when recovering the output, I cannot do something like out_.values = output because the setter is commented out in numpoly. Do you see a way to improve this? I believe there is potential.

Best regards, and sorry for the long post.

oarcelus commented 1 month ago

Some additional information on top of the above. The real bottleneck is the structured numpy.ndarray field lookups. Structured arrays are not really efficient for repeated lookups; they are better suited to mimicking C structs. For example, if I first cast the structured array to a dictionary and then run the loops (a small sketch follows the timings below), the timings shift so that the majority of the time is spent generating the new dictionary (which requires the lookups) rather than in the loops. Is there a particular reason for the ndpoly object to be a structured array? If it is only about lookups of the string representations of the polynomial exponents, it would be more efficient to store the object in another data structure such as a dict. Maybe this would trade memory for speed.

Cast Dict 1.1413066387176514
PYTHON 0.1093614101409912
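To make the cast concrete, a minimal sketch of the idea (the field names here are hypothetical, standing in for out_.values):

import numpy as np

# small structured array standing in for out_.values
values = np.zeros(1, dtype=[("00", float), ("01", float), ("11", float)])

# build the field -> array mapping once ...
as_dict = {name: values[name] for name in values.dtype.names}

# ... then do the repeated lookups and in-place updates on the dict entries;
# each entry is a view, so the structured array is updated as well
as_dict["01"] += 3.0
print(values["01"])  # [3.]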
oarcelus commented 1 month ago

Quick update. I have managed to optimize the generate_expansion function by quite a lot.

For a joint distribution of 13 unknown variables I can generate a full 4th-order expansion in 1 s, versus the 102 s it took with the previous implementation of the numpoly.multiply and numpoly.polynomial_from_attributes functions.

I have checked that all generated polynomials between 1st and 4th order are the same with my implementation by running np.all(poly_previous == poly_new). I will try to test whether the approach is robust for higher-order polynomials, hyperbolic truncation, and custom alpha indices.

There are still some functions within numpoly that depend on key lookups into the poly.values structured array. It may be wise to implement Cython functions for those as well in order to make it blazingly fast Q(0.0Q)

jonathf commented 1 month ago

I am sorry that I haven't checked in before now, I've been quite busy lately.

This is amazing! I have been aware that numpoly.prod is quite unoptimized, but I had not diagnosed it down to the structured arrays. That is really good to know.

Structured arrays are used because I want to keep compatibility with the numpy dispatch protocol, and that was the easiest way to do it. Other than that, there is no particular reason.

Do you mind making your optimizations into PRs in numpoly? I would love to have a look.

oarcelus commented 1 month ago

Hi again! No problem. I have been quite active these days because I really need this improvement for my research, so I have rushed these solutions. Here is a more complete benchmark. Also, other features like hyperbolic truncation work fine with the Cython implementation.

[benchmark figure attached to the original comment]

I have identified one more bottleneck, in the '# main loop' section of the numpoly.call procedure for polynomial evaluation. I think there is no issue with structured-array key lookups here, since the numpoly.outer function still calls numpoly.multiply, which should now be optimized, so I guess it is simply the two nested Python loops that are inefficient. I have to look at this in more detail, but I think these additional optimizations could be super helpful as well.

I sent a PR from my fork to the original repository. As I said in my earlier message, I think it would be wise to implement Cython solutions for the other places in the codebase that rely on frequent key lookups into structured numpy arrays.