Expansions can be checked for equality the same way you do in NumPy:
(expansion1 == expansion2).all()
If you are comparing two polynomials in the expansion, they can be compared directly:
expansion[idx] == expansion[idy]
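Putting both together, a minimal runnable sketch (the distribution here is just illustrative):

import chaospy

# Illustrative two-dimensional joint distribution.
dist = chaospy.J(chaospy.Uniform(0, 1), chaospy.Normal(0, 1))

expansion1 = chaospy.generate_expansion(2, dist, normed=True)
expansion2 = chaospy.generate_expansion(2, dist, normed=True)

# Compare two full expansions element-wise, then reduce with .all():
print((expansion1 == expansion2).all())  # True

# Compare two individual polynomials inside one expansion directly:
print(expansion1[1] == expansion1[2])    # False, they are different basis terms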
Hope this helps.
Thanks, it does help!! One more question: of course this will depend on the computing power, but I am seeing that generating expansions of more than 10k elements already takes a long time. Is this normal? Is there some limit to how large these polynomials can realistically be?
Not unusual at all.
The underlying data structure is a plain numpy.ndarray. There is no compression, so it takes poly.size * len(poly.coefficients), where the former is the size of each coefficient and the latter is the number of coefficients in the polynomials.
The limit on how big that can be varies from system to system, but on my arm64 system the numpy allocation limit is between 1e9 and 1e10, which is a double-digit number of gigabytes for floats.
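As a rough back-of-the-envelope sketch (sizes here are illustrative), the footprint can be estimated from the attributes mentioned above:

import chaospy

# Small illustrative expansion; the estimate scales the same way for larger ones.
dist = chaospy.J(chaospy.Uniform(0, 1), chaospy.Normal(0, 1))
expansion = chaospy.generate_expansion(6, dist)

n_polys = expansion.size                       # number of polynomials in the expansion
n_coeffs = len(expansion.coefficients)         # number of coefficient arrays stored
itemsize = expansion.coefficients[0].itemsize  # bytes per coefficient (8 for float64)

print(f"~{n_polys * n_coeffs * itemsize / 1e6:.2f} MB uncompressed")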
Ok, so if the underlying data structures are numpy.ndarrays, can I just exploit the built-in multithreading for them? For example, if I am generating polynomials I am just launching my script like python myscript.py; could I specify some OMP flags like OMP_NUM_THREADS=16 python myscript.py and get some parallelization?
I think so, yes.
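For what it's worth, a hedged sketch of doing the same thing inside the script instead of on the command line; the relevant variable depends on which BLAS backend numpy was built against, and it has to be set before numpy is imported:

import os

# Must be set before importing numpy; which variable applies depends on the
# BLAS backend (OpenBLAS, MKL, plain OpenMP builds, ...).
os.environ.setdefault("OMP_NUM_THREADS", "16")
os.environ.setdefault("OPENBLAS_NUM_THREADS", "16")
os.environ.setdefault("MKL_NUM_THREADS", "16")

import numpy
import chaospy

# ... generate expansions as usual; only BLAS-backed operations pick up the
# extra threads, the pure-Python loops discussed later in this thread do not.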
Thank you again! everything is clear
Sorry for reopening this issue, but I have one related question about the performance of the method for generating the polynomial expansions. In a previous post I asked if it was possible to generate a polynomial expansion directly from the results of the glexindex function. The answer was yes.
alpha = cp.glexindex(
    start=0,
    stop=stop,
    dimensions=dimension,
    cross_truncation=trunc,
    graded=True,
    reverse=True,
).T
qs = cp.variable(len(config.distribution))
polyno = cp.prod(
    [
        cp.generate_expansion(config.order, config.distribution[idx], normed=True)[
            alpha[idx]
        ](**{"q0": qs[idx]})
        for idx in range(len(config.distribution))
    ],
    axis=0,
)
However, I am wondering if my bottleneck is exactly here. Indeed, the generation of the alphas takes a small fraction of the time, while the majority of the time is spent in the second part of the operation. I think the reason is that this function first generates a non-truncated full-order polynomial in generate_expansion and then masks the generated array with the corresponding alphas. Is this the most effective way, though? The generate_expansion function itself allows for including cross truncation values, so I guess it internally handles the truncation in a more effective way?
Yes, that sounds about right, but not by much. I am looking through the code, and the biggest difference is that I am creating all the full expansions in one sweep, instead of iteratively like you do. The code:
_, polynomials, _ = chaospy.stieltjes(numpy.max(order), dist)
polynomials = polynomials.reshape((len(dist), numpy.max(order) + 1))
order = numpy.array(order)
indices = numpoly.glexindex(
    start=0,
    stop=order + 1,
    dimensions=len(dist),
    graded=graded,
    reverse=reverse,
    cross_truncation=cross_truncation,
)
polynomials = numpoly.prod(
    chaospy.polynomial(
        [poly[idx] for poly, idx in zip(polynomials, indices.T)]
    ),
    0,
)
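In other words, passing cross_truncation straight to generate_expansion exercises the same code path as the snippet above. A hedged single-call sketch (sizes and distribution here are illustrative, not from the thread):

import chaospy as cp

# Hypothetical sizes, mirroring the manual alpha-based construction earlier.
order, dimension, trunc = 3, 5, 0.7
distribution = cp.Iid(cp.Uniform(0, 1), dimension)

# One call that runs the code shown above: Stieltjes recurrence per dimension,
# glexindex for the truncated multi-indices, then a single numpoly.prod.
expansion = cp.generate_expansion(
    order,
    distribution,
    graded=True,
    reverse=True,
    cross_truncation=trunc,
    normed=True,
)
print(len(expansion), "terms")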
Hi again, indeed I have been investigating the codebase a little bit better and I have located the timing bottleneck in all this process. In particular, the really time-consuming part of the whole generate_expansion function call is the numpoly.prod part; it takes almost the totality of the computing time. Inside that function, the particularly time-consuming part is the numpoly.multiply function call, in this section:
seen = set()
for expon1, coeff1 in zip(x1.exponents, x1.coefficients):
    for expon2, coeff2 in zip(x2.exponents, x2.coefficients):
        key = (expon1 + expon2 + x1.KEY_OFFSET).ravel()
        key = key.view(f"U{len(expon1)}").item()
        if key in seen:
            out_.values[key] += numpy.multiply(
                coeff1, coeff2, where=where, **kwargs
            )
        else:
            numpy.multiply(
                coeff1, coeff2, out=out_.values[key], where=where, **kwargs
            )
            seen.add(key)
I have made my best effort to try to optimize this. I started with numba, but string manipulations for the structured array keys were poorly optimized. I have managed a very efficient implementation of the above loops using cython (https://github.com/oarcelus/numpoly), and I tested it with this:
import chaospy as cp
import time

distribution = cp.J(
    cp.Uniform(1e-17, 1e-13),
    cp.Uniform(1e-17, 1e-13),
    cp.Uniform(1e-12, 1e-9),
    cp.Uniform(1e-12, 1e-9),
    cp.Uniform(0.001, 0.1),
    cp.Uniform(0.001, 0.1),
    cp.Uniform(0.1, 300.0),
    cp.Uniform(0.1, 300.0),
    cp.Uniform(5.0, 15.0),
    cp.Uniform(5.0, 15.0),
    cp.Uniform(1.0, 3.0),
    cp.Uniform(1.0, 3.0),
    cp.Uniform(1.0, 3.0),
)

start = time.time()
for _ in range(10):
    polynomial1 = cp.generate_expansion(3, distribution, normed=True)
end = time.time()
print(end - start)
The overall solution times are similar, but look at the new implementation with cython:
names = out_.values.dtype.names
values = numpy.asarray([out_.values[name] for name in names])
output = numpoly.cmultiply(
    x1.exponents,
    x2.exponents,
    x1.coefficients,
    x2.coefficients,
    x1.KEY_OFFSET,
    names,
    values,
)
out_ = numpoly.polynomial(
    numpy.array([tuple(value) for value in output.T], dtype=out_.values.dtype)
)
The multiplication process itself in cmultiply is almost 100x faster than the former implementation, but now there is a significant overhead in the generation of values, names, and the new out_. Below is a representative case of the timings for one of the loops; the cython implementation is further segmented into the names and values variable generation, the cmultiply function call, and the generation of the new out_:
PYTHON 1.0414330959320068
Names-Values 1.251389503479004
Loop 0.01744389533996582
Polynomial 0.23286747932434082
CYTHON 1.5018203258514404
See that the Loop itself is super fast. I believe this is promising: if there is a good way of implementing this, it could unlock much greater capabilities for chaospy, because it would be possible to offload costly operations directly into C code. For now I tried to pass the out_.values variable directly and change the structured array values inside cmultiply, but then, when recovering the output, I cannot do something like out_.values = output because the setter function is commented out in numpoly. Do you see a way of trying to improve this? I believe there is potential.
Best regards and sorry for the long post.
Some additional information on top of the above. The real bottleneck is in the structured numpy.ndarray field lookups. Indeed, structured arrays are not really efficient for repeated lookups and are more intended for mimicking C structs. For example, if I first cast the structured array to a dictionary and then run the loops, the timings shift so that the majority of the time is now spent on the generation of the new dictionary (which requires lookups) rather than on the loops. Is there some particular reason for the ndpoly object to be a structured array? If it is only about lookups of the string representations of the polynomial exponents, it would be more efficient to store the object in another data structure such as a dict. Maybe this would trade some memory for speed.
Cast Dict 1.1413066387176514
PYTHON 0.1093614101409912
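A minimal sketch of the cast described above (the helper name is illustrative, not part of numpoly); each entry is a view into the structured array, so in-place updates through the dict still land in out_.values:

import numpy

def as_field_dict(struct_array):
    """Collect the fields of a structured array into a plain dict of views.

    Illustrative helper, not part of numpoly: repeated ``array[key]`` field
    lookups re-resolve the field name and build a new view each time, while a
    dict lookup is a single hash probe. Because each value is a view, writing
    into it also updates ``struct_array``.
    """
    return {name: struct_array[name] for name in struct_array.dtype.names}

# Usage sketch against the multiply loop quoted earlier:
#   fields = as_field_dict(out_.values)
#   ... accumulate into fields[key] instead of out_.values[key] ...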
Quick update. I have managed to optimize the generate_expansion function by quite a lot.
For a joint distribution of 13 unknown variables I can generate a full 4th order polynomial in 1 s, vs the 102 s it took with the previous implementation of the numpoly.multiply and numpoly.polynomial_from_attributes functions.
I have checked that all generated polynomials between 1st and 4th order are the same in my implementation by running np.all(poly_previous == poly_new). I will try to test whether the approach is robust for higher order polynomials, hyperbolic truncation, and custom alpha indices.
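A minimal sketch of that kind of regression check, where generate_expansion_old and generate_expansion_new are hypothetical wrappers around the two implementations (not real chaospy functions):

import numpy as np
import chaospy as cp

distribution = cp.Iid(cp.Uniform(1.0, 3.0), 13)

for order in range(1, 5):
    for trunc in (1.0, 0.75):  # full scheme and one hyperbolic truncation
        # Hypothetical wrappers around the previous and the Cython-backed
        # implementations; swap in however the two versions are exposed.
        poly_previous = generate_expansion_old(
            order, distribution, cross_truncation=trunc, normed=True
        )
        poly_new = generate_expansion_new(
            order, distribution, cross_truncation=trunc, normed=True
        )
        assert np.all(poly_previous == poly_new), (order, trunc)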
There are still some functions within numpoly that depend on key lookups of the poly.values structured array. It may be wise to implement cython functions for those as well in order to make it blazingly fast.
I am sorry that I haven't checked in before now; I've been quite busy lately.
This is amazing! I have been aware that numpoly.prod has been quite unoptimized, but diagnosing it down to the structured arrays, that is really good to know.
Structured arrays are used because I want to keep compatibility with the numpy dispatch protocol, and that was the easiest way to do that. Other than that, there are no particular reasons.
Do you mind making your optimizations into PRs in numpoly? I would love to have a look.
Hi again! No problem. I have been quite active these days because for my research I really need to improve this, so I have rushed these solutions. Here is a more complete benchmark. Also, other features like hyperbolic truncation etc. work fine with the Cython implementation.
I have identified one more bottleneck, in the '#main loop' of the numpoly.call procedure for polynomial evaluation. I think there are no issues with structured array key lookups here, since the numpoly.outer function still calls numpoly.multiply, which should be optimized now, so I guess it is simply the two nested Python loops that are inefficient. But I have to look at this in more detail. I think these additional optimizations could be super helpful as well.
I sent a PR from my fork to the original repository. As I said in my earlier message, I think it would be wise to implement a Cython solution for the rest of the places in the codebase that rely on frequent key lookups of structured numpy arrays.
When generating polynomial bases through the generate_expansion function, I want to check if two polynomials are the same for a given set of conditions, for example when building polynomials for a given expansion order and hyperbolic truncation. Many times I get polynomials that are the same, but I want to check this programmatically.