Open oarcelus opened 5 months ago
In and of itself that sounds really slow. But may I ask, how big is the model, as in the shape of one sample evaluation? If it is big, that could make sense.
The polynomial expansion with 6 parameters at 4th order has shape (6, 210). For the PCE, I then sample 210 points from the joint PDF with the 'latin-hypercube' sampling method and evaluate the model. The output of the model is a (210, 1000) shaped array. Fitting a regression of this size takes pretty long.
I was able to reduce times by being smarter about the model output shape. Some of my simulations have fewer than 1000 output points, and I can save time like this. But others don't, because of their output features.
So in this process I came up with more questions:
1) Is it enough to sample 210 points? As far as I understand, the number of polynomial terms is P = (N + p)!/(N! p!), where N is the number of unknown parameters, p is the order of the expansion, and P is the resulting number of terms. Then it should be enough to evaluate the model P times in order to get a surrogate model. Is this the case, or should I increase the number of evaluations?
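For concreteness, the term count can be checked with plain Python (nothing chaospy-specific); for 6 parameters at 4th order it reproduces the 210 from the expansion shape above:

```python
from math import comb

def num_terms(N: int, p: int) -> int:
    """Number of terms in a total-degree PCE: P = (N + p)! / (N! p!)."""
    return comb(N + p, p)

# 6 parameters, 4th-order expansion -> 210 terms,
# matching the (6, 210) expansion shape.
print(num_terms(6, 4))  # 210
```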
2) I tried the pseudo-spectral projection method, but the number of evaluations needed for the quadrature was pretty high, even after using Smolyak's sparse grid (the number of samples went from 10k down to 1k), so it is still more expensive to evaluate. However, I would guess that even if more samples are needed, the fitting process is faster, because the Fourier coefficients are estimated by integration. Do you think that is the case?
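As a sanity check on why the projection step itself is cheap, here is a generic 1-D sketch (plain numpy with Legendre polynomials, not chaospy's implementation): once nodes and weights are fixed, each Fourier coefficient is just a weighted sum of model evaluations.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, legval

def model(x):
    """Toy model to project: f(x) = x**2 on [-1, 1]."""
    return x ** 2

# Gauss-Legendre nodes/weights, exact for polynomials up to degree 2*n - 1.
nodes, weights = leggauss(4)

def fourier_coeff(k):
    """c_k = <f, P_k> / <P_k, P_k>, with <P_k, P_k> = 2 / (2k + 1) on [-1, 1]."""
    basis = legval(nodes, [0] * k + [1])  # evaluate Legendre P_k at the nodes
    return np.sum(weights * model(nodes) * basis) * (2 * k + 1) / 2

coeffs = [fourier_coeff(k) for k in range(3)]
# Since x**2 = (1/3) P_0 + (2/3) P_2, coeffs is approximately [1/3, 0, 2/3].
```

The expensive part is generating the model evaluations at the nodes; turning them into coefficients is only these weighted sums, which is why the fit step of pseudo-spectral projection is fast compared to a least-squares solve.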
3) From the 'Advanced regression method' section I found a related issue thread discussing the slowness of some regression methods (https://github.com/scikit-learn/scikit-learn/issues/13923). They were able to reduce regression timings, for example by centering the data (which allowed them to skip fitting intercepts) and by making hard copies of the data. I see that chaospy uses the `numpy.linalg.lstsq` function, which by default will fit intercepts. Centering the data is case specific, but if I were to do it, is there an easy way of post-processing the resulting surrogate model so it accepts the 'uncentered' parameters again?
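To illustrate the centering trick in isolation (a minimal plain-least-squares sketch in numpy, not chaospy internals): fit on centered data without an intercept column, then fold the means back in so predictions accept the uncentered inputs again.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(210, 6))            # uncentered design matrix
true_beta = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = X @ true_beta + 7.0                           # true intercept = 7

# Center both sides, then solve without any intercept column.
X_mean, y_mean = X.mean(axis=0), y.mean()
beta, *_ = np.linalg.lstsq(X - X_mean, y - y_mean, rcond=None)

# Post-processing step: recover the intercept so the fitted model
# can be evaluated on raw, uncentered inputs as X @ beta + intercept.
intercept = y_mean - X_mean @ beta
```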
Sorry for the long post, and thanks for answering back!
A 210 × 1000 fit (210 000 values) should not be an issue performance-wise. The default method of `lstsq` should do that fast. If it is slow, that sounds like an issue with numpy. Just a sanity check: what is your output's `.dtype`? If it is `object`, then you need to recast it to something numerical first.
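A minimal illustration of the recast (the `evals` name and shapes here are assumptions based on the thread): a list of equal-length per-sample outputs stacks cleanly into a contiguous `float64` array, which is what `lstsq` handles fast.

```python
import numpy as np

# A list of per-sample outputs, e.g. as returned by multiprocessing workers.
evals = [np.zeros(1000) for _ in range(210)]

# Stacking equal-length outputs yields a fast, contiguous float64 array;
# an object-dtype array here would make the regression crawl.
evals_arr = np.asarray(evals, dtype=np.float64)
print(evals_arr.shape, evals_arr.dtype)  # (210, 1000) float64
```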
Pass `retall=True` to `cp.fit_regression` to get the Fourier coefficients. If you use a polynomial expansion that starts with a constant, then the first Fourier coefficient represents the intercept parameter. You can manipulate it before constructing a new approximation with `cp.sum(expansion * coefficients.T, axis=-1)`. You might need to reshape the result afterwards.

This is the way I am calling fit regression:
```python
cp.fit_regression(polyno, samples, evals)
```

- `polyno.dtype` -> `float64`
- `samples.dtype` -> `float64`
- `evals` is of type `list[np.ndarray]`; its contents are, for example, `evals[0].dtype` -> `float64`
```python
distribution = cp.J(
    cp.Uniform(1e-18, 1e-14),
    cp.Uniform(1e-18, 1e-14),
    cp.Uniform(1e-13, 1e-9),
    cp.Uniform(1e-13, 1e-9),
    cp.Uniform(1.0, 10.0),
    cp.Uniform(0.1, 4.0),
)
```
And the sampling is 'gaussian', so I see that there is something wrong already. The problem I see with this method is that my model is prone to crash for certain parameter combinations, so I cannot just 'eliminate' them, because the sampling is deterministic and removing points would break the sampling series.

'Gaussian' sampling does not refer to the Gaussian distribution, but to optimal Gaussian quadrature. If you want to use the pseudo-spectral approach, use Clenshaw-Curtis for your samples. It is fast, simple, stable, well tailored for uniform distributions, and most of all: nested nearly perfectly with sparse grids.
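The nesting property follows directly from the Clenshaw-Curtis node formula x_k = cos(pi k / n); this is a generic numpy sketch of that property, not chaospy's quadrature code:

```python
import numpy as np

def clenshaw_curtis_nodes(order):
    """Clenshaw-Curtis nodes on [-1, 1]: x_k = cos(pi * k / order)."""
    if order == 0:
        return np.array([0.0])
    return np.cos(np.pi * np.arange(order + 1) / order)

coarse = clenshaw_curtis_nodes(4)
fine = clenshaw_curtis_nodes(8)

# Every node of order 4 reappears at order 8, so refining a sparse grid
# built on these nodes reuses all previous model evaluations.
nested = all(np.isclose(fine, x).any() for x in coarse)
print(nested)  # True
```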
Your distributions also seem to be a bit ill-posed, as `1e-18, 1e-14` is quite far from `1, 10`. I would recommend you switch to one of the generalized PCE methods and use `cp.Uniform(-1, 1)` for all your base distributions.
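One way to set that up is to build the expansion on the well-scaled `cp.Uniform(-1, 1)` base variables and map samples to the physical ranges only when evaluating the model, so the regression never sees the 1e-18 scales. A minimal numpy sketch of that affine map (the `bounds` and `to_physical` names are made up for illustration; the ranges come from the joint distribution above):

```python
import numpy as np

# Physical parameter bounds, one (low, high) row per variable.
bounds = np.array([
    [1e-18, 1e-14],
    [1e-18, 1e-14],
    [1e-13, 1e-9],
    [1e-13, 1e-9],
    [1.0, 10.0],
    [0.1, 4.0],
])

def to_physical(u):
    """Map a sample u in [-1, 1]^6 (the PCE base space) to physical ranges."""
    lo, hi = bounds[:, 0], bounds[:, 1]
    return lo + (hi - lo) * (u + 1.0) / 2.0

# Evaluate the model at to_physical(u) while fitting the surrogate in u-space.
print(to_physical(np.full(6, -1.0)))  # lower bounds of every variable
```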
With crashes, I think you should use regression. You will have a better chance of getting good results in case you need to remove some samples.
I am trying to build a surrogate model using PCE for 6 unknown variables and 4th-order polynomials.
I build my samples using latin-hypercube sampling and evaluate all of them with my model. I have successfully parallelized the evaluation procedure and am able to run 200-ish evaluations (6 variables, 4th-order polynomials) in a few minutes. However, the last step of the procedure, i.e., creating the surrogate model and obtaining the Sobol indices, takes a long time, about as long as running the evaluations themselves.
I would like to know if this is normal. Maybe I am messing up the multiprocessing and there are some problems with that.