jonathf / chaospy

Chaospy - Toolbox for performing uncertainty quantification.
https://chaospy.readthedocs.io/
MIT License

Questions about `fit_regression` function #422

Open oarcelus opened 5 months ago

oarcelus commented 5 months ago

I am trying to build a surrogate model using PCE for 6 unknown variables and 4th-order polynomials.

I build my samples using latin-hypercube sampling and evaluate all of them with my model. I have successfully parallelized the evaluation procedure and am able to run 200-ish evaluations (6 variables, 4th-order polynomials) in a few minutes. However, the last step of the procedure, i.e., creating the surrogate model and obtaining the Sobol indices, takes a long time, about as long as it takes to run the evaluations.

I would like to know if this is normal. Maybe I am messing up the multiprocessing and there are some problems with that.
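
Roughly, my workflow looks like this (a minimal sketch; `toy_model`, the Uniform(-1, 1) inputs, and the output shape are stand-ins for my real setup):

```python
import numpy
import chaospy as cp

def toy_model(parameters):
    # placeholder for the real model, which returns ~1000 outputs per sample
    return numpy.array([numpy.sum(parameters), numpy.prod(parameters)])

distribution = cp.J(*[cp.Uniform(-1, 1) for _ in range(6)])   # 6 uncertain inputs (assumed)
expansion = cp.generate_expansion(4, distribution)            # 210 terms at order 4

samples = distribution.sample(210, rule="latin_hypercube")    # shape (6, 210)
evals = numpy.array([toy_model(sample) for sample in samples.T])

approx = cp.fit_regression(expansion, samples, evals)         # the slow step in question
sobol_first = cp.Sens_m(approx, distribution)                 # first-order Sobol indices
```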

jonathf commented 5 months ago

In and of itself that sounds really slow. But may I ask, how big is the model, as in the shape of one sample evaluation? If it is big, that could make sense.

oarcelus commented 4 months ago

The polynomials with 6 parameters at 4th order have shape (6, 210). For the PCE, I then sample 210 points from the joint PDF with the 'latin-hypercube' sampling method and evaluate the model. The output of the model is a (210, 1000) shaped array. Fitting a regression of this size takes pretty long.

I was able to reduce the time by being smarter about the model output shape. Some of my simulations have fewer than 1000 output points, so I can save time there, but others don't because of the output features.

So in this process I came up with more questions:

  1. Is it enough to sample 210 points? As far as I understand, the number of expansion terms is P = (N + p)!/(N! p!), where N is the number of unknown parameters and p is the polynomial order. Is it then enough to evaluate the model P times to get a surrogate model, or should I increase the number of evaluations? (See the sketch below.)
  2. I tried the pseudo-spectral projection method, but the number of evaluations required by the quadrature was pretty high, even after using Smolyak's sparse grid (the number of samples went from 10k down to 1k), so it was still more expensive to evaluate. However, I would guess that even if more samples are needed, the fitting process is faster because the Fourier coefficients are estimated by integration. Do you think that is the case?
  3. From the 'Advanced regression method' section I found a related issue thread about the slowness of some regression methods (https://github.com/scikit-learn/scikit-learn/issues/13923). They were able to reduce regression timings, for example by centering the data, which also allowed them to avoid fitting intercepts, and by making hard copies of the data. I see that chaospy uses the numpy.linalg.lstsq function, which by default will fit intercepts. Centering the data is case specific, but if I were to do it, is there an easy way of post-processing the resulting surrogate model to accept the 'uncentered' parameters again?
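
As a quick self-check on the counting formula in question 1 (a sketch; math.comb stands in for the factorial expression):

```python
from math import comb

# N = 6 unknown parameters, polynomial order p = 4
N, p = 6, 4
print(comb(N + p, p))   # 210, matching the number of expansion terms
```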

Sorry for the long post, and thanks for answering back!

jonathf commented 4 months ago

210 000 values should not be an issue performance-wise. The default lstsq method should handle that fast. If it is slow, that sounds like an issue with numpy. Just a sanity check: what is your output's .dtype? If it is object then you need to recast it to something numerical first.
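
For example, something along these lines (a sketch with a stand-in list of outputs) should give you a dense float array:

```python
import numpy

evals_list = [numpy.random.rand(1000) for _ in range(210)]   # stand-in for model outputs
evals = numpy.asarray(evals_list, dtype=float)               # dense array, shape (210, 1000)
print(evals.dtype)                                           # float64
```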

  1. How many samples you need is outside the scope of what I can answer. If the model is sufficiently smooth, then yes. A good way to test is to see if a lower-order approximation gives similar results. If it does, there is more reason to trust it.
  2. What sampling scheme are you using? To reap the benefits of a sparse grid you need to use something like Clenshaw-Curtis samples or similar. What probability distributions are your inputs?
  3. You could try that, but I'm not sure how much it would help, as lstsq doesn't have that speed issue. Here is the recipe: pass retall=True to cp.fit_regression to get the Fourier coefficients. If you use a polynomial expansion that starts with a constant, then the first Fourier coefficient represents the intercept parameter. You can manipulate it before constructing a new approximation with cp.sum(expansion * coefficients.T, axis=-1). You might need to reshape the result afterwards. (See the sketch below.)
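
Roughly like this (a sketch with toy data in place of your model):

```python
import numpy
import chaospy as cp

distribution = cp.J(*[cp.Uniform(-1, 1) for _ in range(6)])   # stand-in distribution
expansion = cp.generate_expansion(4, distribution)            # first term is the constant
samples = distribution.sample(210, rule="latin_hypercube")
evals = numpy.array([numpy.repeat(numpy.sum(s), 5) for s in samples.T])  # toy outputs

approx, coefficients = cp.fit_regression(expansion, samples, evals, retall=True)

# coefficients[0] belongs to the constant polynomial, i.e. the intercept;
# it can be adjusted here before rebuilding the approximation
rebuilt = cp.sum(expansion * coefficients.T, axis=-1)         # may need a reshape

```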
oarcelus commented 4 months ago

This is the way I am calling `fit_regression`:

`cp.fit_regression(polyno, samples, evals)`

`polyno.dtype` -> float64, `samples.dtype` -> float64

The type of evals is list[np.ndarray], and its contents are float arrays, for example `evals[0].dtype` -> float64.

  1. Clear, I will try to increase the sampling to see if the results improve, and check whether reducing the order of the polynomials maintains the results.
  2. I use the following probability distribution: `distribution = cp.J(cp.Uniform(1e-18, 1e-14), cp.Uniform(1e-18, 1e-14), cp.Uniform(1e-13, 1e-9), cp.Uniform(1e-13, 1e-9), cp.Uniform(1.0, 10.0), cp.Uniform(0.1, 4.0))`, and the sampling rule is 'gaussian', so I see there is something wrong already. The problem I see with this method is that my model is prone to crash for certain parameter combinations, and I cannot just 'eliminate' those because the sampling is deterministic and removing them would break the sampling series.
jonathf commented 4 months ago

Gaussian sampling doesn't refer to the Gaussian distribution, but to optimal Gaussian quadrature. If you want to use the pseudo-spectral approach, use Clenshaw-Curtis for your samples. It is fast, simple, stable, well tailored for uniform distributions, and most of all: it nests nearly perfectly with sparse grids.
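
For example (a sketch with a toy model in place of yours):

```python
import numpy
import chaospy as cp

distribution = cp.J(*[cp.Uniform(-1, 1) for _ in range(6)])   # stand-in distribution
expansion = cp.generate_expansion(4, distribution)

# sparse Clenshaw-Curtis quadrature for the pseudo-spectral projection
nodes, weights = cp.generate_quadrature(
    4, distribution, rule="clenshaw_curtis", sparse=True)
evals = numpy.array([numpy.sum(node) for node in nodes.T])    # toy model evaluations

approx = cp.fit_quadrature(expansion, nodes, weights, evals)
```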

Your distributions seem to be a bit ill-posed, as (1e-18, 1e-14) is quite far from (1, 10). I would recommend that you switch to one of the generalized PCE methods and use cp.Uniform(-1, 1) for all your base distributions.
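
One way to do that is to build the expansion on the canonical interval and affine-map the samples to your physical ranges before running the model, something like this (a sketch; `run_model` stands for your own model, not something chaospy provides):

```python
import numpy
import chaospy as cp

# canonical Uniform(-1, 1) inputs for the expansion and the samples
canonical = cp.J(*[cp.Uniform(-1, 1) for _ in range(6)])
expansion = cp.generate_expansion(4, canonical)

# physical parameter ranges, taken from the distribution above
lower = numpy.array([1e-18, 1e-18, 1e-13, 1e-13, 1.0, 0.1])
upper = numpy.array([1e-14, 1e-14, 1e-9, 1e-9, 10.0, 4.0])

samples = canonical.sample(210, rule="latin_hypercube")       # shape (6, 210), in [-1, 1]
physical = lower[:, None] + (samples + 1) / 2 * (upper - lower)[:, None]

# evals = run_model(physical)              # evaluate the model at the physical points
# approx = cp.fit_regression(expansion, samples, evals)
```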

Regarding crashes, I think you should stick with regression. You will have a better chance of getting good results in case you need to remove some samples.
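
For example, dropping failed runs before the fit could look like this (a sketch; the None convention for crashed runs and the toy outputs are just illustrations):

```python
import numpy
import chaospy as cp

distribution = cp.J(*[cp.Uniform(-1, 1) for _ in range(6)])   # stand-in distribution
expansion = cp.generate_expansion(4, distribution)
samples = distribution.sample(210, rule="latin_hypercube")

evals = [numpy.repeat(numpy.sum(s), 5) for s in samples.T]    # toy outputs
evals[17] = None                                              # pretend this run crashed

keep = [i for i, out in enumerate(evals) if out is not None]
approx = cp.fit_regression(
    expansion, samples[:, keep], numpy.asarray([evals[i] for i in keep]))
```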