rmjarvis / Piff

PSFs In the Full FOV. A software package for modeling the point-spread function (PSF) across the full field of view (FOV). Documentation:
http://rmjarvis.github.io/Piff/

Gaussian process interpolation #14

Closed rmjarvis closed 7 years ago

rmjarvis commented 8 years ago

Add an interpolation scheme that implements some kind of Gaussian process interpolation.

If we can successfully model the optical portion of the PSF separately, and we are left with something that is close to pure atmosphere, then I think this part should be well modeled by a Gaussian process, which means that GP interpolation (e.g. kriging) should do a better job than polynomials.

This issue is to write an interpolator that implements this.

jmeyers314 commented 7 years ago

From a very cursory glance at knn_interp.py it looks like we could probably do something very similar here with sklearn.gaussian_process.

RobertLuptonTheGood commented 7 years ago

Rumour (== Dan Foreman-Mackey) has it that the next scikit-learn gaussian process code will be much better -- I'm not sure if it's just speed. I also don't know if the API changed.

jmeyers314 commented 7 years ago

... although it doesn't look like there's any way in sklearn (or in DFM's george) to regress several functions (say, for fwhm, g1, g2, ...) simultaneously under a common covariance function. For a first implementation then, I suppose independent regressors may have to do.
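
For concreteness, a minimal sketch of the independent-regressor idea using the GaussianProcessRegressor API (sklearn 0.18+); the kernel choice, length scales, and the uv/params arrays here are illustrative, not an actual Piff interface:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_independent_gps(uv, params):
    """Fit one GP per PSF parameter column (fwhm, g1, g2, ...).

    uv     : (nstar, 2) array of field positions
    params : (nstar, nparam) array of fitted PSF parameters
    """
    gps = []
    for k in range(params.shape[1]):
        kernel = RBF(length_scale=0.3) + WhiteKernel(noise_level=1e-4)
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        gp.fit(uv, params[:, k])  # each column gets its own covariance fit
        gps.append(gp)
    return gps

def predict_params(gps, uv_new):
    """Interpolate every parameter to new positions; columns stay independent."""
    return np.column_stack([gp.predict(uv_new) for gp in gps])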

rmjarvis commented 7 years ago

I'll just mention a nice paper by Gentile et al that gives a good description of GP (aka kriging), radial basis functions, inverse distance weighting and other interpolation schemes.

They are related in form (they all can be described with the formula $\hat z = \sum_i \lambda_i z(x_i)$), but with some non-trivial differences in how $\lambda_i$ and $z(x_i)$ are calculated.

I'm also not sure which ones of these could be extended to have $z$ be a vector rather than a scalar.
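
For reference, in simple kriging (equivalently, GP regression with a known mean and covariance) the weights in that formula are fixed by the covariance function itself:

$$\hat z(x_*) = \sum_i \lambda_i \, z(x_i), \qquad \lambda = C^{-1} c_*, \qquad C_{ij} = C(x_i, x_j), \quad (c_*)_i = C(x_*, x_i),$$

so the schemes mainly differ in how $C$ is chosen or measured and whether a noise ("nugget") term is added to its diagonal.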

jmeyers314 commented 7 years ago

Looks like Gentile et al are using independent regressors: Appendix A.1, step 2.

rmjarvis commented 7 years ago

Yes, that's how they did it. Some of the methods might be amenable to extension to vectors. But if not, probably PCAing the vectors and treating each principal component coefficient as a scalar would be appropriate.

For things like Kolmogorov with only a few parameters to fit (which are likely fairly orthogonal), there might not be much difference between the PCA approach and just the raw parameter values.
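
A hedged sketch of that PCA route with sklearn, where interpolate_scalar stands in for whichever scalar interpolator is used (the helper name and n_components=2 are placeholders, not anything in Piff):

import numpy as np
from sklearn.decomposition import PCA

def pca_interp(uv_train, params_train, uv_new, interpolate_scalar, n_components=2):
    """Rotate params into principal components, interpolate each coefficient
    as a scalar field over (u, v), then rotate back to parameter space."""
    pca = PCA(n_components=n_components)
    coeffs = pca.fit_transform(params_train)        # (nstar, n_components)
    coeffs_new = np.column_stack([
        interpolate_scalar(uv_train, coeffs[:, k], uv_new)
        for k in range(coeffs.shape[1])
    ])
    return pca.inverse_transform(coeffs_new)        # back to (fwhm, g1, g2, ...)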

jmeyers314 commented 7 years ago

probably PCAing the vectors and treating each principal component coefficient as a scalar would be appropriate

That reminds me... I think I've actually done this before to make an optics PSF emulator. Just need to switch the input params from [Z2, Z3, Z4, Z5, Z6, ...] to [u, v]...

jmeyers314 commented 7 years ago

The development version of scikit-learn GaussianProcessRegressor does look like a significant improvement. I think I can cook up something on the latest stable version of sklearn pretty quickly, but we should definitely come back to this once the new sklearn is released.

jmeyers314 commented 7 years ago

What's the purpose of kNNInterp.getFitProperties()? Don't we always want to interpolate the entirety of star.fit.params?

cpadavis commented 7 years ago

You're very likely right. I can't think of a use case for it -- I think I ended up coding it that way since it looked just like the getProperties, only for star.fit instead of star.data. In fact, I think it's actually bad, because it means some fit parameters could be ignored!

Edit: The only use case I can come up with is that maybe some other version of knn interp might want to interpolate some other set of properties that we are not actively fitting with our model (ie something that is in star but isn't star.fit.params). But even that doesn't actually jibe with the way it is actually written.


jmeyers314 commented 7 years ago

I think I got the basic implementation of this with sklearn down now. Unfortunately, it doesn't work very well. The basic problem is that while sklearn gaussian_process will let you set the parameter that describes the noise variance in the regression (affectionately termed the "nugget"), there isn't any facility to optimize the log marginal likelihood w.r.t. the nugget in the same way that the log marginal likelihood is optimized w.r.t. the correlation length scale. It looks like george and development sklearn both do have this ability though.

So... I see a few options:

  1. Let the user set the nugget. This requires some sort of guess for how noisy the star.fit.params are, but is really easy for us to implement. I'll have to think about how to propagate the user-set nugget if we turn on PCA.
  2. Wrap existing sklearn optimization to also optimize the nugget.
  3. Use george, which is itself pip-installable but also depends on eigen, which might be a bit too steep a dependency here? (though it should be fast!)
  4. Use development sklearn before it's released.
  5. Wait for new sklearn.

I'm tempted to start with 1, add 2, and then add 3 as an optional enhancement+dependency.
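
For what options 1 and 2 might look like once the new GaussianProcessRegressor API is available (in the current release the old GaussianProcess class takes a nugget argument directly), here is a hedged sketch; the numbers are illustrative guesses:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Option 1: the user supplies a guess of the noise variance on star.fit.params;
# alpha is added to the diagonal of the kernel matrix, i.e. a fixed nugget.
gp_fixed = GaussianProcessRegressor(kernel=RBF(length_scale=0.3), alpha=1e-4)

# Options 2/4/5: include a WhiteKernel so the optimizer fits the noise level
# along with the correlation length when maximizing the log marginal likelihood.
kernel = (RBF(length_scale=0.3, length_scale_bounds=(1e-3, 1e1))
          + WhiteKernel(noise_level=1e-4, noise_level_bounds=(1e-8, 1e-1)))
gp_opt = GaussianProcessRegressor(kernel=kernel)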

rmjarvis commented 7 years ago

It looks like you're testing it against a polynomial variation, rather than something that is actually generated as a Gaussian process. Do we expect it to work well in this case? Polynomials kind of violate some of the assumptions that GP makes I think.

You could generate a real Gaussian field using the PowerSpectrum class in GalSim. We normally use that for generating shear fields with some cosmological power spectrum. But you could also use it to generate size and shapes of PSFs that follow some input power spectrum. That might be a fairer test of a GP interpolator.
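
Something along these lines could generate the correlated truth field; the power-law shape, amplitude, and grid parameters are made up, and the exact buildGrid/getShear signatures should be checked against the GalSim docs:

import galsim

# Toy E-mode power spectrum; the softened power law and amplitude are illustrative.
ps = galsim.PowerSpectrum(e_power_function=lambda k: 1e-4 * (k**2 + 0.01)**-1.25)

# Realize a correlated (g1, g2) field on a grid covering the field of view.
rng = galsim.BaseDeviate(12345)
g1, g2 = ps.buildGrid(grid_spacing=30.0, ngrid=20, rng=rng)  # spacing in arcsec

# After buildGrid, the field can be looked up at arbitrary star positions, e.g.
g1_star, g2_star = ps.getShear(galsim.PositionD(123.4, -56.7))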

jmeyers314 commented 7 years ago

Do we expect it to work well in this case?

I think it ought to work well for any sufficiently smooth function.

You could generate a real Gaussian field using the PowerSpectrum class in GalSim.

Agree that this would be an interesting test case.

One thing I've noticed is that the GP looks like it's trying to smooth the data more and more each iteration. The results are much nicer looking if you don't actually iterate.

jmeyers314 commented 7 years ago

I've made some progress here. Caught a couple of bugs in Kolmogorov that were giving me problems before.

Here's the results for interpolating the g1 parameter for a Kolmogorov PSF, where the true g1 varies as a polynomial over the field (4 other PSF params not plotted are also varying over the field as polynomials):

[image: f1] https://cloud.githubusercontent.com/assets/3650485/18936655/fb4a5ad6-859d-11e6-8104-88f962db1f7f.png

Here's the same, but this time, the truth is generated from galsim.PowerSpectrum:

[image: f3] https://cloud.githubusercontent.com/assets/3650485/18936686/3d47deb8-859e-11e6-9e5c-ccea8cde44b7.png

Not too good. But if you PCA the fitted PSF parameters first, and then only keep the first 2 PCs (which in this case explain ~99.3% of the parameter variance), you get:

[image: f4] https://cloud.githubusercontent.com/assets/3650485/18936701/67c1df04-859e-11e6-94fe-653c8244c67b.png

I believe the improvement is due to the fact that the GP interpolator optimization step, which looks for a good correlation length, is better able to latch onto the signal (instead of noise) in the PCA case. The optimized inverse length scales are 0.002 without PCA, and 0.38 with.

I tried the same thing on the polynomial example, but there the difference was negligible, maybe due to the fact that the interpolation was already really good there.

cpadavis commented 7 years ago

Cool stuff!

When you PCA the fitted PSF parameters, do you mean over star.fit.params? Don't you have only 3 or 5 parameters for the Kolmogorov? (Are you doing the centroid adjustments?) I'm surprised that essentially throwing out one parameter makes such a dramatic improvement... Also, how are you generating the terms in the PS case? Do they each get their own PS? (if size, g1, g2 are generated independently, how can a linear combination of them do better? But on the other hand, if g1 and g2 are constrained by the overall ellipticity length, maybe the pca is picking up on that constraint.)


jmeyers314 commented 7 years ago

When you PCA the fitted PSF parameters, do you mean over star.fit.params?

Yep. I'm doing the 5 param version (centroid is part of star.fit.params).

if size, g1, g2 are generated independently, how can a linear combination of them do better?

That's a good point. I was lazy for the GP tests and just set the true fwhm and centering params to constants. I'll try again generating them from their own distributions though.

rmjarvis commented 7 years ago

I was discussing this thread with @pmelchior and @TallJimbo on Tuesday when I was up at Princeton, and they thought there should be some place where you figure out what the correlation function of the GP is. If you aren't setting something explicit, they thought sklearn is probably assuming a Gaussian (or "squared exponential", which is apparently the preferred term to avoid confusion with the "Gaussian" of the GP).

In the Gentile et al discussion of kriging, they calculate something they call a variogram, gamma(h), which is connected to the correlation function C(h) as gamma(h) = C(0) - C(h). I think from what Peter and Jim were saying about this that C(0) is what sklearn calls the nugget.

Anyway, sklearn might be doing something semi-reasonable without an explicit variogram (or correlation function), but I suspect we'll want to calculate it ourselves and provide it explicitly.

jmeyers314 commented 7 years ago

they thought there should be some place where you figure out what the correlation function of the GP is

Yep. This is done by the sklearn.gaussian_process.GaussianProcess.fit() method. Current sklearn only optimizes the correlation length of the squared exponential kernel (there are other kernels available, but I think squared exponential makes the most sense).

I'm less familiar with the "variogram" terminology, but you might be right that the nugget is at least related to C(0).

In the long run, I think we'll want at least some kind of prior over what kernel parameters are reasonable, potentially informed by something like observations of high stellar density fields. I don't know that we'll want to assert something outright, since it's quite likely the correlation function of PSF parameters varies exposure to exposure.

rmjarvis commented 7 years ago

I think squared exponential makes the most sense.

I don't think that's likely to be the appropriate kernel for the atmosphere. Why can't we just measure it directly from the data? We should have all the information we need to do so.

jmeyers314 commented 7 years ago

I don't think that's likely to be the appropriate kernel for the atmosphere. Why can't we just measure it directly from the data? We should have all the information we need to do so.

Hmmm. That's not something I've ever seen done with a GP, but could be interesting to investigate. We could certainly try estimating the parameters of a von Karman correlation function (essentially r0 and an outer scale) if a non-parametric correlation function proves too tricky.

jmeyers314 commented 7 years ago

Actually, from Heymans++12 eq. 8, it looks like a von Karman correlation function may be the same as a Matern(5/6) correlation function.
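
If so, the new sklearn kernels module makes this easy to try, since Matern exposes the smoothness parameter directly (the length scale and noise level here are illustrative):

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

# nu = 5/6 is the Matern smoothness usually quoted for Kolmogorov/von Karman
# turbulence; non-standard nu values work but are slower than nu in {0.5, 1.5, 2.5}.
kernel = Matern(length_scale=0.3, nu=5.0/6.0) + WhiteKernel(noise_level=1e-4)
gp = GaussianProcessRegressor(kernel=kernel)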

pmelchior commented 7 years ago

It is indeed surprising: GP models depend profoundly on the covariance/correlation/variogram function, and yet nobody seems to fit it as part of the inference. It's often "our data looks like that" or "this is what worked best", whatever either of those statements means. I haven't done a literature search, but I'd guess that this has at least been tried. My second guess is that this is a non-trivial modification of the GP likelihood. It's maybe easier to split this into an iterative solution (sketched below) where one

  1. solves the regression problem with GP given a covariance
  2. solves for the covariance given the residuals of the regression model
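
A rough sketch of that loop; the three callables are left to the caller, since how to turn measured residual correlations back into a valid kernel is exactly the open question:

def iterate_gp_and_covariance(uv, values, kernel, fit_gp, measure_correlation,
                              kernel_from_correlation, n_iter=5):
    """Alternate between (1) GP regression given the current covariance and
    (2) re-estimating the covariance from the residuals of that regression."""
    for _ in range(n_iter):
        gp = fit_gp(uv, values, kernel)              # step 1: regression given covariance
        residuals = values - gp.predict(uv)
        corr = measure_correlation(uv, residuals)    # step 2: covariance from residuals
        kernel = kernel_from_correlation(corr)
    return gp, kernel
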
jmeyers314 commented 7 years ago

Actually, from Heymans++12 eq. 8, it looks like a von Karman correlation function may be the same as a Matern(5/6) correlation function.

Although... it's not clear to me why the correlation of the ellipticities should look like the correlation in an instantaneous phase screen.

rmjarvis commented 7 years ago

I'm pretty sure the Gentile et al implementation of kriging just involved measuring the correlations on the data directly. I guess I'm not following why this is not an easy thing to do. Is it just that sklearn doesn't have an easy way to put in an empirical correlation function?

BTW, for DES we've lately been looking at the PSF patterns of the presumably atmospheric component and there is a clear directionality to the streaks, so we've been thinking that we might even want the correlation function to be anisotropic. This I expect would be difficult to incorporate, since I think the code probably expects the kernel to be isotropic (i.e. C(r,theta) == C(r)). But at least getting the radial function from the data seems like it should be straightforward.

pmelchior commented 7 years ago

Not sure about sklearn, but multivariate covariances are analytically tractable.

jmeyers314 commented 7 years ago

I'm pretty sure anisotropic covariances are possible with sklearn. The theta kwarg to GaussianProcess just becomes multidimensional.

jmeyers314 commented 7 years ago

@rmjarvis , I'm tooling up to try and answer your question about injecting a measured covariance function instead of optimizing a parametric one. This involves rolling a new GP interpolator by hand (sklearn won't do it), which I think I've got more or less running now, at least for the isotropic case. (Shouldn't be too hard to extend this to anisotropic).

TreeCorr seems like a natural place to do the measurement part. I only looked at this briefly, but it seems like this only handles isotropic correlation functions, right? How much work would it be to implement anisotropic correlation function measurements? Or is there another tool I should look at?

I suppose I can start in the meantime by injecting the correlation function used to generate fake data into the GP and comparing that to an optimized parametric correlation function GP.

rmjarvis commented 7 years ago

TreeCorr seems like a natural place to do the measurement part. I only looked at this briefly, but it seems like this only handles isotropic correlation functions, right? How much work would it be to implement anisotropic correlation function measurements? Or is there another tool I should look at?

Yes, TreeCorr only does isotropic. Although, for the number of stars we have, I suspect the tree algorithm isn't that much of an advantage. Just doing the brute force correlation as a direct sum will already be fairly fast. (Probably even in python.) So for the anisotropic binning, we can just do that.

So probably go ahead and use TreeCorr for now for the isotropic one. I think we can leave the anisotropic one for a later issue. When we get to that, we can try to decide whether it makes more sense to add that feature to TreeCorr or just implement the direct sum in Piff.

gbernstein commented 7 years ago

I've done the 2d 2PCF with all pairs in numpy (histogram2d) successfully. Takes a minute or so on my laptop to do all pairs of ~20k objects. Let me know if you want to see any code - it's nothing clever or fancy.

aaronroodman commented 7 years ago

yes that would be useful - otherwise I was going to do the same…



gbernstein commented 7 years ago
import numpy as np

def vcorr2d(x, y, dx, dy,
            rmax=1., bins=513,
            maxpts=30000):
    """
    Produce the 2d 2-point correlation function of total displacement power
    for the supplied sample of data, using brute-force pair counting.
    Output are 2d arrays giving the 2PCF and then the number of pairs that
    went into each bin.  The 2PCF calculated is
    xi_+ = <vr1 vr2 + vt1 vt2> = <vx1 vx2 + vy1 vy2>
    Note that each pair is counted only once.  So to count all pairs one can
    average xi_+ with itself reflected about the origin.
    """
    hrange = [[-rmax, rmax], [-rmax, rmax]]
    if len(x) > maxpts:
        # Subsample the arrays to get the desired number of points
        rate = float(maxpts) / len(x)
        print("Subsampling rate {:5.3f}%".format(rate * 100.))
        use = np.random.random(len(x)) <= rate
        x = x[use]
        y = y[use]
        dx = dx[use]
        dy = dy[use]
    print("Length ", len(x))
    # Get index arrays that make all unique pairs
    i1, i2 = np.triu_indices(len(x))
    # Omit self-pairs
    use = i1 != i2
    i1 = i1[use]
    i2 = i2[use]
    del use

    # Make separation vectors and count pairs in each 2d separation bin
    yshift = y[i2] - y[i1]
    xshift = x[i2] - x[i1]
    counts = np.histogram2d(xshift, yshift, bins=bins, range=hrange)[0]

    # Accumulate the displacement products and normalize by the pair counts
    vv = dx[i1] * dx[i2] + dy[i1] * dy[i2]
    xiplus = np.histogram2d(xshift, yshift, bins=bins, range=hrange, weights=vv)[0] / counts

    return xiplus, counts

This is a vector version (dx and dy) but a scalar version just drops dy.

jmeyers314 commented 7 years ago

Thanks @gbernstein . That works nicely.

I've coded up a GP that can use nearest neighbor interpolation of an empirically measured correlation function. I keep running into problems trying to Cholesky solve the implied covariance matrix though. (See algorithm 2.1 here). Cholesky seems like the way to go for a symmetric positive semidefinite matrix, but maybe there are more stable algorithms?

I wonder if part of the problem, and maybe the reason people tend to use parametric covariance functions in general, is that to work as a covariance, a function must be positive semi-definite. (See equation 4.2 and surrounding text here). It's not obvious to me one way or the other if a measured correlation function will always properly behave this way, and I think it's even less clear if any kind of interpolation of a measured covariance will behave well. Note that the Kriging literature, and Gentile, et al. in particular, also appear to use parametric forms for their covariograms -- the so-called "authorized" covariograms.
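
For reference, the standard Cholesky-based prediction being attempted (presumably Rasmussen & Williams' Algorithm 2.1) looks roughly like this; the factorization in the first line is exactly what fails when the measured covariance is not positive definite:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict(K, Ks, Kss_diag, y, noise_var):
    """GP predictive mean and variance via a Cholesky factorization.

    K         : (n, n) covariance among training points
    Ks        : (m, n) covariance between test and training points
    Kss_diag  : (m,) prior variance at the test points
    y         : (n,) observed (zero-mean) values
    noise_var : scalar noise variance added to the diagonal (the "nugget")
    """
    cf = cho_factor(K + noise_var * np.eye(len(y)))   # requires positive definiteness
    alpha = cho_solve(cf, y)
    mean = Ks.dot(alpha)
    v = cho_solve(cf, Ks.T)
    var = Kss_diag - np.einsum('ij,ji->i', Ks, v)
    return mean, var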

TallJimbo commented 7 years ago

Naive Cholesky won't work unless you can guarantee positive definite (no semi-). There are symmetric indefinite factorizations that would be ideal, but they're sometimes hard to find, and a symmetric Eigensolver or SVD will also do the job (less efficiently).

rmjarvis commented 7 years ago

There are symmetric indefinite factorizations that would be ideal, but they're sometimes hard to find, and a symmetric Eigensolver or SVD will also do the job (less efficiently).

TMV has that, so if you're willing to call out to a C/C++ function to do some work, we could use that. It does an L D L^T decomposition, where L is a lower-triangular matrix with unit diagonal, and D is a block-diagonal matrix with 1x1 and 2x2 blocks. Any symmetric matrix can be so decomposed. If it is positive semi-definite, then I think D ends up with all 1x1 blocks. Negative eigenvalues imply some 2x2 blocks.

jmeyers314 commented 7 years ago

I think I have the GP using the measured covariance function more or less working now. I ended up using numpy SVD, and had to put an upper limit on the singular values to robustly invert the matrix. (Maybe TMV can do that too?) There are a lot of potential variations to investigate, so I don't think I can say yet how this compares to using a parametric form for the covariance, but it works well enough that I think it's worth including here as an option.

It occurs to me that we're going to have to do something more sophisticated to use GPs on real data. Thinking ahead to LSST, for each exposure we'd like to interpolate the PSF measured from ~10^4.5 stars to the positions of ~10^6 galaxies, which formally implies inverting at least one ~10^6 x 10^6 matrix, and maybe more if we think the covariances of distinct PSF parameters are different. We can probably reduce the complexity by independently working on subregions of the focal plane, but that will require some more thought/experimentation. One thing that might help is if there are any clever ways given a constant symmetric NxN matrix to then invert many symmetric N+1 x N+1 matrices each with one additional column/row.
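
A minimal sketch of the SVD inversion with a cutoff on the singular values mentioned above, interpreting the limit as discarding the smallest singular values in the same spirit as numpy.linalg.pinv's rcond (the threshold is illustrative):

import numpy as np

def svd_solve(K, y, rcond=1e-10):
    """Solve K a = y via SVD, zeroing the contribution of singular values
    below rcond * s_max so a nearly singular measured covariance does not
    blow up the solution."""
    U, s, Vt = np.linalg.svd(K)
    s_inv = np.where(s > rcond * s.max(), 1.0 / s, 0.0)
    return Vt.T @ (s_inv * (U.T @ y))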

rmjarvis commented 7 years ago

One thing that might help is if there are any clever ways given a constant symmetric NxN matrix to then invert many symmetric N+1 x N+1 matrices each with one additional column/row.

There is for the Cholesky case. Also for the case where your covariance matrix is defined as X.T * X for some matrix X. (You can do a QR decomposition, X = QR, rather than directly calculating S = X.T * X and then doing Cholesky. It's both faster and more numerically stable.)

In the latter case, you can think about this as adding a single row to X and finding the new R value (upper triangle matrix, corresponding to L.T in the normal Cholesky presentation).

And yes, TMV does this. :) It's even one of the few algorithms where TMV is faster than LAPACK.

jmeyers314 commented 7 years ago

Umm, actually... I think I meant inverting an Nstar x Nstar matrix, so maybe not quite as hard as I was thinking.

Also, just noticed that the latest scikit-learn (0.18) was released two weeks ago, and has many nice new GP features. It's probably worth finding out how this fares with ~10^4.5 training points and ~10^6 prediction points.

Two notable omissions though:

The code looks quite understandable though, so I'm optimistic I could monkey patch these into the rest of the sklearn framework.

rmjarvis commented 7 years ago

I'd bet they would consider pull requests too if it's close but needs a feature or two.

jmeyers314 commented 7 years ago

Quick note:

For a 35000x35000 matrix (i.e., 35000 training PSFs, aka stars):

time to cholesky in george using the default solver (i.e., numpy.linalg): 600s
time to cholesky in george using the HODLR solver (see http://dan.iel.fm/george/current/user/solvers/#george.HODLRSolver): 1s

I haven't explicitly timed it yet, but I believe sklearn is also using numpy's cholesky, so probably similar speed to the slow version of george.

Since I think 35,000 is roughly the size of problem we're anticipating for LSST, I'm now advocating we go with george in spite of its additional prerequisites (eigen).
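
For reference, the george usage being timed would look roughly like this; the kernel, amplitudes, and data are illustrative, and the predict signature has changed between george versions, so check the docs for the installed one:

import numpy as np
import george
from george import kernels

# Fake training data: (u, v) positions and one scalar PSF parameter per star.
rng = np.random.RandomState(0)
uv_train = rng.uniform(-1.0, 1.0, size=(5000, 2))   # 35000 in the timing test above
y_train = np.sin(3.0 * uv_train[:, 0]) * np.cos(3.0 * uv_train[:, 1])

# 2D squared-exponential kernel; the amplitude and metric here are guesses.
kernel = 0.1 * kernels.ExpSquaredKernel(metric=0.1, ndim=2)

# HODLR solver: hierarchical factorization instead of a dense Cholesky.
gp = george.GP(kernel, solver=george.HODLRSolver)
gp.compute(uv_train, yerr=1e-2)

uv_new = rng.uniform(-1.0, 1.0, size=(10, 2))
mu, cov = gp.predict(y_train, uv_new)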

RobertLuptonTheGood commented 7 years ago

eigen is a pretty lightweight dependency (it is of course header only).

gbernstein commented 7 years ago

Wow, that's insanely fast!

jmeyers314 commented 7 years ago

Ugg. Looks like I was a little too quick to speak (again...).

While HODLR is indeed ~600x faster for some covariance functions, it seems the relative performance of HODLR and np.linalg depends a lot on the scale-length of the covariance. Weirdly, HODLR seems at its best the more dense the covariance matrix is (the longer the scale-length is). Will keep investigating... Regardless, it would be nice to have the option of testing different solvers using george.

rmjarvis commented 7 years ago

eigen is a pretty lightweight dependency (it is of course header only).

But it's not pip installable. For now at least, that's a requirement of piff.

jmeyers314 commented 7 years ago

The interfaces for sklearn and george are pretty similar. I bet it's not too hard to just wrap both.

It's pretty hard to say whether it's worth having the HODLR solver, or worth making other kinds of approximations, without some actual data available to see what the actual correlations look like.

jmeyers314 commented 7 years ago

I think I got the anisotropic radial basis function kernel working (in sklearn). Check it out:

[image: anisotropicrbf]

The interpolator is taking the 100 points in the left panel, optimizing the parameters of the 2D covariance matrix (+ nugget), and then making predictions. There's also a hidden step here where mildly noisy postage stamp images of Kolmogorov PSFs are being drawn and fit.
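
Not the actual kernel class from this work, but the math behind an anisotropic RBF can be sketched with a Mahalanobis distance; the 2x2 matrix C below plays the role of the optimized covariance, with the nugget added separately on the diagonal:

import numpy as np
from scipy.spatial.distance import cdist

def anisotropic_rbf(X1, X2, C, amplitude=1.0):
    """Anisotropic squared-exponential kernel matrix between two sets of (u, v)
    positions.  C is a 2x2 positive-definite 'length-scale' covariance; its
    off-diagonal term tilts the correlation pattern, matching streaky PSFs."""
    d = cdist(X1, X2, metric='mahalanobis', VI=np.linalg.inv(C))
    return amplitude * np.exp(-0.5 * d**2)

# Example: correlations elongated along a 45-degree direction.
C = np.array([[0.10, 0.08],
              [0.08, 0.10]])
X = np.random.RandomState(1).uniform(-1, 1, size=(100, 2))
K = anisotropic_rbf(X, X, C) + 1e-4 * np.eye(len(X))  # nugget on the diagonal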

jmeyers314 commented 7 years ago

Hey @rmjarvis. Just pinging you again about this issue. Any more comments?

rmjarvis commented 7 years ago

Thanks for the reminder. For some reason, I thought I was waiting for you for something on this. It looks fine to me. There was one more test that I had to loosen the tolerance for slightly on my laptop, but I think it's probably fine. I'll go ahead and merge it.

jmeyers314 commented 7 years ago

No worries. My DM overlords will be happy to hear, and push around story points :)