pmelchior / scarlet

hyperspectral galaxy modeling and deblending
MIT License

Create PixelCNN source #128

Open fred3m opened 5 years ago

fred3m commented 5 years ago

Using @EiffL's code, we will create a source that uses a PixelCNN to constrain morphologies.

EiffL commented 5 years ago

Code in the pixelcnn branch seems to be working, at least for single-band observations and without specifying PSFs: image

pmelchior commented 5 years ago

Nice! @EiffL I see that you branched off of adam, which is the right place. I guess we need to talk about centering of galaxies, though, because your pixelcnn assumes that galaxies are in the middle of the image, right?

EiffL commented 5 years ago

What Fred and I did is extract a centered postage stamp from the frame, and that's what gets fed to the pixelcnn. The position of the postage stamp is updated at each iteration based on the current center estimate.
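Roughly, the idea is the following (a sketch with a made-up extract_stamp helper, not the actual branch code):

import numpy as np

def extract_stamp(frame_image, center, size=64):
    # round the current center estimate to the nearest pixel and cut out a
    # size x size region around it, so the prior always sees a centered galaxy
    cy, cx = int(round(center[0])), int(round(center[1]))
    half = size // 2
    return frame_image[cy - half:cy + half, cx - half:cx + half]

The stamp is re-extracted at every iteration with the updated center, so the PixelCNN is always evaluated on a (roughly) centered galaxy.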


EiffL commented 5 years ago

This being said, we thought of a few other problems.


pmelchior commented 5 years ago

Good that Fred and you caught up. Shifting is solved then.

I think we can make sure that the sizes are always 64x64.

As for the normalization, we could do that too as part of preparing the image for the prior, but it's not that simple to automate. Our SEDs are total fluxes in observed units, so the conversion to units of HST i is not trivial in general. I know that you want the relative flux preserved because morphology can change with flux, but have you tested that assumption? It would be much more straightforward to normalize the peak intensity to one.
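For illustration, peak normalization amounts to nothing more than this (a sketch, not scarlet API; the removed factor would be absorbed by the SED amplitude):

import numpy as np

def normalize_peak(morph):
    # rescale so the brightest pixel is 1; the overall flux moves into the SED
    return morph / np.max(morph)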

EiffL commented 5 years ago

Hum... I don't think normalizing the peak intensity is practical because it would change the level of the background noise in the images. But if we don't particularly constrain the normalization of either the SED or the morphology, it should in principle still work: the prior will break the degeneracy. Now, the only problem is to ensure that the first-guess solution for the morphology is in the right ballpark in terms of flux to be compatible with the prior.

I'll make a few tests to see how all this behaves in practice.

fred3m commented 5 years ago

We also need to do a more thorough investigation of how to calculate our gradient step sizes. While ADAM isn't too sensitive to the choice of step_sizes, in the image shown above we were initially getting nonsense results because there was a difference of 4 orders of magnitude between the pixel CNN gradients and the likelihood gradients. So we simply scaled the pixel CNN gradients down to make it work, but there is no guarantee that the step size ratio will be the same for all images; in fact it will almost certainly be different depending on the spectral norms of A and S.

pmelchior commented 5 years ago

Adam expects the step sizes in data units, i.e. for the morphology something like 0.01 should be fine, and for the SED something of order 1% of the expected amplitude. Less to be on the safer side, more to be faster.

EiffL commented 5 years ago

But Fred's point is that the Lipschitz constant of the prior term is 10^4 larger (or smaller, I don't remember which way it goes) than that of the likelihood term. Hence, the effective step size of an Adam update is 10^4 smaller than without the prior, and the effect of the likelihood term is very small. Without scaling the gradient of the prior, we would probably have had to run for 10^6 iterations to converge to the same solution we got in 100 iterations.

There might be a principled way of scaling the two gradient terms; otherwise, this heuristic of scaling the prior so that the gradients have similar amplitude might work well in practice...
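For reference, the heuristic amounts to something like this (a sketch, not the actual code in the branch):

import numpy as np

def rescale_prior_grad(grad_likelihood, grad_prior):
    # rescale the prior gradient so its typical magnitude matches that of the
    # likelihood gradient, avoiding the ~10^4 mismatch in effective step size
    scale = np.linalg.norm(grad_likelihood) / (np.linalg.norm(grad_prior) + 1e-12)
    return scale * grad_prior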


pmelchior commented 5 years ago

This would point to an improper definition of the likelihood or the priors. When inverse variance weights are set, the likelihoods should be proper, so you need to check the prior normalization.

None of that is related to the step size in Adam, which is externally set and independent of the amplitude of the gradients. That is the clever thing about it.
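For reference, the Adam update normalizes the gradient by its running RMS, so the actual step is of order the externally set step size regardless of the raw gradient amplitude (a textbook sketch, not scarlet's implementation):

import numpy as np

def adam_step(x, g, m, v, t, step_size=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # m, v: running first and second moments of the gradient
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    # the update has magnitude ~step_size, independent of the scale of g
    x = x - step_size * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v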

pmelchior commented 5 years ago

In other words, you don't need Lipschitz constants for the prior; simply computing the gradient is sufficient.

EiffL commented 5 years ago

I have relocated development to a new branch u/EiffL/pixelcnn where I'm tweaking the prototype Fred and I built last week. It's now based on the latest version of adam.

Just ran a test, seems to work nicely :-)

pmelchior commented 5 years ago

Thanks, this looks pretty good. Do you have the trained pixelCNN for us to play with?

I'm not (yet?) a fan of putting tf code into scarlet's main classes. I'd rather see it contained in its respective Prior class implementation; this way we don't force users to have it. I'm also not sure the get_centered_ROI method should be in MorphParameter instead of MorphPrior, where it's functionally needed.

In terms of creating a batch: could the get_centered_ROI method be written in tf and be part of the mini-batch processing?

EiffL commented 5 years ago

I have packaged one of the trained pixelcnns I've been using and hosted it on the scarlet-pixelcnn repo. Here is how one would use this stuff:


import numpy as np
from astropy.io import fits

import scarlet
from scarlet.prior import MorphologyPrior

# trained PixelCNN module hosted on the scarlet-pixelcnn repo
module_path = 'https://github.com/pmelchior/scarlet-pixelcnn/blob/master/modules/pixelcnn_out.tar.gz?raw=true'

# blended scene and the two isolated galaxies (for comparison)
im = fits.getdata('blend.fits')
im1 = fits.getdata('gal1.fits')
im2 = fits.getdata('gal2.fits')

prior = MorphologyPrior(module_path)
frame = scarlet.Frame((1, 65, 65))
# inverse-variance weights for a constant background variance of 0.01147
weights = im.reshape((1, 65, 65)) * 0 + 1. / 0.01147
observation = scarlet.Observation(im.reshape((1, 65, 65)), weights=weights).match(frame)

# two sources with the PixelCNN prior; symmetry and monotonicity constraints disabled
sources = [scarlet.ExtendedSource(frame, (32, 32), observation, bg_rms=np.array([0.03]), prior=prior, symmetric=False, monotonic=False),
           scarlet.ExtendedSource(frame, (32, 42), observation, bg_rms=np.array([0.03]), prior=prior, symmetric=False, monotonic=False)]

blend = scarlet.Blend(sources, observation)

blend.fit(500, e_rel=1e-6, step_size=1e-3)

where gal1.fits, gal2.fits, and blend.fits are hosted here: https://github.com/pmelchior/scarlet-pixelcnn

One thing to note: if I don't downscale the strength of the prior, it now works (instead of giving garbage like we were seeing last week), but the images are over-regularized, which actually degrades the fit to the data. I still need to make sure we are properly scaling all the terms.

Concerning the TensorFlow stuff, the problem is that it makes more sense to have the Blend object take care of TensorFlow sessions, just like it's already blend.fit() that takes care of computing the gradients with autograd. In principle you could have several priors, each defining a different TensorFlow function, but the execution environment is handled by a different class. Note that in TF2 things may be different, since there are no sessions to keep track of. But anyway, the way I've written it, the TensorFlow module only gets imported if needed; under normal operations the import is hidden.
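To be concrete about keeping the import contained (a hypothetical, stripped-down prior class, not the code in the branch):

class MorphologyPrior:
    def __init__(self, module_path):
        # tensorflow is only imported when a prior is actually constructed,
        # so users who don't request the prior never need the dependency
        import tensorflow as tf
        self._tf = tf
        self.module_path = module_path

    def grad(self, morph):
        # evaluate the gradient of log p(morph) under the trained PixelCNN;
        # placeholder body, the real version builds and runs the TF graph
        raise NotImplementedError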

As for get_centered_ROI, yes, I think it's fine to move it elsewhere if you want, maybe as a function rather than a class member if it's not associated with the MorphParameter class. I don't have strong feelings about it :-)

EiffL commented 5 years ago

Small addition to the previous post: I have added inverse-variance weights in the Observation and removed the arbitrary scaling factor we had on the prior term in the code. I was originally confused because I thought the bg_rms term in ExtendedSource would end up defining the variance in the l2 loss function; it turns out it's the weights argument in Observation that plays this role.

Why not define bg_rms in Observation, and then each source can use that to initialize itself? Is there a use case where different sources might have different bg_rms, which would also be different from the likelihood variance?
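To be explicit, for a constant background the weights I'm now passing are just the inverse per-pixel variance (a sketch; 0.01147 is the background variance from the example above):

import numpy as np

bg_var = 0.01147                              # background variance per pixel
weights = np.full((1, 65, 65), 1.0 / bg_var)
# these weights multiply the squared residuals in the l2 (chi^2) loss,
# which is why they, and not bg_rms, set the likelihood variance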

In any case, here are the results with this setting (image), and each component compared to the true input galaxy (images).

pmelchior commented 5 years ago

You're doing it correctly now. ExtendedSource has an implementation that predates Observation, so we haven't closed the loop. One concern is that weights is an optional argument while bg_rms is required for source initialization, but we can raise an exception.

Would you open an issue in scarlet to make that adjustment? Simply link to this thread.

pmelchior commented 5 years ago

Besides the weights vs bg_rms discussion, which we'll address in #129, and the packaging and dependency improvements we'll make later on, there are important changes we need.

esheldon commented 5 years ago

The noisy models are gone, but I'm still seeing artifacts in the models:

artifacts

esheldon commented 5 years ago

If I set e_rel=1e-7 the artifacts go away: models-e7

Residuals look good: residuals-e7

esheldon commented 5 years ago

I just realized, however, that the code ran for the maximum of 500 iterations and took 2 minutes.

EiffL commented 5 years ago

@esheldon Great that you got it to work. Another important factor here is the step size used with the optimizer: at 1e-3 the optimization is unstable, at 1e-4 it works much better. On my laptop I get convergence (e_rel=1e-7) at step_size=1e-4 in 20s and ~300 iterations. Still not exactly "fast", but there are a bunch of amortized inference techniques we could be using to speed this up...
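For reference, the settings that converge for me (same setup as in the example earlier in this thread):

blend.fit(500, e_rel=1e-7, step_size=1e-4)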

@pmelchior Is there a maximum number of iterations that would be a deal breaker? I.e., can we decouple the optimization speed from the quality of the inverse problem? I would say we can, but if you disagree, we can look at the unrolled optimization procedures that people use in medical imaging nowadays for fast inverse-problem solving.

@pmelchior Concerning handling different sizes of observations, it's already in place: the code extracts postage stamps around the sources, with sizes that match the prior. As for flux normalization, can we adopt some LSST scale? I don't know what that may be. The "good thing" about these standardized HST fluxes is that the pixel values are already <1 and hence quite amenable to neural-network optimization.

fred3m commented 5 years ago

@EiffL I think that the maximum number of iterations and maximum run time depend a lot on how this scales with the number of pixels and the number of objects in the blend.

Do you know how this scales with pixel resolution (pix/arcsec)? If we create these templates at LSST resolution (~0.2 arcsec/pixel) but convolve them with a PSF that is narrow compared to LSST seeing (sigma ~0.8-0.9), that should make a huge difference, right? Naively I would expect the network to speed up linearly with the number of pixels in a template/postage stamp, but I wouldn't be surprised if the relationship is more complicated.

To give an idea of the target run time: the goal initially requested by LSST was 10 ms/band/source in the blend (so ~100 ms for a two-object blend in 5 bands), although I think we've successfully convinced them to allow more time for deblending. I imagine that 10 s per object will be way too high, but if it runs at least 20 times faster when trained on LSST/HSC-resolution templates then we should be ok. However, if this works as well on blends with a large number of sources across multiple bands as it looks like it might, then given the difficulty of deblending in LSST we might be able to convince the powers that be that the extra time is worth it.

@pmelchior will have to speak for joint processing and what their run-time expectations are, but he is on vacation for the next two weeks.

EiffL commented 5 years ago

Thanks for this @fred3m. I think the pixel resolution is not going to make a lot of difference: there is a big overhead in running the TensorFlow session, which probably takes more time than computing the gradients themselves.

This being said, at lower resolution, with fewer details, the optimization problem will be much faster to solve.

Ok, I'll keep the 100 ms target in mind :-) It's probably achievable, but maybe not right away...

EiffL commented 5 years ago

Almost done addressing the issue linked above. Demo notebook of training a PixelCNN on COSMOS images at HSC resolution using distributed cloud TPU training :-) https://colab.research.google.com/drive/1p-DOHO6aU1ToPG4lca_pjHyHMt7sxs5l

EiffL commented 5 years ago

So, after a weekend of experimentation on multiband images, here is a list of things that are not yet perfect:

Here is an example of first guess using scarlet: image image image

And updated after minimization with pixelcnn prior: image image image

In this example, I used morph_max SED normalization, and I downweighted the likelihood by an arbitrary factor of 0.001.

herjy commented 5 years ago

This looks really nice! When you say you trained on HST, so that applying to HSC effectively deconvolves, do you still explicitly account for the convolution in the inverse problem? It looks like there is some ringing in model 1; any idea where this comes from? Also, I wonder if, for initialization, scarlet could be run without symmetry. It would be faster and would probably still get us where we need to be.

EiffL commented 5 years ago

Yes, yes, yes, and I need your help @herjy :-) I tried to build the training set at the HSC pixel resolution, with a standardized effective PSF, so that all galaxies in my training set have the same PSF. But it's not working well; I have aliasing.

Here is the notebook I used to build an effective PSF for the training set: https://github.com/ml4astro/galaxy2galaxy/blob/u/EiffL/pixelcnn/notebooks/BuildingCOSMOS_PSF.ipynb

Then the images for the training set are generated using GalSim here: https://github.com/ml4astro/galaxy2galaxy/blob/01cffaa59da229e93f9e839a54cf18443f8b914c/galaxy2galaxy/data_generators/cosmos.py#L101
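Roughly, the image generation does something like this (a simplified sketch with placeholder parameter values, not the actual galaxy2galaxy settings; it assumes the GalSim COSMOS catalog has been downloaded):

import galsim

cat = galsim.COSMOSCatalog()                  # HST/COSMOS training galaxies
gal = cat.makeGalaxy(0, gal_type='real')      # one galaxy; profile is pre-deconvolved from the HST PSF
target_psf = galsim.Gaussian(fwhm=0.1)        # standardized effective PSF (placeholder width)
obj = galsim.Convolve([gal, target_psf])      # reconvolve with the standardized target PSF
# draw at the HSC pixel scale; the aliasing shows up here when the target PSF
# is too narrow for the 0.168"/pixel grid
img = obj.drawImage(nx=64, ny=64, scale=0.168)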