Add demo notebook - Githubissues

Add a demo notebook to demonstrate the capabilities of the class.

Demo notebook has been updated substantially. Please take a look and make edits as you see fit.

Two notes:

The BIC test takes a very long time (upwards of 30 minutes). A "To be deleted" code box has been added; run that one instead of the BIC test one to skip actually running the test but have all the output available. The rest of the demo should be fairly quick.
The BIC results change when I rerun them: the best value for n_components ranges from 5 to 7. Seven components seems to provide the best fit, so I went with that and attempted to explain it in the text. Feel free to update if this is unclear.
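For readers who want to see the kind of rerun-to-rerun variation described above without waiting 30 minutes, here is a minimal sketch. It does not use the XDGMM class; it assumes scikit-learn's `GaussianMixture` as a stand-in, and the data are synthetic blobs rather than the SN sample. The helper name `best_n_components` is hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def best_n_components(X, candidates, seed):
    """Return (best_n, bics): the candidate with the lowest BIC for one seed."""
    bics = []
    for n in candidates:
        gmm = GaussianMixture(n_components=n, random_state=seed).fit(X)
        bics.append(gmm.bic(X))
    return candidates[int(np.argmin(bics))], bics

# Synthetic stand-in data: three well-separated 2D blobs (not the SN sample).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(size=(200, 2)) for c in centers])

candidates = [1, 2, 3, 4, 5]
# Repeat the scan with different seeds to see how stable the preferred value is.
picks = [best_n_components(X, candidates, seed)[0] for seed in range(5)]
print("preferred n_components per seed:", picks)
```

If the preferred value moves between seeds, as reported here for 5 to 7 components, the BIC curve is probably flat in that range, and tallying the picks over several seeds is one way to justify the final choice.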

Tom, this is great! You are generating supernovae at realistic positions with realistic SALT2 parameters :-) @rbiswas4 will be delighted! I see that only light editing of the text is required, I'll do that.

This is awesome!

I tried running it but got stuck at the corners ... probably some kind of version incompatibility, that I have not sorted out. You should add your XDGMM package as a requirement somewhere.

This is also timely: Aside from trying to use this in SN simulations at a catalog level, we would also like to use this for images. And as @drphilmarshall might have told you, we are at a point where we are discussing plans for a new Twinkles simulation. Do you think it would be worth discussing (a) if we could use empiriciSN for that purpose and (b) how to make sure we have all the 'requirements' covered? Maybe over a telecon at some point?

Here are a few questions/comments:

It seems you are training on the SALT parameters (x0, x1, c). Now, x0 is roughly F/d_L^2, where F is related to the intrinsic brightness of the SN and d_L is the luminosity distance. This means that two SNe of the same character at two different redshifts would have very different x0 values. With a large enough amount of training data that incorporates the host redshift, one may perhaps learn the relevant x0 values, but a less ambitious yet still useful goal (achievable with less data) would be to use $x_0 * d_L^2$, with $d_L^2$ unsatisfactorily having to be calculated from a model like Planck15 with LCDM. Have you thought about trying this? And in your current model, what happens if you ask for SN parameters for galaxies at different redshifts (i.e., checking whether you might have achieved what I called the more ambitious goal)? This would involve taking a sample of galaxies over a range of redshifts (0-1.2), obtaining x0 means and uncertainties in redshift bins, and seeing if something like -2.5 * log10(x0) ~ mu + const is acceptable (though the scatter is somewhat large, ~1 mag).
Location of SN: I see a histogram of radius, but we will also need a parametrized angle (galaxies are often like ellipses in 2D), so how should we think about minor and major axes? Have you thought about a comparison of the location probability distributions obtained from empiriciSN to the naive "SN follows light" prescription sampling of the same galaxy? This should be doable if the data source lists quantities like Sérsic indices.
Sensitivity: How sensitive is this to changes in the SALT parameters (training)? I would suggest replacing the Sullivan et al. values with our JLA values. (They incorporate improvements in both the SALT model and, much more importantly, the calibration.) It would be interesting to see what changes result from such changes in the training data.
Can we see comparisons of both global and conditional distributions of SALT parameters as functions of galaxy types (This is not what you were training on, but it would be interesting to see the correlations learned)?
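The rescaling raised in the first question above can be sketched as follows. This is only an illustration under stated assumptions: the cosmology is flat LCDM with roughly Planck15-like values typed in by hand (`H0`, `OM0`), radiation is neglected, and the names `luminosity_distance_mpc` and `flux_like` are hypothetical. A real analysis would more likely use a cosmology package.

```python
import numpy as np

C_KM_S = 299792.458  # speed of light in km/s
H0 = 67.7            # km/s/Mpc; assumed, roughly Planck15-like
OM0 = 0.31           # matter density; assumed, roughly Planck15-like

def luminosity_distance_mpc(z, n_steps=2048):
    """Flat LCDM luminosity distance in Mpc (radiation neglected)."""
    zz = np.linspace(0.0, z, n_steps)
    inv_e = 1.0 / np.sqrt(OM0 * (1.0 + zz) ** 3 + (1.0 - OM0))
    # Trapezoidal integral of 1/E(z) from 0 to z.
    dz = zz[1] - zz[0]
    integral = dz * (inv_e.sum() - 0.5 * (inv_e[0] + inv_e[-1]))
    return (1.0 + z) * (C_KM_S / H0) * integral

# Two SNe with the same intrinsic brightness proxy F at different redshifts.
flux_like = 1.0e3  # arbitrary illustrative value
redshifts = np.array([0.1, 1.0])
d_l = np.array([luminosity_distance_mpc(z) for z in redshifts])

x0 = flux_like / d_l ** 2   # x0 ~ F / d_L^2 differs strongly with redshift
corrected = x0 * d_l ** 2   # x0 * d_L^2 recovers the same value for both

print("x0:", x0)
print("x0 * d_L^2:", corrected)
```

The point of the toy case is that the distance dependence dominates x0, so training on `x0 * d_L^2` removes a trend the model would otherwise have to learn from the data.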

Thanks Rahul! I'd better leave the technical questions to @tholoien. I think getting some empiriciSN into Twinkles is an excellent idea - let's do it! I'm sure Tom would love to help out if we get stuck (as we almost certainly will :-).

Hi guys, sorry for the slow response, been traveling. Thanks for the detailed comments Rahul. My responses to your questions are below:

The sample we trained the model on has ~1400 SNe ranging out to a bit beyond redshifts of 1 (from SNLS and SDSS), and there is definitely a correlation between redshift and the x0 parameter. I trained on x0 directly by design, as hosts with different redshifts definitely do give different x0 values. My thinking on this was that we want to be able to get x0, x1, and c given the host parameters (redshift, separation, color, and local surface brightness), and redshift allows the model to narrow down the acceptable x0 values by quite a bit. We could certainly try training it using a different quantity though.
The SDSS host photometry, which is what I used for all the host information, provides only information for an exponential or de Vaucouleurs profile fit to the host photometry, and this is what we used to train the model. So all the surface brightnesses and radii come from those fits. The SDSS profile fits do give B/A ratios and position angles (rotation in the plane of the sky), so we could try to incorporate those into selecting a proper position. I'm not sure what the best way to do this would be...I don't think we want to include those quantities in the model fit necessarily, but perhaps we could do something like select a radius in the way we currently do it, and then use the angle and axis ratio to somehow select an actual position in the host. I'll have to think on that for a bit.
For how sensitive the model is to changes in the SALT parameters, I'm not sure, since I've only ever trained it on the existing ones. It would be very easy to swap out the Sullivan et al. ones with the JLA ones and redo the fit to compare, so if you can point me to a good place to get those, I can do that. Would there be SALT parameters for both the SDSS and SNLS samples?
I could definitely produce plots showing the distributions of any of the host or SN parameters used in the fit with respect to each other. If you take a look at the PlotCorr notebook in the repo, that contains plots of the SN parameters vs. all the host properties we used to train the model, taken directly from the data. It would be easy to sample a few thousand data points from the trained model and plot those results too, if that's more what you're looking for. In theory, the XDGMM model should recover the underlying "true" distribution from the noisy data used in the fit, so it could give a better sense of what the actual distribution looks like.
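As a sketch of what "getting x0, x1, and c given the host parameters" means for a Gaussian mixture, the snippet below conditions a toy mixture on known values of some dimensions using the standard Gaussian conditioning formulas. It is not taken from the XDGMM or empiriciSN code; the function name `condition_gmm` and the toy numbers are assumptions for illustration.

```python
import numpy as np

def condition_gmm(weights, means, covs, known_idx, known_vals):
    """Condition a Gaussian mixture on fixed values of some dimensions.

    Returns (new_weights, cond_means, cond_covs) over the remaining dimensions.
    """
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    known_idx = np.asarray(known_idx)
    known_vals = np.asarray(known_vals, dtype=float)
    free_idx = np.setdiff1d(np.arange(means.shape[1]), known_idx)

    log_w, c_means, c_covs = [], [], []
    for w, mu, cov in zip(weights, means, covs):
        s_kk = cov[np.ix_(known_idx, known_idx)]
        s_fk = cov[np.ix_(free_idx, known_idx)]
        s_ff = cov[np.ix_(free_idx, free_idx)]
        diff = known_vals - mu[known_idx]
        solved = np.linalg.solve(s_kk, diff)
        # Conditional mean and covariance of the free dimensions.
        c_means.append(mu[free_idx] + s_fk @ solved)
        c_covs.append(s_ff - s_fk @ np.linalg.solve(s_kk, s_fk.T))
        # Reweight each component by how well it explains the known values.
        _, logdet = np.linalg.slogdet(s_kk)
        log_like = -0.5 * (diff @ solved + logdet
                           + len(known_idx) * np.log(2.0 * np.pi))
        log_w.append(np.log(w) + log_like)

    log_w = np.array(log_w)
    new_w = np.exp(log_w - log_w.max())
    return new_w / new_w.sum(), np.array(c_means), np.array(c_covs)

# Toy 2D mixture: dimension 0 plays the SN parameter, dimension 1 the host property.
cov = [[1.0, 0.5], [0.5, 1.0]]
new_w, c_means, c_covs = condition_gmm(
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [4.0, 4.0]],
    covs=[cov, cov],
    known_idx=[1],
    known_vals=[1.0],
)
print(new_w, c_means.ravel(), c_covs.ravel())
```

Sampling a few thousand points from the conditioned mixture would then give the conditional distributions asked about in the thread.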

I would be happy to discuss incorporating empiriciSN into the next Twinkles simulation via Skype or phone at some point. I am going to be fairly busy in the coming weeks catching up on some things related to my thesis that have been waiting over the summer, and I am going to be applying for jobs in the Fall, so I think I would prefer to keep my involvement to a minimum, but I am definitely willing to work with you guys to make this work for you---that was one of our primary goals in making it! I think probably the best solution is going to be discussing what exactly you would need the tool to produce for you, and then we can tweak things as necessary to make it work.

> The sample we trained the model on has ~1400 SNe ranging out to a bit beyond redshifts of 1 (from SNLS and SDSS), and there is definitely a correlation between redshift and the x0 parameter. I trained on x0 directly by design, as hosts with different redshifts definitely do give different x0 values. My thinking on this was that we want to be able to get x0, x1, and c given the host parameters (redshift, separation, color, and local surface brightness), and redshift allows the model to narrow down the acceptable x0 values by quite a bit. We could certainly try training it using a different quantity though.

I am enthusiastic about getting SN properties from this method. But the question is whether to train on intrinsic properties of SNe or on observed properties that reflect the intrinsic properties plus cosmology, which effectively gives you the much harder problem of learning both the distribution of intrinsic properties and the cosmology. Providing a cosmology will bias your results (if the cosmology is wrong), but in what we will be using this for (i.e., simulation), everything will be wrong if the cosmology does not make sense! So I am not too worried about the possibility of bias. I would worry about whether the method would reliably explore the distributions without this additional prior.

But, we could test out these statements: I think what we would need to have is a sample of test galaxies spanning a redshift range of 0-1.2 (say).

What is the minimal set of features you think the galaxies must have to be good candidates for such tests? We could use your model to draw x0, x1, c, and start comparing histograms of the quantities of interest. I would be happy to help in constructing or performing such tests. Let me know how you think I can best help.

> The SDSS host photometry, which is what I used for all the host information, provides only information for an exponential or de Vaucouleurs profile fit to the host photometry, and this is what we used to train the model. So all the surface brightnesses and radii come from those fits. The SDSS profile fits do give B/A ratios and position angles (rotation in the plane of the sky), so we could try to incorporate those into selecting a proper position. I'm not sure what the best way to do this would be...I don't think we want to include those quantities in the model fit necessarily, but perhaps we could do something like select a radius in the way we currently do it, and then use the angle and axis ratio to somehow select an actual position in the host. I'll have to think on that for a bit.

That thinking sounds good to me.
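One possible reading of "select a radius, then use the angle and axis ratio to select an actual position" is sketched below. It is a guess at the idea, not an agreed method: it assumes the selected radius is the semi-major-axis (elliptical) radius, draws the angle around the ellipse uniformly, and measures the position angle counterclockwise from the x axis, which may differ from the SDSS convention. The function name `position_in_host` is hypothetical.

```python
import numpy as np

def position_in_host(radius, axis_ratio, position_angle_deg, rng):
    """Place a point on the ellipse whose semi-major axis is `radius`.

    Returns (dx, dy) offsets from the host center, in the units of `radius`.
    """
    phi = rng.uniform(0.0, 2.0 * np.pi)  # angle around the ellipse (assumed uniform)
    # Coordinates in the galaxy frame, with the major axis along x'.
    xp = radius * np.cos(phi)
    yp = radius * axis_ratio * np.sin(phi)
    # Rotate into the sky frame by the position angle.
    pa = np.deg2rad(position_angle_deg)
    dx = xp * np.cos(pa) - yp * np.sin(pa)
    dy = xp * np.sin(pa) + yp * np.cos(pa)
    return dx, dy

rng = np.random.default_rng(42)
dx, dy = position_in_host(radius=2.0, axis_ratio=0.6, position_angle_deg=30.0, rng=rng)
print(dx, dy)
```

This keeps the axis ratio and position angle out of the model fit, as suggested, and uses them only after the radius has been drawn.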

> For how sensitive the model is to changes in the SALT parameters, I'm not sure, since I've only ever trained it on the existing ones. It would be very easy to swap out the Sullivan et al. ones with the JLA ones and redo the fit to compare, so if you can point me to a good place to get those, I can do that. Would there be SALT parameters for both the SDSS and SNLS samples?

Yes, it should be easy. This is the set I would recommend: http://cdsarc.u-strasbg.fr/vizier/ftp/cats/J/A+A/568/A22/tablef3.dat This has parameters for all the SNLS supernovae used in cosmology fits, which seems to be what you were using. It has parameters for the SDSS SNe Ia used in JLA (a total of ~500 if I recall correctly), but not for the ~1400 SNe that you have. (Are those photometrically identified?)

Note: Another thing is that SNLS and SDSS often use different conventions for x0. An easy way to check would be to see whether the x0 values of the SDSS supernovae in the above link (what I would call the SNLS/SALT convention) differ systematically by a constant factor from the ones you were using for SDSS. If so, you should change the x0 values of all SDSS supernovae to account for this convention difference.
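The convention check described in the note could look roughly like this. The function name `x0_convention_factor` is hypothetical, the arrays stand for x0 values of the same SNe matched between the two catalogs, and the numbers (including the factor of 2) are made up purely for illustration; they are not the real convention difference.

```python
import numpy as np

def x0_convention_factor(x0_reference, x0_other):
    """Median ratio reference/other for matched SNe, plus its fractional scatter."""
    ratio = np.asarray(x0_reference, dtype=float) / np.asarray(x0_other, dtype=float)
    factor = np.median(ratio)
    scatter = np.std(ratio) / factor
    return factor, scatter

# Toy matched sample: the second catalog is offset by a constant factor plus small noise.
x0_reference = np.array([1.0e-5, 2.0e-5, 5.0e-6])
x0_other = x0_reference / (2.0 * np.array([1.00, 1.01, 0.99]))

factor, scatter = x0_convention_factor(x0_reference, x0_other)
x0_rescaled = x0_other * factor  # bring the second catalog onto the reference convention
print(factor, scatter)
```

A small fractional scatter around a factor different from one points to a pure convention difference; a large scatter would suggest the two sets of fits differ in more than normalization.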

> I could definitely produce plots showing the distributions of any of the host or SN parameters used in the fit with respect to each other. If you take a look at the PlotCorr notebook in the repo, that contains plots of the SN parameters vs. all the host properties we used to train the model, taken directly from the data. It would be easy to sample a few thousand data points from the trained model and plot those results too, if that's more what you're looking for. In theory, the XDGMM model should recover the underlying "true" distribution from the noisy data used in the fit, so it could give a better sense of what the actual distribution looks like.

I have looked at that notebook, and I think the essential additions that I am interested in are:

Distributions (histograms or contour plots) rather than scatter plots.
Marginalized distributions of the SALT2 parameters, except that in place of x0 I would be interested in looking at two quantities: -2.5 * log10(x0 * d_L^2(z)), which is like an absolute magnitude, and -2.5 * log10(x0) + 0.13 * x1 - 3.1 * c - mu(z), which is like a constant plus scatter.
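Written out, the two quantities requested in this list are as follows. The symbols $M_{\mathrm{like}}$ and $\Delta$ are labels introduced here for convenience, and the distance modulus is assumed to follow its usual definition.

```latex
\begin{align}
M_{\mathrm{like}} &= -2.5 \log_{10}\!\left( x_0 \, d_L^2(z) \right), \\
\Delta &= -2.5 \log_{10}(x_0) + 0.13\, x_1 - 3.1\, c - \mu(z), \\
\mu(z) &= 5 \log_{10}\!\left( \frac{d_L(z)}{10\,\mathrm{pc}} \right).
\end{align}
```

Because $\mu(z) = 5 \log_{10} d_L(z) + \mathrm{const}$, the first quantity equals $-2.5 \log_{10}(x_0) - \mu(z)$ up to a constant, so the two differ only by the stretch and color terms $0.13\, x_1 - 3.1\, c$ and an offset.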

> I would be happy to discuss incorporating empiriciSN into the next Twinkles simulation via Skype or phone at some point. I am going to be fairly busy in the coming weeks catching up on some things related to my thesis that have been waiting over the summer, and I am going to be applying for jobs in the Fall, so I think I would prefer to keep my involvement to a minimum, but I am definitely willing to work with you guys to make this work for you---that was one of our primary goals in making it! I think probably the best solution is going to be discussing what exactly you would need the tool to produce for you, and then we can tweak things as necessary to make it work.

OK. Maybe @drphilmarshall and I should try settling on what we want and give your model a shot and get back to you when we are stuck (Of course we will keep you informed of our attempts!)

Hi Rahul (and Phil),

I apologize for being slow to respond; I've had some family matters come up in the last week and had to travel unexpectedly.

I have been making some edits to finalize the XDGMM class for a paper we are writing up on it, and I want to do the same for EmpiriciSN now. I know the next Twinkles simulation is happening, so I wanted to get in touch to see if we can make this work and get it incorporated before it's too late. (If possible.)

In response to your message, is there a column description for the table you linked to of SALT2 parameters? I tried digging around on Vizier but couldn't find it. I am using the SDSS supernovae classified as "SNIa" (spectroscopically confirmed) or "zSNIa" (photometrically identified with a host redshift). I am hesitant to change the SALT parameters if it means drastically reducing the size of the dataset, since it's already a little on the small side, and I want the SALT parameters all coming from the same source for consistency, so my inclination would be to not change them if I can't find them for the whole sample.

For the x0 parameter, the SDSS ones come from the Sako et al. dataset, while I actually calculated the SNLS ones myself. The SNLS source I used provided x1, c, a redshift, and a peak rest-frame B-band magnitude, so I used SNCosmo to calculate the x0 parameter from those.

Anyway, my question now is what are the key things you would need changed/updated to incorporate EmpiriciSN into Twinkles at this point? (e.g., what needs to be done now vs. what tests/etc. would need to be run at some point, but aren't needed right now?) Depending on how much needs to be done, maybe it won't be possible to incorporate it this time, but I would like to get EmpiriciSN into a somewhat final "production state" so that I can write about it in our paper. That basically means I want to have all the necessary functions (fitting a model, choosing a radius, sampling SN parameters) working; the datasets can always be changed later.

@tholoien

Thanks for getting in touch ... everyone is busy with things that have to be taken care of, so that is perfectly understandable.

Let us split this issue into three threads to keep track of it.

[ ] What is necessary for Twinkles (I think this is the lowest bar, because we are not planning on doing a cosmology analysis, and our plans are a combination of making things somewhat realistic and what would aid our computations. So, we need some variables, and have an idea of how realistic it is, but we don't need things to be very realistic.)
[ ] Validating empiriciSN through tests and benchmarks. This is necessary for incorporating this into SN simulations that will be used for cosmology analysis, and I suppose it would be important if you plan to write a paper on empiriciSN.
[ ] Questions on methodology: data, x0 values, Intrinsic properties of SN

I will start the Twinkles thread, and we can start discussing the other ones.

tholoien / empiriciSN

Add demo notebook #18