EmpericiSN methodology and datasets

rbiswas4 commented 8 years ago

`x0 values used.`

I am using the SDSS supernovae classified as "SNIa" (spectroscopically confirmed) or "zSNIa" (photometrically identified with a host redshift). I am hesitant to change the SALT parameters if it means drastically reducing the size of the dataset, since it's already a little on the small side, and I want the SALT parameters all coming from the same source for consistency, so my inclination would be to not change them if I can't find them for the whole sample.

For the x0 parameter, the SDSS ones come from the Sako et al. dataset, while I actually calculated the SNLS ones myself. The SNLS source I used provided x1, c, a redshift, and a peak rest-frame B-band magnitude, so I used SNCosmo to calculate the x0 parameter from those.

OK there are a few issues tangled up here:

Consistency: Well, the methodology you describe actually obtains parameters from multiple sources here, so, it is not consistent in the way you desire. In fact, it is probably worse than you imagine (but I would not blame you for that). Here are some of the issues involved:

Sullivan et al. used an older version of the SALT2 model (Guy10) than what is the default model (Betoule et. al., 2014) in SNCosmo and was used in JLA, or the SDSS data release (Sako et. al., 2014). So, while the parameters x1 and c have the same names, they do not mean the same things (and is a bit like measuring the same thing with two different but similar units).
- JLA values are better than the Sullivan et al. values. People had just learned more by then. In particular, we had a slightly different model, but more importantly, improvements in calibration that shifted the parameters in correlated ways leading to important differences in cosmology. I do not necessarily believe that this particular improvement will necessarily be important for you as well.
- As far as the JLA sample of SNLS supernovae goes, it is almost the same as the Sullivan sample. There were a handful (~ 4) SNLS + low z SN that did not make the JLA sample.
- x0 as defined in SDSS (Sako et. al., 2014) is about a factor of 1.3 different from x0 defined in JLA/SNCosmo . This is due to conventions, not due to differences in software. So, you have to be careful in collating x0 values from SDSS, SNCosmo and SNLS/JLA sources. A possible out, is using that factor to modify the SDSS SN parameters to make them equivalent to JLA/SNCosmo.
- The data SDSS SN that are photometrically identified tend to be of a much lower quality than the other SNIa. It certainly includes several CC SN as is apparent by the Hubble Diagram of the sample. Selection biases are a tricky question, particularly in ML enterprises, and I think this should be addressed separately. I would assume that aside from justifying selection bias answers, you will be asked why you did not take photometrically identified SNLS SNIa (I think that is Bazin et. al, 2009) since you did take SDSS SNIa of the kind. (Don't know if Bazin had host z, that may be a good distinction).

I believe everyone involved will agree to the above statements, they are not very controversial. Still, I think the information about the changes is hardly posted out there in a way anyone can obtain it. This is one of the problems we have.

x0 or Mabs ?

To get a feel for this question, I would like to see a plot of z, vs {x0, x1, c} for a set of SN with parameters and abundances drawn from a galaxy distribution. I can help on this if you want.

tholoien commented 8 years ago

Alright, well it sounds like we definitely should update the dataset. It seems like the easiest thing to do would be to use only the JLA values to start with. That gives us mb, x1, c, and z, and we can calculate x0 using SNCosmo. From reading your message, it sounds like this would give us a consistent sample with both SDSS and SNLS supernovae that uses the latest version of the SALT2 model, but correct me if I'm wrong.

That gives us SN parameters for ~500 SDSS supernovae and ~230 SNLS supernovae. However, when I looked up the SNLS hosts in the SDSS catalog, I was only able to get host photometry for ~160 of them. I think it is important for us to have consistent host photometry, so if we can get by with only these SNLS ones, that would be better in my opinion than trying to add more from another source and making them consistent with the SDSS measurements somehow. The nice thing about the SDSS host photometry is that they modeled the host profiles in a consistent way and provide all the parameters (effective radius, model magnitudes, axis ratio, etc.), so it's easy to then use these to fit a model.

Using the sample of supernovae with both JLA SALT2 parameters and SDSS host photometry would give us the most consistent set. However, this would leave us with around 650 supernovae, give or take. First step is probably to try fitting the model with these, and seeing how it does. It's possible that this is enough to get something that works.

If it isn't, probably the best thing to do is add the SDSS photometric sample and try to adjust the SALT2 x0 as you suggested, since that seems to be an easier adjustment than trying to get more consistent host photometry. But like I said above, I think the first step is to get the set of ~650 SNe with good SALT2 parameters and host photometry and try fitting the model with that, unless you guys think that it will just no be enough data.

What do you think?

rbiswas4 commented 8 years ago

I agree that getting the ~650 SNe from JLA is a good first step.

Here is a different (link to the table that will hopefully be useful) [http://supernovae.in2p3.fr/sdss_snls_jla/jla_likelihood_v4.tgz] . The relevant file is 'data/jla_lcparams.txt' and that should give you the SALT2 params (mB). This is already a csv file as opposed to the vizier link I had pointed you to, so it should be easier.

This does not solve the problem of m_B vs x_0 which you had mentioned. So, if you are calculating x_0 by setting the BessellB rest frame value to mB. and getting x_0 , that is fine, but do use the MW extinction. Finally, do note (even if this is not corrected for), that the SNLS transmission model used in JLA has a position dependence on the focal plane position (so this might be hard to model in SNCosmo unless it has been done). An alternative is to fit all of these [light curves (http://supernovae.in2p3.fr/sdss_snls_jla/jla_light_curves.tgz) using SNCosmo : they are in SALT format, which can be read with SNCosmo.

On the question of whether 650 SN is enough, I have very little intuition. But,

If you decide to add the extra SDSS light curves, you can either fit them with SNCosmo, or do the correction I suggested after checking that this is still true. A way would be to match the x0 values in JLA with the values in SDSS for SDSS SNe.
If you want to add the additonal SNLS SNe, please send me a list of SNIDs that you are not getting host photometry for (and a few for which you have a source, and the source). I don't have any expertise on host photometry, but can try asking around.

tholoien commented 8 years ago

Hi Rahul,

Just to clarify, I need to correct the m_B for Galactic extinction before calculating x0? Just want to make sure I do things correctly.

rbiswas4 commented 8 years ago

Hi Tom,

I would use the correct SALT model from SNCosmo like in here: http://sncosmo.readthedocs.io/en/v1.3.x/models.html#creating-a-model-with-a-source-and-effect-s The important part is to keep the MW extinction and read it off from the SFD maps for the ra, dec of the SN (the default host extinction being 0 is OK). You should be able to setx1, c, z and then you can either fit for x0 as the only varied parameter, or (if I understand what you are doing) use the set_source_peakabsmag and use a chosen cosmology in the cosmo parameter of this method, and in calculating the distance modulus for changing mB to MB. Then you can use model.get('x0')

tholoien commented 8 years ago

It looks like fitting the x0 parameter from the light curve is the way to go, as sncosmo doesn't seem to incorporate the MW dust into the calculation of x0 if I set the absolute magnitude directly. I am going to have to add the MEGACAM bandpasses to sncosmo to do the SNLS fits. Do you anticipate any problems with doing this, since it won't be incorporating the focal plane position as they did in their analyses?

rbiswas4 commented 8 years ago

I just atted you on a PR that is meant for the radial function. If they can get that ready in a day or so, I would suggest waiting just to use the right thing, otherwise we should go ahead as you suggest. There will be a small difference due to the throughputs, but I would not expect that to change the world for what you are doing.

The megacam bandpasses are found in a number of locations: including https://github.com/lsst/throughputs. You probably want to use the most 'official megacam' source! It would be good to add it into SNCosmo as a PR (and share it with the world) rather than keeping it to this project only. Let us see the reply to the comment ....

tholoien commented 8 years ago

So based on what Kyle said it sounds like we should move forward with the average bandpasses for now, and update to the proper ones later once SNCosmo has been updated. Agreed?

rbiswas4 commented 8 years ago

Exactly! Use the i band rather than i2 in MegaCam. I am dealing with the same problem in another repository, though the simultaneity is not entirely coincidental :) .

tholoien commented 8 years ago

Okay, well I fit all the SDSS and SNLS JLA light curves and computed x0 for each one, and calculated an uncertainty using bootstrapping. I then updated the datasets to use only the JLA parameters (with x0 that I fit myself), and ran into a problem: the model is running into an error whenever I try to fit with more than 2 components. Run the demo notebook to see what I mean---the BIC test fails when it gets to 3 components, and no higher numbers work either.

I have seen this happen before, with the old dataset, but not until n_components~10 or so. My guess is that it has something to do with the fairly large number of parameters being used in the fit and the small number of supernovae in the dataset. The old dataset had 161 SNLS supernovae and 1273 SDSS supernovae, as it included all the photometrically identified SNe. The JLA set isn't even using all the spectroscopically identified SDSS SNe, so the current dataset has 159 SNLS supernovae and 317 SDSS supernovae --- not even 500 total. (This accounts for all SNe from SNLS and SDSS that have x1 and c values from JLA and SDSS host photometry.) I actually just now noticed an error I had made that caused me to exclude more SNe than I needed to, but fixing this won't dramatically change the numbers with just the JLA set.

The error is coming from the AstroML fitting code, not my code, and like I said I think it is being caused by the size of the dataset. I think we may need to incorporate the photometrically identified SDSS supernovae at this point, since I know fitting works with larger datasets and I'm not sure how to go about trying to fix the error otherwise. A bit of spot checking of the SNe that are in the JLA sample indicates that the x0 values fit by Sako et al. are off by a factor of ~1.2-1.3 from the ones I calculated, as Rahul noted above. I'm thinking I will just calculate a mean offset from the JLA supernovae and adjust the other Sako values accordingly, rather than re-fitting all the light curves. (Also, I don't have the SDSS light curves and I'm not sure where to get them---the data link on the Sako et al. arXiv page appears to be dead.)

If either of you has any other suggestions, let me know.

rbiswas4 commented 8 years ago

and 317 SDSS supernovae --- not even 500 total. Yes, the JLA dataset (since it was meant to do cosmology) had fairly stringent cuts that are described in the paper. In particular, the SNR cuts and the redshift cut result in the small SDSS part of the sample.

Run the demo notebook to see what I mean---the BIC test fails when it gets to 3 components, and no higher numbers work either.

Will try running it, but if I understand you correctly, you are saying that maximizing the likelihoods is resulting in a failure because with too few data points and many components, the likelihood to too fuzzy. But not at the step of calculating BIC. Is that correct?

It is good to know what the minimal data set size is for such components. If you use the SDSS light curves, yes, you can change the x0 factors. The factor that I would use is

10.0**(0.4 * 0.27)

Do you have a feel for whether including all the spectroscopically identified SNe from SDSS would help?

tholoien commented 8 years ago

Some kind of error is coming out in the fitting process, which is part of the BIC test (it does a fit with different values of n_components, then calculates the BIC). So yeah, it's the fitting that is having a problem, not the step of calculating BIC. There is one thing I want to try before committing to adding more SNe to the datasets, but if that doesn't work then I think I have to add more, since it seems to fix the problem.

Not sure whether adding the other spectroscopically ID'ed SDSS SNe will be enough to help, but I was going to add those first, try again, then add the photometric ones after that if needed.

One thing I noticed last night (the error I wrote about) was that the SDSS host magnitudes from Sako et al. have been de-reddened through some kind of modeling, which I think incorporates both MW and host extinction. I have been using the model magnitudes of the SNLS hosts directly, so my error was not correcting for MW extinction earlier, and I will fix that. This has me thinking though, should I be correcting the SDSS host mags myself in the same way, rather than using the ``de-reddened'' ones from Sako et al.? The Sako et al. mags are probably more accurate, but this would introduce another inconsistency into the dataset. So is it worth it to be consistent with how all the hosts are corrected for extinction, even if it's not as good of a correction? Thoughts?

tholoien commented 8 years ago

Did a bit more testing. It does seem to be the number of items in the dataset, and adding the other spectroscopically ID'ed SDSS SNe does work a bit better (it's able to get up to 4 components to work), but it looks like we're going to need the photometrically ID'ed ones too.

Anyone want to chime in on the extinction correction? I'm leaning towards correcting the model mags myself to account for Galactic extinction only, but could see the value in going the other way too.

rbiswas4 commented 8 years ago

I do not have any useful intuition on this. What you suggested sounds quite reasonable and I would probably try the same thing ....

On Sep 21, 2016, at 11:41 AM, Tom Holoien notifications@github.com wrote:

Did a bit more testing. It does seem to be the number of items in the dataset, and adding the other spectroscopically ID'ed SDSS SNe does work a bit better (it's able to get up to 4 components to work), but it looks like we're going to need the photometrically ID'ed ones too.

Anyone want to chime in on the extinction correction? I'm leaning towards correcting the model mags myself to account for Galactic extinction only, but could see the value in going the other way too.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

tholoien commented 8 years ago

Okay, I finally figured out this error, after a large amount of testing and digging around. The error I was getting comes from scikit-learn's score_samples method, and it occurs when there are too few data points that map to a particular component in the GMM model. This happens I think when you have a Gaussian in the model that gets fit to a small number of outlying data points, and thus none of the other data points get mapped to that component and it throws this error.

It turns out that this wasn't just a case of too few data points. For some reason, when I fit the JLA light curves one of the SDSS SNe had an x0 value of -6.55, which is clearly a big outlier, if not impossible. Consequently, when the number of Gaussians in the model got high enough, it was fitting a component to this single data point, and that was the one that was causing the errors. I took this one out, and the model ran without error up to 7 Gaussian components. I've tried refitting this a few times, but always end up with a similar answer, so I'm just going to exclude it.

I've added in the remaining spectroscopically identified SDSS SNe to give us a slightly bigger sample, and it now is able to run a BIC test from 1-8 components without issue. (See the demo, which I will update in full later.) The BIC test is now preferring a much smaller number of components (3) than it had previously, so I will do a little more testing to see what the optimal number is. I think 3 will probably be a little too low, but it's likely going to be fewer than it was before. Then I can run the demo and see how the new model works with the smaller number of data points. Now that I got rid of that bad SDSS data point, I should be able to easily add more data later, if we decide we want to do that.

I corrected all the model magnitudes for both datasets for Galactic extinction myself, as mentioned above. The differences in the SDSS mags vs. the de-reddened ones from Sako et al. are pretty minor, so it's not really a big change, but now we can say it's consistent. Thinking about it more, I don't think there was any "host extinction" correction in the Sako et al. de-reddened mags, since we are talking about galaxy light as a whole, so that wouldn't make a lot of sense.

When I was looking at the SDSS supernovae that are in the JLA catalog, I noticed that he x1 values from JLA also differ quite a bit from the x1 values reported in Sako et al. for the same supernovae, sometimes by quite a lot. The c values are essentially the same, so this does seem to be an actual change. Do you think this is something we need to worry about/try to correct for in the spectroscopic SDSS SNe that aren't in JLA?

rbiswas4 commented 8 years ago

It turns out that this wasn't just a case of too few data points. For some reason, when I fit the JLA light curves one of the SDSS SNe had an x0 value of -6.55,

It is physically impossible. Now, small amounts of physical impossibility are sometimes possible in some analyses because they don't matter much, but this is too large. Are you sure this was in the JLA sample ? In that case something must be wrong (ie. the light curve you are downloading etc. ) Could you tell me the source for the light curve and the ID? (If this sort of thing happened in the SDSS supernovae from the data release I would not be as surprised)

When I was looking at the SDSS supernovae that are in the JLA catalog, I noticed that he x1 values from JLA also differ quite a bit from the x1 values reported in Sako et al. for the same supernovae, sometimes by quite a lot. The c values are essentially the same, so this does seem to be an actual change. Do you think this is something we need to worry about/try to correct for in the spectroscopic SDSS SNe that aren't in JLA?

This is correct. The change is due to fitting with different codes (numerical systematics) and is much less than what it used to be. The correct way to think about comparing c and x1 deviations is to weight c values by ~ 3 though. I would not worry about it at this point and would move on to see exactly what the model is predicting, but you could say that in order to be consistent one should fit everything together.

tholoien commented 8 years ago

This is the supernova SDSS14331. It's one of the ones that I was fitting the light curve for myself, so definitely in the JLA sample. When I use the given x1 and c parameters and fit the light curve for x0 I get the same value (-6.55) each time. I got the light curves and SALT2 parameters from the links in your 2nd comment in this thread, same as all the other SNLS and SDSS JLA ones.

rbiswas4 commented 8 years ago

Thanks ... I will try investigating this!

rbiswas4 commented 8 years ago

@tholoien I am unable to reproduce this. Here is what I have: https://github.com/rbiswas4/AnalyzeSN/blob/master/examples/Check_SDSS14331_SALT2Format.ipynb Can we figure out what we are doing differently?

tholoien commented 8 years ago

I wasn't using the modelcov=True flag in my fit. When I do that, my results match yours. I will redo the fits when I get a chance to make sure they're all correct.

rbiswas4 commented 8 years ago

I would advise using modelcov=True, though it has its own set of problems elsewhere. Another thing to sometimes do is specify the minsnr parameter explicitly.

But, in this case I tried out modelcov=False and did not get the negative value either. I did not expect modelcov to be the reason, because it usually only has big impacts at the edge of the phase space (extreme values of time wrt to peak or wavelength values that are closer to the edge of the SALT model bounds, where the model is not well trained in these regions) and the light curve shows it as a well sampled SN. Maybe in the fit you did before, the time you got back is really wrong causing this? The only reason I am asking these questions is that if there was no typo but we get bad results without modelcov, it might be an anomalous case (which are always interesting to enumerate) you have stumbled on.

tholoien commented 8 years ago

Ah, yes, I get t0=54087 when I run it with modelcov=False, vs. t0=54013 with modelcov=True. I definitely still get the weird results with modelcov off when I run it now, but I'm only setting x1, z, c, and mwebv and I'm fitting for x0 and t0.

One question I had, do we want to be using CMB-frame redshifts, or heliocentric ones? Every time I look this up I end up thinking differently, so thought it would be good to ask.

rbiswas4 commented 8 years ago

That is interesting and thanks for isolating this example!

Quick answer to your z question:

For light curve fits: heliocentric, because what you are looking for is converting from observer frame to rest frame for wavelengths.

On Sep 22, 2016, at 4:57 PM, Tom Holoien notifications@github.com wrote:

Ah, yes, I get t0=54087 when I run it with modelcov=False, vs. t0=54013 with modelcov=True. I definitely still get the weird results with modelcov off when I run it now, but I'm only setting x1, z, c, and mwebv and I'm fitting for x0 and t0.

One question I had, do we want to be using CMB-frame redshifts, or heliocentric ones? Every time I look this up I end up thinking differently, so thought it would be good to ask.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

tholoien commented 8 years ago

Okay, I finalized the dataset to include the freshly fit JLA light curves for the JLA supernovae and all other spectroscopic and photometric SDSS SNe, with corrected x0 values. I am using heliocentric redshift, as I had been, and as Rahul suggested.

The reason I decided to include all the photometrically ID'ed SNe was that when I ran through the demo, there just wasn't a large enough sample to be able to do the train/test split and tests that we're doing. So, while the spectroscopically ID'ed and JLA supernovae are enough to fit a model, I think it's too few SNe to build a working predictor tool.

I have rerun the PlotCorr and demo notebooks to use the new datasets. As you can see from the demo, the tool isn't quite as good at replicating the training distribution from the test distribution as it was before updating the data, but it's pretty close. The BIC test is a little weird, but some testing on my own indicates 6 components seems to be a good number. In the end, I think this is a fairly decent tool, given the data we are putting into it, and I think it should be useful for TWINKLES. Somewhere down the line if we get a nice big set of DES supernovae (or something like that) with consistent light curve and host parameters, I expect the model would become more accurate.

I'll be uploading a new 6-component model to the Models folder when I'm done fitting one.

tholoien / empiriciSN

EmpericiSN methodology and datasets #23

`x0 values used.`

x0 or Mabs ?