anetasie opened this issue 7 years ago
I was looking at this last week, and it's not clear to me what is actually going on; I was originally worried about this, but now am not so sure. Internally, DataSimulFit and SimulFitModel instances are created, based on the loaded data, and passed to the sampler.
I am concerned that if you have multiple datasets loaded but only want to use a subset then it doesn't work.
Also, this code is somewhat hard to follow, so I could well be mis-reading it.
At the moment only a single data id is allowed as a subset, which is a limitation: either all the loaded data is used for the simultaneous fit or only one dataset.
Is this issue still relevant?
I believe it is still relevant. Say I have 3 datasets loaded and I only want the errors on the flux using datasets 2 and 3. I do not think we can do this with the current sample_flux function. We could just say in the documentation "don't do this", but I expect we want to support this use case.
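For illustration, here is a sketch of the only workaround that exists today, since sample_flux can use either every loaded dataset (id=None) or a single id: start a fresh session and load just datasets 2 and 3 (the file names below are assumptions).

from sherpa.astro.ui import *

# Only the two datasets of interest are loaded, so the simultaneous fit,
# the covariance calculation, and the flux sampling all use just these two.
load_pha(2, "obs2.pi")
load_pha(3, "obs3.pi")
set_source(2, xswabs.gal * powlaw1d.pl)
set_source(3, gal * pl)
fit()
covar()
flux = sample_flux(pl, lo=0.3, hi=8.0, correlated=True, num=1000)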
Just a note that we had a user hit this problem and send in a report via cxc helpdesk (ticket 21004).
I have 4 observations of a source in Haro 11 (spanning the years 2006-2017) and I want to see if the flux has changed. The spectrum looks quite different at low energies, but that appears to be due to the contaminant.
I fit a simple model to each spectrum (xswabs*powlaw) and ran sample_flux to get an idea of the flux plus uncertainty.
sherpa-60> flux1=sample_flux(comp1,lo=0.3,hi=8.0,Xrays=True,correlated=True,num=100)
original model flux = 9.1421e-14, + 9.34095e-15, - 7.30664e-15
model component flux = 9.1421e-14, + 9.34095e-15, - 7.30664e-15
sherpa-61> flux2=sample_flux(comp2,lo=0.3,hi=8.0,Xrays=True,correlated=True,num=100)
original model flux = 9.34246e-14, + 8.08303e-15, - 9.67405e-15
model component flux = 1.01498e-13, + 1.48371e-14, - 1.82864e-14
sherpa-62> flux3=sample_flux(comp3,lo=0.3,hi=8.0,Xrays=True,correlated=True,num=100)
original model flux = 9.16655e-14, + 9.88112e-15, - 8.59698e-15
model component flux = 8.081e-14, + 1.56116e-14, - 1.635e-14
sherpa-63> flux4=sample_flux(comp4,lo=0.3,hi=8.0,Xrays=True,correlated=True,num=100)
original model flux = 9.28522e-14, + 6.74755e-15, - 8.60058e-15
model component flux = 7.1062e-14, + 1.37082e-14, - 1.18501e-14
Here's the problem: for observations 2, 3, and 4 the "original model flux" is different from the "model component flux". But there is only one model, so I would expect them to be the same (as is the case for flux1).
calc_energy_flux gives a number that is very close to the model component flux:
sherpa-71> calc_energy_flux(0.3,8.0,id=1)
9.4173293455141918e-14
sherpa-72> calc_energy_flux(0.3,8.0,id=2)
1.046753057933798e-13
sherpa-73> calc_energy_flux(0.3,8.0,id=3)
8.9064748825681024e-14
sherpa-74> calc_energy_flux(0.3,8.0,id=4)
7.7509429111129888e-14
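For context, here is a minimal sketch of the kind of session that produces output like the above; the file names are assumptions, and the component names (comp1 to comp4) are inferred from the calls shown.

from sherpa.astro.ui import *

# Four observations of the same source; each dataset gets its own absorbed
# power law, so there are no linked parameters between datasets.
for i in range(1, 5):
    load_pha(i, "haro11_obs{}.pi".format(i))
    absmdl = create_model_component("xswabs", "abs{}".format(i))
    plmdl = create_model_component("powlaw1d", "comp{}".format(i))
    set_source(i, absmdl * plmdl)

fit()

# Flux and uncertainty for one observation, using its power-law component.
flux2 = sample_flux(get_model_component("comp2"), lo=0.3, hi=8.0,
                    Xrays=True, correlated=True, num=100)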
I have a copy of the files locally in /data/lenin2/HelpDeskPub/21004
Based on this issue, I asked the user to load datasets individually
Since your models are all independent and all the datasets are independent (no linked parameters), as a work-around you should be able to fit each of the 4 spectra individually. [Since the models are independent, you should get statistically equivalent results fitting each dataset individually.]
which they reported worked correctly
OK, when I do this, I get sensible answers. Something odd is happening with sample_flux and multiple datasets.
I think this also holds for sample_energy_flux and sample_photon_flux.
There are some usability issues to think about related to simultaneous fits that do not use the same model - e.g.
set_source(1, xsphabs.gal * powlaw1d.pl)
set_source(2, gal * pl * const1d.scale2)
which allows the normalization to vary (e.g. to account for cross-instrument calibration). It may just be a case of documentation, but I'm not completely convinced at this point.
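To make this concrete, here is a sketch of that kind of setup (the file names are assumptions); the extra constant is only present in the second dataset's source expression, so it acts as the cross-calibration factor relative to dataset 1.

from sherpa.astro.ui import *

# Two spectra of the same source from different instruments.
load_pha(1, "inst1.pi")
load_pha(2, "inst2.pi")

set_source(1, xsphabs.gal * powlaw1d.pl)
set_source(2, gal * pl * const1d.scale2)

scale2.c0 = 1.0   # start with no normalization offset; it is fitted below
fit()             # simultaneous fit of both datasets

Any flux sampled for dataset 2 then depends on scale2.c0 as well as the gal and pl parameters, which is where the question of which errors (and which covariance matrix) to use becomes important.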
I've added support for multiple datasets to sample_energy_flux and sample_photon_flux (which are used by sample_flux and so need to be updated) as part of #803. It should be easy to update sample_flux.
Actually, I don't believe the claim in @anetasie's initial point - that covar is only being run on a single dataset - is correct. If you run with id=None then the fit and error analysis is done on all datasets (just as it would be if you call fit() and covar()).
With the code below, I fit multiple datasets, report the covariance matrix and the square root of the diagonal [since this gets used when sampling the errors later], and then use sample_energy_flux with id=None and id=1. This is with the master branch, which I have changed so that the errors used to calculate the samples are printed out (as "DBG: scales for random sampling:" lines). My conclusion is that sample_energy_flux (and hence sample_flux, which just calls sample_energy_flux) does use the results of a simultaneous fit as long as you keep id=None, which is the default setting.
The relevant results are:
a) the results from fit and covar for the multiple datasets
covariance 1-sigma (68.2689%) bounds:
Param Best-Fit Lower Bound Upper Bound
----- -------- ----------- -----------
pl.gamma 1.64289 -0.139615 0.139615
pl.ampl 1.3219e-06 -1.3009e-07 1.3009e-07
Fit covariance:
[[1.94922475e-02 1.01390251e-08]
[1.01390251e-08 1.69232880e-14]]
As 1D errors:
[1.39614639e-01 1.30089538e-07]
So the covariance errors (ignoring correlation) are 0.14, 1.3e-7
b) the results from sample_energy_flux
We see the results from the id=None and then id=1 calls. The id=None call mentions all three datasets and then uses the error=0.14, 1.3e-7 values to create the samples. These are valid for the combined analysis.
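For reference, here is a sketch of the kind of sampling these scales feed into; it is not the actual Sherpa internals, just the standard correlated multivariate-normal draw, using the best-fit values and covariance matrix reported above.

import numpy as np

# Best-fit values (pl.gamma, pl.ampl) and covariance from the combined fit.
best = np.asarray([1.64289, 1.3219e-06])
cov = np.asarray([[1.94922475e-02, 1.01390251e-08],
                  [1.01390251e-08, 1.69232880e-14]])

# Draw correlated parameter samples; each row is a (gamma, ampl) pair that
# would then be pushed through the model to get a flux value.
rng = np.random.default_rng()
samples = rng.multivariate_normal(best, cov, size=100)
print(samples.shape)   # (100, 2)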
Multiple datasets (id=None)
WARNING: data set 1 has associated backgrounds, but they have not been subtracted, nor have background models been set
WARNING: data set 3 has associated backgrounds, but they have not been subtracted, nor have background models been set
WARNING: data set 4 has associated backgrounds, but they have not been subtracted, nor have background models been set
DBG: scales for random sampling: [1.39614639e-01 1.30089538e-07]
The id=1 version only mentions a single dataset and uses significantly larger errors when creating the samples:
Single dataset (id=1)
WARNING: data set 1 has associated backgrounds, but they have not been subtracted, nor have background models been set
DBG: scales for random sampling: [2.55060289e-01 2.21343632e-07]
Note that if I use covar(1) to evaluate the errors for dataset 1 (using the best fit for the combined dataset) we see the 0.255, 2.2e-7 errors that are reported in my debug lines:
>>> covar(1)
WARNING: data set 1 has associated backgrounds, but they have not been subtracted, nor have background models been set
Dataset = 1
Confidence Method = covariance
Iterative Fit Method = None
Fitting Method = levmar
Statistic = cstat
covariance 1-sigma (68.2689%) bounds:
Param Best-Fit Lower Bound Upper Bound
----- -------- ----------- -----------
pl.gamma 1.64289 -0.25506 0.25506
pl.ampl 1.3219e-06 -2.21344e-07 2.21344e-07
Here's the code
import numpy as np
from matplotlib import pyplot as plt

from sherpa.astro.ui import *

mdl = xswabs.gal * powlaw1d.pl

for did in range(1, 5):
    if did == 2:
        # skip 2 as it is different to the others
        continue

    load_pha(did, 'sherpa-test-data/sherpatest/obs{}.pi'.format(did))
    # subtract(did)
    set_source(did, mdl)

notice(0.5, 7)
set_stat('cstat')

gal.nh = 0.0393
gal.nh.freeze()
pl.gamma = 2.027
pl.ampl = 1.96e-4

fit()
covar()

cov = get_covar_results().extra_output
print("Fit covariance:")
print(cov)
print("As 1D errors:")
print(np.sqrt(cov.diagonal()))

n = 100

print("\nMultiple datasets (id=None)")
x1 = sample_energy_flux(lo=0.5, hi=7, id=None, num=n)

print("\nSingle dataset (id=1)")
x2 = sample_energy_flux(lo=0.5, hi=7, id=1, num=n)
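As a follow-up check (a sketch only; it assumes the first column of the returned sample array holds the flux, with the remaining columns holding the sampled parameter values):

# Compare the spread of the sampled fluxes from the two runs; the id=1 run,
# which samples with the larger per-dataset errors, should show the larger
# scatter.
print("id=None flux sigma:", np.std(x1[:, 0]))
print("id=1    flux sigma:", np.std(x2[:, 0]))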
This is not to say that we shouldn't add an otherids argument (as we do with other calls), so users can be explicit about what they want. I was just a bit surprised when I found out that id=None works as people would want it to.
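For example, the call might eventually look something like the following; this is hypothetical, since the otherids argument does not exist for these routines at the time of writing.

# Hypothetical: restrict the flux sampling (and the underlying error
# analysis) to datasets 2 and 3 only, mirroring the fit(id, *otherids) form.
x = sample_energy_flux(lo=0.5, hi=7, id=2, otherids=(3,), num=1000)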
@DougBurke covar() always works on all data sets. I have to go back and check what exactly was the issue I reported here. However, there was some work on calculating errors on eqwidth() and perhaps things were changed at that time. Anyway, it is good that id=None does what it is supposed to do.
I was under the impression that this issue was investigated in one of the L3 projects. As I remember it, I had suggested that one can test by mixing the order of the simultaneous data (if covar only uses the first data set then the results should be different, etc.).
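A sketch of that kind of check (the file names are assumptions): load the same spectra under permuted ids and confirm that the reported covariance bounds do not change.

from sherpa.astro.ui import *

# Load the same two spectra with the ids swapped the second time around.
# If covar() really uses all loaded datasets, the parameter bounds should be
# identical for both orderings.
for order in [(1, 2), (2, 1)]:
    clean()
    load_pha(order[0], "obs_a.pi")
    load_pha(order[1], "obs_b.pi")
    set_source(order[0], xswabs.gal * powlaw1d.pl)
    set_source(order[1], gal * pl)
    fit()
    covar()
    print(order, get_covar_results().parmaxes)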
sample_flux uses covar() to set the size of the multivariate gaussian for the simulations. In the current implementation covar() is run on only 1 data set. This is incorrect if the fit was performed on multiple data sets. sample_flux() needs to be expanded to accept multiple ids as input to the function call and then calculate covar/conf using these ids.