sherpa / sherpa

Fit models to your data in Python with Sherpa.
https://sherpa.readthedocs.io
GNU General Public License v3.0

sample_flux for multiple data sets #404

Open anetasie opened 7 years ago

anetasie commented 7 years ago

sample_flux uses covar() to set the size of the multivariate Gaussian for the simulations. In the current implementation covar() is run on only 1 data set. This is incorrect if the fit was performed on multiple data sets. sample_flux() needs to be expanded to accept multiple ids as input to the function call and then calculate covar/conf using these ids.
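For context, the sampling draws parameter sets from a multivariate Gaussian whose scale is set by the covariance matrix, then computes a flux for each draw. A minimal NumPy sketch of that idea (the parameter names and values here are illustrative, not taken from a real fit):

```python
import numpy as np

# Illustrative best-fit values and covariance matrix of the kind that
# covar() returns; these numbers are made up for the sketch.
best_fit = np.array([1.64, 1.32e-6])            # e.g. pl.gamma, pl.ampl
cov = np.array([[1.95e-2, 1.01e-8],
                [1.01e-8, 1.69e-14]])

rng = np.random.default_rng(7)
# correlated=True behaviour: draw parameter sets from the full
# multivariate Gaussian rather than from independent 1D errors
samples = rng.multivariate_normal(best_fit, cov, size=1000)
print(samples.shape)  # -> (1000, 2)
```

If covar() was run on only one data set, `cov` (and hence every sampled parameter set) reflects that one data set's uncertainties rather than the simultaneous fit's.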

DougBurke commented 7 years ago

I was looking at this last week, and it's not clear to me what is actually going on, as I originally was worried about this, but now am not so sure. Internally DataSimulFit and SimulFitModel instances are created, based on the loaded data, and passed to the sampler.

I am concerned that if you have multiple datasets loaded but only want to use a subset then it doesn't work.

Also, this code is somewhat hard to follow, so I could well be mis-reading it.

anetasie commented 7 years ago

At the moment only one data id is allowed as a subset, which is a limitation: either all loaded data sets are used for the simultaneous fit, or only one.

olaurino commented 6 years ago

Is this issue still relevant?

DougBurke commented 6 years ago

I believe it is still relevant. Say I have 3 datasets loaded and I only want the errors on the flux using datasets 2 and 3. I do not think we can do this with the current sample_flux function. We could just say in the documentation "don't do this", but I expect we want to support this use case.

kglotfelty commented 6 years ago

Just a note that we had a user hit this problem and send in a report via cxc helpdesk (ticket 21004).

I have 4 observations of a source in Haro 11 (spanning years 2006-2017) and I want to see if the flux has changed. The spectra look quite different at the low energies, but that is the contaminant.

I fit a simple model to each spectrum (xswabs*powlaw) and ran sample_flux to get an idea of the flux plus uncertainty.

sherpa-60> flux1=sample_flux(comp1,lo=0.3,hi=8.0,Xrays=True,correlated=True,num=100)
original model flux = 9.1421e-14, + 9.34095e-15, - 7.30664e-15
model component flux = 9.1421e-14, + 9.34095e-15, - 7.30664e-15
sherpa-61> flux2=sample_flux(comp2,lo=0.3,hi=8.0,Xrays=True,correlated=True,num=100)
original model flux = 9.34246e-14, + 8.08303e-15, - 9.67405e-15
model component flux = 1.01498e-13, + 1.48371e-14, - 1.82864e-14
sherpa-62> flux3=sample_flux(comp3,lo=0.3,hi=8.0,Xrays=True,correlated=True,num=100)
original model flux = 9.16655e-14, + 9.88112e-15, - 8.59698e-15
model component flux = 8.081e-14, + 1.56116e-14, - 1.635e-14
sherpa-63> flux4=sample_flux(comp4,lo=0.3,hi=8.0,Xrays=True,correlated=True,num=100)
original model flux = 9.28522e-14, + 6.74755e-15, - 8.60058e-15
model component flux = 7.1062e-14, + 1.37082e-14, - 1.18501e-14

Here's the problem: for observations 2, 3, and 4 the "original model flux" is different from the "model component flux". But there is only one model, so I would expect them to be the same (as is the case for flux1).
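To quantify the mismatch, compare the two flux values per observation (numbers copied from the sample_flux output above):

```python
# "original model flux" vs "model component flux", copied from the
# sample_flux transcript above
orig = [9.1421e-14, 9.34246e-14, 9.16655e-14, 9.28522e-14]
comp = [9.1421e-14, 1.01498e-13, 8.0810e-14, 7.1062e-14]

# percentage difference of the component flux relative to the model flux
diffs = [100 * (c - o) / o for o, c in zip(orig, comp)]
for i, d in enumerate(diffs, start=1):
    print(f"obs{i}: {d:+.1f}%")
```

Only the first observation agrees; the others differ by roughly 9 to 23 per cent.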

calc_energy_flux gives a number that is very close to the model component flux:

sherpa-71> calc_energy_flux(0.3,8.0,id=1)
9.4173293455141918e-14
sherpa-72> calc_energy_flux(0.3,8.0,id=2)
1.046753057933798e-13
sherpa-73> calc_energy_flux(0.3,8.0,id=3)
8.9064748825681024e-14
sherpa-74> calc_energy_flux(0.3,8.0,id=4)
7.7509429111129888e-14

I have a copy of the files locally in /data/lenin2/HelpDeskPub/21004

Based on this issue, I asked the user to load the datasets individually:

Since your models are all independent and all the datasets are independent (no linked parameters), as a work-around you should be able to fit each of the 4 spectra individually. [Since the models are independent, you should get statistically equivalent results fitting each dataset individually.]

which they reported worked correctly:

OK, when I do this, I get sensible answers. Something odd is happening with sample_flux and multiple datasets.

DougBurke commented 4 years ago

I think this also holds for sample_energy_flux and sample_photon_flux.

DougBurke commented 4 years ago

There are some usability issues to think about related to simultaneous fits that do not use the same model - e.g.

set_source(1, xsphabs.gal * powlaw1d.pl)
set_source(2, gal * pl * const1d.scale2)

which allows the normalization to vary (e.g. to account for cross-instrument calibration). It may just be a case of documentation, but I'm not completely convinced at this point.

DougBurke commented 4 years ago

I've added support for multiple datasets to sample_energy_flux and sample_photon_flux (which are used by sample_flux and so need to be updated) as part of #803. It should be easy to update sample_flux.

DougBurke commented 4 years ago

Actually, I don't believe the claim in @anetasie's initial report - that covar is only run on a single dataset - is correct. If you run with id=None then the fit and error analysis are done on all datasets (just as they would be if you called fit() and covar()).

DougBurke commented 4 years ago

With the code below, I fit multiple datasets, report the covariance matrix and the square root of its diagonal [since this gets used when sampling the errors later], and then use sample_energy_flux with id=None and id=1. This is with the master branch, which I have changed so that the errors used to calculate the samples are output (as "DBG: scales for random sampling:" lines). My conclusion is that sample_energy_flux (and hence sample_flux, which just calls sample_energy_flux) does use the results of a simultaneous fit as long as you keep id=None, which is the default setting.

The relevant results are:

a) the results from fit and covar for the multiple datasets

covariance 1-sigma (68.2689%) bounds:
   Param            Best-Fit  Lower Bound  Upper Bound
   -----            --------  -----------  -----------
   pl.gamma          1.64289    -0.139615     0.139615
   pl.ampl        1.3219e-06  -1.3009e-07   1.3009e-07
Fit covariance:
[[1.94922475e-02 1.01390251e-08]
 [1.01390251e-08 1.69232880e-14]]
As 1D errors:
[1.39614639e-01 1.30089538e-07]

So the covariance errors (ignoring correlation) are 0.14, 1.3e-7
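The 1D errors quoted above are just the square root of the diagonal of the covariance matrix, i.e. what the script below prints via np.sqrt(cov.diagonal()):

```python
import numpy as np

# covariance matrix from get_covar_results().extra_output, copied from above
cov = np.array([[1.94922475e-02, 1.01390251e-08],
                [1.01390251e-08, 1.69232880e-14]])

# 1-sigma errors, ignoring the off-diagonal (correlation) term
errs = np.sqrt(np.diag(cov))
print(errs)  # -> approximately [1.3961e-01, 1.3009e-07]
```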

b) the results from sample_energy_flux

We see the results from the id=None and then the id=1 calls. The id=None call mentions all three datasets and then uses the error values 0.14 and 1.3e-7 to create the samples. These are valid for the combined analysis.

Multiple datasets (id=None)
WARNING: data set 1 has associated backgrounds, but they have not been subtracted, nor have background models been set
WARNING: data set 3 has associated backgrounds, but they have not been subtracted, nor have background models been set
WARNING: data set 4 has associated backgrounds, but they have not been subtracted, nor have background models been set
DBG: scales for random sampling: [1.39614639e-01 1.30089538e-07]

The id=1 version only mentions a single dataset and uses significantly larger errors when creating the samples:

Single dataset (id=1)
WARNING: data set 1 has associated backgrounds, but they have not been subtracted, nor have background models been set
DBG: scales for random sampling: [2.55060289e-01 2.21343632e-07]

Note that if I use covar(1) to evaluate the errors for dataset 1 (using the best-fit for the combined dataset) we see the 0.255, 2.2e-7 errors that are reported in my debug lines:

>>> covar(1)
WARNING: data set 1 has associated backgrounds, but they have not been subtracted, nor have background models been set
Dataset               = 1
Confidence Method     = covariance
Iterative Fit Method  = None
Fitting Method        = levmar
Statistic             = cstat
covariance 1-sigma (68.2689%) bounds:
   Param            Best-Fit  Lower Bound  Upper Bound
   -----            --------  -----------  -----------
   pl.gamma          1.64289     -0.25506      0.25506
   pl.ampl        1.3219e-06 -2.21344e-07  2.21344e-07
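As a quick consistency check, the covar(1) bounds match the "DBG: scales" line from the id=1 run (numbers copied from the two outputs above):

```python
# covar(1) 1-sigma upper bounds vs the id=1 sampling scales, both
# copied from the output above
covar_bounds = [0.25506, 2.21344e-07]
dbg_scales = [2.55060289e-01, 2.21343632e-07]

for b, s in zip(covar_bounds, dbg_scales):
    # agreement to the precision printed by covar
    assert abs(b - s) / s < 1e-4
print("covar(1) bounds match the id=1 sampling scales")
```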

Here's the code

import numpy as np
from matplotlib import pyplot as plt

from sherpa.astro.ui import *

mdl = xswabs.gal * powlaw1d.pl

for did in range(1, 5):
    if did == 2:
        # skip 2 as it is different to the others
        continue

    load_pha(did, 'sherpa-test-data/sherpatest/obs{}.pi'.format(did))
    # subtract(did)
    set_source(did, mdl)

notice(0.5, 7)

set_stat('cstat')

gal.nh = 0.0393
gal.nh.freeze()

pl.gamma = 2.027
pl.ampl = 1.96e-4

fit()

covar()
cov = get_covar_results().extra_output
print("Fit covariance:")
print(cov)

print("As 1D errors:")
print(np.sqrt(cov.diagonal()))

n = 100

print("\nMultiple datasets (id=None)")
x1 = sample_energy_flux(lo=0.5, hi=7, id=None, num=n)

print("\nSingle dataset (id=1)")
x2 = sample_energy_flux(lo=0.5, hi=7, id=1, num=n)

DougBurke commented 4 years ago

This is not to say that we shouldn't add an otherids argument (as we do with other calls), so users can be explicit about what they want. I was just a bit surprised when I found out that id=None works as people would want it to.

anetasie commented 4 years ago

@DougBurke covar() always works on all data sets. I have to go back and check what exactly the issue I reported here was. However, there was some work on calculating errors for eqwidth() and perhaps things were changed at that time. Anyway, it is good that id=None does what it is supposed to do.

dtnguyen2 commented 4 years ago

I was under the impression that this issue was investigated in one of the L3 projects. As I remember it, I had suggested that one could test this by mixing the order of the datasets in the simultaneous fit (if covar only used the first data set then the results should change, etc.).