Refactoring the instrument data handling

This tackles a subset of the points raised in #81 where we define the format of the instrument data to be fitted.

The aim is to ensure that the loading of the data and creation of the spectrum to be fitted is made modular in relation to the rest of the package. This will make the code easier to maintain and allow for easier interactions with instrumental data.

Discussed by @KriSun95 and @settwi (23/10/2023) while focussing on the "Observational data" section of the basic flow chart in the DASH 2023 Sunxspex poster (also shown here). sunxspex_basic_flow_chart

Defining basic information needed for fitting

The first point is to define what information is explicitly needed for the X-ray forward-fitting process, as well as the appropriate format and units. This basic and fundamental list includes:

counts
- 1D, label:counts, unit:counts
counts error
- _1D (same length as counts), label:counts_error, unit:counts_
count energy bin edges
- _1D, label:counts_energy edges, unit:keV_
photon energy bin edges
- _1D, label:photons_energy edges, unit:keV_
spectral response matrix
- _2D, label:spectral_response_matrix, unit:cm^2 count photon^-1_
effective exposure
- _scalar or 1D (same length as counts), label:effective_exposure, unit:s_

An instrument loader class will contain an event_data attribute with this base information (as a generic spectrum class, perhaps) in this format which will be passed to the fitting object. Additionally, if the user wishes to include a background spectrum, a background_data attribute will exist and will be of the same type as event_data but with the background data for the event spectrum instead.

The units will be assigned to the arrays using astropy.units and it is be the job of the specific instrument loader to either assign or convert the data to the appropriate units from the start. Note, the units will be stripped when passing to the fitter class.

All other information (e.g., rates, mid-point energies, etc.) will be calculated from the base information, will not be changeable without changing the base information, and be contained in the wider instrument loader class if not helpful during the fitting process.

Why have an instrument loader and generic spectrum class?

The instrument loader will contain all the information of the loaded data from a given observation (obtained via files or a dictionary-like object), this includes a generic spectrum class which contains only the necessary information for data-model comparison to be passed to the fitter class.

This means a user can still interact with the data that is loaded in anyway that the instrument loader will allow (e.g., plot the spectrum to determine energy fitting range, rebin the data, choose a different event time, etc.), while keeping the information needed for fitting simple and contained.

Therefore, an instrument class can be as complicated and unique as it needs to be for a given instrument and still work absolutely fine with everything (no slow down or moving unnecessary information about). The only thing that will be passed to the fitter object will be the generic spectrum class with the fundamental information needed for fitting under instrument_loader.event_data.

Defining a naming convention from the start

Moreover, a naming convention is defined for different rate-like values that may exist in the instrument class, and throughout the package, as

"something Rate" refers to something s^-1
"Differential something Rate" refers to something s^-1 keV^-1
"Differential something Flux" refers to something s^-1 keV^-1 cm^-2

(E.g., photon rate is photons s^-1, differential photon rate is photons s^-1 keV^-1, and differential photon flux is photons s^-1 keV^-1 cm^-2.

This avoids the confusion currently present in the code with the same names being used to represent different quantities (mentioned in #74).

Lastly, from where is the data sampled?

Thinking forward to the fitting, it is important to understand what fit statistics/likelihoods are appropriate for the data being loaded. However, it is not the instrument loader's responsibly to choose a specific fit statistic or likelihood for the later analysis, but it can be appropriate to describe the nature of the data the loader contains.

Therefore, this motivates the instrument loader class to contain information on from where the data is likely sampled. This then allows a warning, if needed, to be produced when a potentially inappropriate fit statistic or likelihood is chosen to optimise the fit over which can affect the result.

E.g., for STIX, it may not always be appropriate to fit using a statistic like c-stat if the error is not dominated by the Poissonian error.

Therefore, if a string exists in the instrument class like Gaussian, Poissonian, or other then this might help set a default statistic to be used for the loaded data and/or produce a warning if an ill-fitting statistic is selected to optimise/sampled during the fitting.

Moreover, the fit statistics and likelihoods may be separated into these regimes and we can just check if the one chosen is an instance of the appropriate category (or a similar check can be included).

E.g., RHESSI data is likely to be treated as Gaussian and so a warning should probably be produced if the user manually sets the fit statistic for fitting to be of a Poissonian nature like c-stat, cash, or poisson.

Additional information on the instrument refactor from today's discussion between @KriSun95 and @settwi (26/10/2023):

Instrument module structure

The instrument code has been moved to its own separate module (here). New files/structure in the instruments module are

base_instrument.py containing class BaseInstrument
- purpose is to create consistent interaction between different instrumental data for, e.g., spectral fitting.
base_spectral_response.py containing class BaseSpectralResponse
- create consistent interaction between different instrumental spectral response matrices for, e.g., spectral fitting.
spectrum_data.py containing class SpectrumData
- contains only the necessary information needed for spectral fitting and data-model comparison.

These three classes can then be used in a general way to create a loader for a given data source, for example

general_loader.py (a very rough example of this file lives here)
- IO/reader function
- SRM class inheriting from base_spectral_response.BaseSpectralResponse
- instrument loader class
- manages the SRM class (BaseSpectralResponse) and the spectrum data class which is a spectrum_data.SpectrumData instance
- for example, if the data is rebinned in count-space then the loader class will ensure the SRM and spectrum data classes remain consistent

For an instrument to work with the fitting code it should have an \<instrument>_loader.py file created with the defined API in mind (e.g., rhessi_loader.py, nustar_loader.py, etc.).

Additional items

Instrument loader should be able to show the loaded data.
- The scientist might want to know what they can do with the data.
Instrument loader should be able to alter the loaded spectrum (e.g., set the fitting range, etc.).
- The fitting process can be helped a lot by this.
Instrument class should generally make it easy to include additional error sources.

Future of the instruments module

The instrument specific stuff may be moved elsewhere in the future if/when a more suitable home is found, potentially on an individual basis. However, detailed work considering specific instruments and their data, like that discussed so far, is needed in order to set a robust and practical API that can be quickly and easily tested in realistic situations.

Additionally the loaded data in the Instrument should have its time bins accessible. That way the user can make a spectrogram in 1-2 lines of code should they wish.

It quite difficult to collaborate on this as an issue perhaps take the content of the issue and add a new section of the docs for design that we can comment on individual parts/lines or make a PR or both even as the design/documentation and implementation may go at different paces.

Really really like the idea of the defined names for the various types of quantities was struggling with this myself in stixpy may use the above there there too! Is the nomenclature used elsewhere of did you invent it?

Looking at the code and text above I think some these are implementation details vs the specification of the API. For example given the counts_energy_edges or spectral_edges the bin widths are defined. From an API point of view dos the API require that both edges and widths or only edges. In this specific case we could be flexible and if only the edges are given calculate the widths and if both supplied use the given widths but then run the risk of the given widths not matching the edges maybe there's a use case for that?

Something that was discussed before should the API support non-contiguous data e.g. given edges 1, 2, 3, 4 but only have data for bins 1-2 and 3-4? If yes then how should the implementation work two separate Spectrum objects for 1-2 and 3-4, use nans/masked with current array, ...?

I think at least in the base objects it may be beneficial to be more general in terms of the names/data handled but also move things which are not the data out of Spectrum.

Spectrum:
    counts -> data 
    counts_error -> error/uncertainty
    counts_energy_edges -> spectral_edges
    counts_energy_widths -> Remove is it not defined by counts_energy_edges/spectral_edges
    photons_energy_edges -> Move goes with model/SRM/fitting
    spectral_response_matrix -> Move goes in SpectrometerResponse / InstrumentResponse
    effective_exposure -> exposure/integration or remove depends on the units of the data

CountSpectrum(Spectrum):  #doesn't have to be inheritance can do same via composition
    effective_exposure/livetime
    effective_exposure
    area?
    responce = SpectralResponce/InstrumentReponce

In the linked code in the __init__ for Spectrum units are being assigned to the counts, errors, edges etc I think this could be dangerous and also not really the job of the Spectrum object. It could/should certainly check the validity of the units of the data passed in but not convinced changing/adding units is a good idea. Also for the error I could see passing in a scalar value as possibility?

I don't have a good answer but I'm not sure attaching the response to the Spectrum is the best idea. It breaks the single responsibly idea and also turns what was a data container into something more. I can see the usefulness but perhaps making a different object which combines and Spectrum and Spectral/Instrument response or moving it to the model side could be more flexible?

The simplest use case I could think of was fitting a model to data where the two are already in the same units maybe radio data for example.

flowchart TD
    A[Spectrum] --> B
    C[Model] --> B
    B[Fitter]

The next simplest use case is the standard use case fitting count data to a physical model via forward fitting where the conversion from physical units to counts if fixed e.g. SRM

flowchart TD
    A[CountSpectrum] --> B
    subgraph Fittable Paramters
    C[Model] --> B
    end
    D[SpectraResponce] --> B
    B[Fitter]

Finally the most complex use case I could think of is one where the SRM needs to be included the fits e.g a rate dependent DRM or unknown transmission info ....

flowchart TD
    A[CountSpectrum] --> B
    subgraph Fittable Paramters
    C[Model] --> B
    D[SpectraResponce] --> B
    end
    B[Fitter]

flowchart TD
    A[Data]--> B[SpectralResponce]
    A --> C[CountSpectrum]
    C --> D[Fitter]
    subgraph Fittable Parameters
    E[PhysicalModel] --> G{ForwardModel}
    B --> G
    end
    G --> D

What's not indicated on the diagrams and I haven't thought about enough is that you can have various mappings between the data + response and model/s fit all data to a single model, have pairs of data + models, fits each data + response par to different models (can't really see a use case for this)

Sorry for long and not very well organised comment

@samaloney thanks so much for taking the time to run through what's been written. These are really good points and I'm glad the discussion is opening up!

It quite difficult to collaborate on this as an issue perhaps take the content of the issue and add a new section of the docs for design that we can comment on individual parts/lines or make a PR or both even as the design/documentation and implementation may go at different paces.

I agree, it probably would be best to open a PR and create a design documentation section. The main goal here, initially, was to put the discussion somewhere and make it as public as possible. I'll work on moving this to a PR soon.

Really really like the idea of the defined names for the various types of quantities was struggling with this myself in stixpy may use the above there there too! Is the nomenclature used elsewhere of did you invent it?

The nomenclature was just decided after discussing a few different options. Mainly we wanted to assign a word for each unit, i.e., differential is keV^-1, rate is s^-1, and flux is s^-1 cm^-1, to make it as clear and consistent as possible. This is definitely open for discussion though, if we find something that makes more sense then I'm all for that. I'm trying to avoid making my previous mistake of forgetting what I called things and changing halfway through writing the code. Hopefully defining at the start will help this time.

Looking at the code and text above I think some these are implementation details vs the specification of the API. For example given the counts_energy_edges or spectral_edges the bin widths are defined. From an API point of view dos the API require that both edges and widths or only edges. In this specific case we could be flexible and if only the edges are given calculate the widths and if both supplied use the given widths but then run the risk of the given widths not matching the edges maybe there's a use case for that?

When it comes to the names of the data arrays (counts_energy_edges, etc.), this can also be up for debate (re: your later suggestion). It can definitely be changed such that the user can give different bin widths if they like otherwise calculate them from the edges. I can't really think of a case for providing different widths but providing the option wouldn't be difficult in the slightest and it might be better to have that available than to remove the option entirely.

Something that was discussed before should the API support non-contiguous data e.g. given edges 1, 2, 3, 4 but only have data for bins 1-2 and 3-4? If yes then how should the implementation work two separate Spectrum objects for 1-2 and 3-4, use nans/masked with current array, ...?

Yes. This might be a wider conversation since it will feed into fitting over different/non-contiguous energy ranges as well as instruments potentially having data-bins to skip over. I don't think this would be difficult to implement on the data and fitting side of things, the model side might be different. The way I have done this so far is just to index the obs. count and SRM array (contiguous or otherwise) to the chosen bins before comparing comparing to the indexed model array every iteration but I imagine that's more expensive than making the models handle bin gaps from the beginning.

I think at least in the base objects it may be beneficial to be more general in terms of the names/data handled but also move things which are not the data out of Spectrum.

We did have a discussion on the names and we landed on these for a few reasons but definitely can/should be discussed. With regards to what ends up being passed to the fitting code, this could probably take a few different forms. First, we wanted to completely standardise the information being passed from the instrument loader to the fitter so both can take any form that's needed. This meant we broke down the fitting process and only wanted to move over the information that was needed for that single spectrum to be fitted, removing everything else (this can have a big effect on the speed of the code, e.g. in parallelisation, which did surprised me).

Passing the spectral response is tricky since it depends heavily on the data being fitted but, like you've said, could get horrendously complicated if it also depends each iteration of the model. At the minute, the main thing we wanted to do was have the response handled in class of its own. Whether this class gets passed or just the matrix itself might depend.

In the linked code in the init for Spectrum units are being assigned to the counts, errors, edges etc I think this could be dangerous and also not really the job of the Spectrum object. It could/should certainly check the validity of the units of the data passed in but not convinced changing/adding units is a good idea. Also for the error I could see passing in a scalar value as possibility?

Again this is a good point to discuss. The reason why we've tried to stipulate some restrictions on the units here is because the units are arguably one of the easiest parts to get wrong of the whole fitting process. One way this could be made easier is to have astropy handle all the unit conversions and checks; however, a reason we've decided to make sure we're in keV for the spectrum is because an instrument's spectrum might be in hertz or angstrom (for whatever reason) which is easy for a user to convert appropriately to keV, if needed, whereas astropy just throws an error since they're different types of units.

Maybe there are a few solutions to this when thinking of the whole package: robustly check units of the data and ensure we can convert between all units and make sure the data is ready to work with the defined model units; make sure all the models can compute their outputs in a number of different units or are converted to the ones needed for the data; or stipulate the units needed at each interface between all key parts of the package.

Probably should go for the first one but the last one might not be the worst starting point. A route could be to check it works with the defined units then make it more general?

I don't have a good answer but I'm not sure attaching the response to the Spectrum is the best idea. It breaks the single responsibly idea and also turns what was a data container into something more. I can see the usefulness but perhaps making a different object which combines and Spectrum and Spectral/Instrument response or moving it to the model side could be more flexible? Plus diagrams.

I think what you've said could work and would be easily done, or started at least. This was one motivation behind having the SRM being given room for its own class instead of just being treated as an array all the time. The instrument loader class should be able to interact with the SRM at all times since it will most likely need changed if the data is edited/changed. Although, once it is passed to the fitter then it can take its instructions from the model side with regards to its photon axis.

What's not indicated on the diagrams and I haven't thought about enough is that you can have various mappings between the data + response and model/s fit all data to a single model, have pairs of data + models, fits each data + response par to different models (can't really see a use case for this)

Do you mean things like simultaneous fitting when different units are involved? If so, this stuff should be handled in the details of the fitter object with the setup definitely being the most complicated part. Once everything is setup properly (stuff in consistent units, free model parameters being set, etc.) this only comes down to what is being optimised/sampled/etc which shouldn't be too hard.

Obviously, there is a lot of this stuff that relates to the package as a whole and not just the instruments section/API. I'm hoping with starting the refactor here we can get a solid foundation on what the data we're actually using looks like and get the ball rolling with the refactor as a package-wide thing. The instrument code can be edited as we go along with the package refactor before settling on a final form and merging to incorporate any decision changes etc.

Do you think it would be better to have one large PR dealing with the refactor as a whole or a few individual PRs dealing with different sections of the code (instruments, models, statistics, fitter, etc.) which can then be merged about the same time when it's done?

Not sure which issue we're using for this discussion. But see my summary of our discussion at the STIX meeting and my current suggested approach here on issue #81

sunpy / sunkit-spex