A customizable package for downloading and processing images from various astronomical surveys.
Install PostgreSQL (version 14) to run the database server:
sudo apt install postgresql libpq-dev
Download the code from GitHub:
git clone https://github.com/guynir42/AstroRetriever.git
To develop and contribute to this repo,
first make a fork
on the GitHub page, then clone your fork to your local machine
(i.e., replace guynir42
with your GitHub username).
Generate a virtual environment and install the required python packages:
cd AstroRetriever/
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
Virtual environments can be deactivated using deactivate.
To re-activate the environment, use source venv/bin/activate from inside the AstroRetriever directory.
This makes sure that the code uses the specific versions of the packages given by the requirements.txt file.
To make sure the package is working, either after installation or after making changes, run the unit tests:
pytest
The raw data folder should be defined using the environment variable RETRIEVER_DATA.
This folder should be on a drive with enough space to contain the raw data.
Internally, the DATA_ROOT variable in src/dataset.py can be modified
to temporarily save data to other places (e.g., for testing):
from src import dataset
dataset.DATA_ROOT = "/path/to/data/folder"
You may want to generate folders for catalogs and configs in the root AstroRetriever directory.
These folders may be generated automatically.
The goal of this project is to download, process, and store data (and retrieve it!) originating in various astronomical surveys.
The workflow is as follows: download raw data from one or more surveys, reduce it into usable datasets (e.g., lightcurves), process and analyze the reduced data, and store the resulting properties, detections, and histogram statistics.
Each of these steps uses parameters that are stored in a configuration file. Each step produces data products that are saved in a way that makes it easy to retrieve them from disk and continue the workflow without having to redo previous steps. We use several modes of persistence, and objects that lazy-load their content from disk when needed.
To simply download data from an observatory, initialize the appropriate virtual observatory object:
from src.tess import VirtualTESS
obs = VirtualTESS(...)
Give any required parameters to the constructor or directly to the obs.pars
object.
Then it is possible to get a source based on the source name in a catalog,
or based on the internal numbering, e.g., the TESS TIC ID number:
source = obs.download_by_tic(tic_number)
This will download the data for the source with the given TIC ID number.
If the observatory is connected to a catalog,
calling fetch_source will get the source based on
the catalog coordinates, and will also save that source
to the database and to disk, so repeated calls to fetch_source
will not download the same source twice.
cat_row = obs.catalog.get_row(index, "number", "dict")
source = obs.fetch_source(cat_row, save=True, reduce=True)
The cat_row
can be any dictionary that includes
the ra, dec and name of the source.
Using get_row
allows iterating over the catalog.
The reduce=True
option will also produce reduced data.
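For example, a short loop over part of the catalog (this sketch just combines the calls shown above, using the obs object defined earlier; the range of 100 entries is arbitrary):
for index in range(100):
    cat_row = obs.catalog.get_row(index, "number", "dict")
    source = obs.fetch_source(cat_row, save=True, reduce=True)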
By default, the observatory object is meant to download
only photometry (lightcurves). This can be modified
by changing the data_types parameter of the observatory.
To get all the sources in the catalog, call
obs.fetch_all_sources(start=0, stop=None, save=True, reduce=True)
This will iterate from the first catalog entry (start) to the last catalog entry (stop) and save any new data.
To also run analysis on these sources,
simply initialize a Project
and set it to run()
over
the catalog:
proj = Project(name='new_project', obs_names='TESS', catalog_kwargs={'default': 'WDs'})
proj.run()
See the examples below for more complete usage.
Examples for storing and retrieving data products:
- A Project object can load a configuration file and store the parameters in memory.
- A Catalog can read FITS or CSV files with a table of sources.
- Source, RawPhotometry, and Lightcurve objects are persisted in a postgres database.
- RawPhotometry objects keep the name and path of a file on disk (in HDF5 or FITS format), and load the data from disk when needed.
- A Histogram object is associated with a netCDF file on disk, which is loaded into a multidimensional xarray when needed.

Data that is saved to disk is automatically retrieved by AstroRetriever, so that each call to a method like run() on the project object, or fetch_all_sources() on the observatory object, will continue from where it left off, skipping the sources and files it has already completed.
Raw data is downloaded from separate surveys and is saved under the DATA_ROOT folder (by default this is given by the environment variable RETRIEVER_DATA).
Under that folder, data is saved in a sub-folder for each observatory
(observatory names are converted to upper case when used in folder names).
For example, saving data from ZTF will put the raw data files into the DATA_ROOT/ZTF folder.
Reduced data, as well as analysis results,
are saved in a separate folder for each project.
The project name is used (in upper case), as in DATA_ROOT/PROJECT_NAME.
If using version control, the hash of the final config file
is appended to that folder name, to make sure that the data is not overwritten
when the config file is changed,
e.g., DATA_ROOT/PROJECT_NAME_<sha256 hash of config file>.
The project folder will also contain the output config file
used in the analysis, saved as config.yaml.
Raw data files are saved in the DATA_ROOT/<OBSERVATORY> folder,
with subfolders and filenames that help find the data easily even without the
associated database objects.
By default, datasets for all sources in an integer right ascension (RA) degree bin
are saved in a sub-folder like RA012 (where the RA range is 12-13 degrees).
The filename itself combines the observatory name, the data type, and the source name, e.g., ZTF_photometry_J123.1-32.13.h5.
Note that we always save all raw data for a given source, from a given observatory, into a single file and a single key (HDF5 group). This makes it easier to keep track of the raw data, which is potentially re-used across multiple projects or versions. The data products derived from the raw data (e.g., reduced lightcurves) can have multiple keys/groups for the same source, as explained below.
Once some raw data is on disk, data products can be generated using the observatory object or the analysis object. The observatory object can be used to reduce the data, making it easier to work with and analyze. For example, raw photometry data can be reduced to lightcurves:
lcs = obs.reduce(source, data_type="photometry")
will produce some Lightcurve
objects,
possibly more than one lightcurve per raw dataset.
This is because the raw data may contain data from multiple filters,
or from different observing seasons/runs.
The details of how a RawPhotometry
object is reduced to a Lightcurve
object
are determined by the specific observatory object and reduction parameters.
When saving a Lightcurve object,
it would be placed, by default, into a folder named after the project (in upper case),
with a sub-folder and filename identical to those of the raw data file,
with a _reduced suffix appended before the file extension,
e.g., DATA_ROOT/TEST_PROJECT/RA123/DEMO_photometry_source456_reduced.h5.
This is because the reduced data depends on the details of the project,
and is not shared across projects.
If using version control, the hash of the final config file is appended to the folder name,
as explained below.
The file key for the lightcurve object would be the source name in the catalog,
appended with _reduction_XX_of_YY, where XX is the number of the reduction (starting at 1)
out of a total of YY reductions (e.g., J123.1-32.13_reduction_01_of_03).
The reduction numbers are ordered by the start time of observations of each of the lightcurves.
The way version control works is that
when running a project (using the run()
command),
the parameters for all different classes are collected
into an output config file, including changes applied by the user
through interactive or script commands.
The output config file will also contain
a hash of the git commit that was used,
so that even when running the same project
with the same parameters but with different code versions,
the final config hash will be different.
The output config file is saved in the project folder,
for future reference, and its sha256 hash
(the value is stored in the cfg_hash
attribute)
is used to tag the data products folder,
containing the reduced data and the analysis results.
To turn on version control, specify the parameter version_control: true
in the project's config.yaml file,
pass it to the Project constructor (Project(version_control=True)),
or set it directly on the parameters object: project.pars.version_control = True.
You can also use vc as shorthand for version_control.
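A minimal sketch collecting these options in one place (the project name is hypothetical):
from src.project import Project

proj = Project(name="new_project", version_control=True)  # constructor keyword
proj.pars.version_control = True  # or set directly on the parameters object
# the config file equivalent is `version_control: true` under the project key,
# and `vc` is accepted as a shorthand for `version_control`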
When disabled, the output folder will just be named
by the project name (in upper case),
and the content in it could be
outdated if the code/parameters were changed.
This is useful for exploratory analysis.
Note that when version control is off,
AstroRetriever
will quietly re-use
existing data products even if changes were made
to the code or parameters.
If version control is disabled,
the user is responsible for clearing
the old database rows and data folders when making substantial changes.
An important caveat is that raw data, downloaded directly from each survey, is not managed by version control, and is shared between projects and versions. This is because downloading raw data usually takes a long time, and is meant to be unaffected by the code or parameters chosen for different projects.
The module contains several classes.
The pipeline classes will have a Parameters
object so they can be initialized
and configured by user input or by a config file.
They generally do not save their internal state, and can be re-initialized
and quickly continue their download/analysis process from where they left off.
- Project: A top level object that handles the configuration parameters for downloading, saving and analyzing data.
Each project has a list of observatories, a catalog, and a unique project name that associates all database objects
with a particular project. After initializing a project with a given configuration, the object can then download
any files not yet saved to disk, load files from disk, reduce the data, and run analysis.
- Parameters: Keeps track of parameters from the user input or from a config file.
These objects are attributes of objects like Project, VirtualObservatory, or Analysis (among others),
so all the tunable parameters can be maintained for each parent object.
Each class that has parameters will usually add a subclass of the Parameters class,
where specific attributes are added using the add_par method. See for example the ParsProject class.
- Catalog: Stores a list of objects, each with a name, RA, Dec, magnitude, etc.
Has code to load such a list from CSV, FITS or HDF5 files, and optionally has code to download such files.
- VirtualObservatory: Contains methods to download and store images, lightcurves, or other datasets.
Subclasses of the VirtualObservatory class are used to download data from specific surveys.
An example subclass would be VirtualZTF, which downloads and reduces data from the Zwicky Transient Facility
(https://www.ztf.caltech.edu/).
The observatory objects are also used to reduce data (e.g., from raw photometry to usable lightcurves).
- Analysis: Runs a custom pipeline to reduce the data to smaller summary statistics, find interesting events, etc.
- QualityChecker: Scans the reduced lightcurves and checks that the data quality is good.
It can be used to flag bad data, or to remove bad data from the analysis. Generally each epoch in a lightcurve is flagged
using the qflag column, which is True for bad data, and False if the data is good at that time.
- Finder: Scans the reduced lightcurves and calculates scores like S/N that can be used to make detections.
The current implementation performs a simple peak detection in photometry data. Subclasses should implement more sophisticated
tools for setting the scores of each lightcurve, use more complicated detection algorithms (e.g., a matched filter),
or run analysis on other data types like spectra or images.
The combination of the QualityChecker and Finder operations on the reduced data (lightcurves or otherwise)
produces the "processed" data products. These are used for detections and for parameter estimation.
Both classes should be able to run multiple times on the same data (without memory of the previous runs),
such that they can be applied to the original data once, and to any simulated data with injected events
multiple times, to test the detection rates, etc.
- Simulator: Takes the processed data that is used by the Finder detection code and injects a simulated event into it.
The Analysis object automatically applies this "simulation mode", which produces Detection objects marked as "simulated".

The data classes are mapped to rows in the database. Each one contains some metadata, and some also contain paths to files on disk. Data class objects are how we store outputs from the pipeline and quickly query them later.
- Source: An entry for a single astronomical object. Includes a row from the catalog,
and is associated with lightcurves, images, etc., as well as analysis results
in the form of a Properties object and zero or more Detection objects.
Each Source can be given to an Analysis object, which appends a Properties object to the source,
and may find some Detection objects.
- DatasetMixin: A base class for various types of data, including images, lightcurves, and other data.
A DatasetMixin has the ability to save data to a file and later retrieve it.
All astronomical data classes have times (a vector of datetime objects) and mjds
(a vector of modified Julian dates) as attributes. They all have a filename and, if required,
will also keep an in-file key for files containing multiple entries (e.g., HDF5 files with groups).
- RawPhotometry: Used to store unprocessed photometric data. Inherits from DatasetMixin.
This class is mostly used to store filenames to allow data to be easily saved/loaded for future analysis.
There is one RawPhotometry object associated with each Source for each observatory.
Raw data is saved once per observatory, and can be re-used in different projects (if they share source names).
If the source has no data for an observatory, it must still have a RawPhotometry object,
with is_empty=True, and an empty dataframe on disk. This allows AstroRetriever to check if
a source has already been downloaded (even if no data was returned, it still counts as "downloaded").
- Lightcurve: A set of time vs. flux measurements for a single object. Inherits from DatasetMixin.
There are zero or more Lightcurve objects for each Source.
Each Lightcurve is for a single filter in a single survey, usually for a single observing season/run.
An Analysis applied to a Source can, for example, use the lightcurves as the input data.
There are one or more Lightcurve objects associated with each RawPhotometry object.
Lightcurve objects can be used as reduced, processed, or simulated datasets.
- Detection: A class for all sorts of detection objects. For example, a photometry-based search would produce
objects that store information about time-local events like transients.
Other detection objects may have attributes for wavelengths in spectra, for pixel coordinates in images, etc.
A Source can contain zero or more Detections.
- Properties: A class for storing summary statistics for a Source.
An Analysis applied to a Source will produce one such object for each source that was analyzed.
It may contain simple statistics like best S/N or more complicated results from, e.g., parameter
estimation methods. There are zero or one Properties objects associated with each Source.
If no Properties object is associated with a Source, that means it has not yet been analyzed.
- Histogram: Contains a multidimensional array (in xarray format) that saves statistics on the analysis.
This can be important to maintain a record of, e.g., how many epochs each source was observed,
so any detections (or null detections) can be translated into rates or upper limits on rates.
The multiple dimensions of these histograms are used to bin the data along different values
of the input data, like the source magnitude (i.e., having many epochs on faint sources may be less
useful than having a few epochs on a bright source).
NOTE: currently the Histogram object does not have an associated row in the database.
Instead, the different histograms are saved as netCDF files in the project output folder.
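Since the histograms are plain netCDF files, they can also be inspected directly with xarray; a minimal sketch (the file path and name here are hypothetical):
import xarray as xr

# open one of the histogram files saved in the project output folder
ds = xr.open_dataset("/path/to/dataroot/PROJECT_NAME/histograms.nc")
print(ds)  # shows the dimensions (e.g., source magnitude bins) and the counts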
The directory structure is as follows:
- src is the source code directory, containing all the python files.
- tests contains unit tests for the code.
- catalogs contains files with lists of astronomical objects.
- configs contains configuration files for various projects.
- results contains analysis results.

Important files in the src folder are:
- analysis.py, catalog.py, detection.py, finder.py, histogram.py, parameters.py, project.py, properties.py, simulator.py, and source.py.
- observatory.py contains the VirtualObservatory class, which is a base class for all survey-specific classes.
It also contains the VirtualDemoObs class, which is an example observatory that can be used for testing data reduction.
- database.py does not contain any classes, but is used to set up the database connection using SQLAlchemy.
- dataset.py contains the DatasetMixin class, which is a base class for all data types.
It also contains the RawPhotometry class, which is used to store filenames for data that has not yet been reduced,
and the Lightcurve class, which is used to store reduced photometry (lightcurves).
- quality.py contains the QualityChecker class, which is used in conjunction with the Finder class inside Analysis.
- utils.py contains some useful functions and classes.
- ztf.py contains the VirtualZTF class, which is a subclass of VirtualObservatory and is used to download data from ZTF and reduce the data into usable products.
- tess.py contains the VirtualTESS class, which is a subclass of VirtualObservatory, and is used to download data from the Transiting Exoplanet Survey Satellite (TESS).

Define a project and run the full pipeline:
from src.project import Project
proj = Project(
    name="default_test",  # must give a name to each project
    description="my project description",  # optional parameter example
    version_control=False,  # whether to use version control on products
    obs_names=["ZTF"],  # must give at least one observatory name
    # parameters to pass to the Analysis object:
    analysis_kwargs={
        "num_injections": 3,
        "finder_kwargs": {  # pass through Analysis into Finder
            "snr_threshold": 7.5,
        },
        "finder_module": "src.finder",  # can choose different module
        "finder_class": "Finder",  # can choose different class (e.g., MyFinder)
    },
    analysis_module="src.analysis",  # replace this with your code path
    analysis_class="Analysis",  # replace this with your class name (e.g., MyAnalysis)
    catalog_kwargs={"default": "WD"},  # load the default WD catalog
    # parameters to be passed into each observatory class
    obs_kwargs={"reducer": {"radius": 3, "gap": 40}},
    # specific instructions to be used by the ZTF observatory only
    ZTF={"credentials": {"username": "guy", "password": "12345"}},
)
# download all data for all sources in the catalog
# and reduce the data (skipping raw and reduced data already on file)
# and store the results as detection objects in the DB, along with
# detection stats in the form of histogram arrays.
proj.run()
This project will download a Gaia-based FITS file
with a list of white dwarfs (the default "WD" catalog)
and set up a ZTF virtual observatory with the given credentials.
The run()
method will then backfill data products
that are missing from the data folder.
For example, the project will not download raw data that has already
been downloaded, it will not reduce data that has already been reduced,
and will not produce analysis results for sources that have already been analyzed.
The results would be that each Source in the database
will have some reduced lightcurve files saved in the project
output folder, each one referenced by a Lightcurve object in the database.
Each Source
will also have a Properties
object in the database,
which marks it as having been analyzed.
Some Detection objects could be created in the database,
each linked to a specific Source.
Finally, there would be histogram files containing statistics on the
data that has been processed so far.
Optionally, there could be additional Lightcurve objects in the DB,
linking to files in the output folder, that contain processed data
and simulated data (which is the same as processed, but with injected events).
Input parameters can be saved in a configuration file
instead of being specified in the constructor.
To use such a file, use the special keywords
cfg_file to specify the file name,
and cfg_key to specify the key in the file
to load into that specific object (a project in this example).
Most classes will have a default key name already set up,
so it is usually ok not to specify the cfg_key parameter
(this is done using the Parameters.get_default_cfg_key() method,
which is defined for each subclass of Parameters).
For example, the above code can be written as:
proj = Project(name='project_name', cfg_file="project_name.cfg", cfg_key="project")
The cfg_file can be an absolute path; otherwise it should be
specified as a bare filename (without a relative path) and is assumed to be in the configs folder.
In addition, even if cfg_file and cfg_key are not specified,
the Parameters object of the project will try to find a file
named configs/<project name>.yaml and look for a project key in it.
Only if cfg_file is explicitly specified will the code raise an error when no such file exists.
If cfg_file=None, the default file will be searched for, but no exception is raised
if no such file exists (named after the project and stored in the configs folder).
To explicitly avoid loading a config file, use cfg_file=False.
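A short sketch of the different cfg_file options described above (the project name is hypothetical):
from src.project import Project

# explicit config file: raises an error if the file does not exist
proj = Project(name="new_project", cfg_file="new_project.cfg", cfg_key="project")

# default behavior: look for configs/new_project.yaml, but don't complain if it is missing
proj = Project(name="new_project", cfg_file=None)

# skip config file loading entirely
proj = Project(name="new_project", cfg_file=False)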
Note that to pass arguments to a sub-object (e.g., an Analysis inside a Project),
the user passes an analysis_kwargs dictionary to the Project constructor.
In the config file, these arguments should be specified under the analysis key,
not under the project key.
If, however, we want to load a different class (using custom code that inherits from Analysis),
the analysis_module and analysis_class parameters should be specified under the project key.
This is because the module/class must be given to the containing object to be able to construct the sub-objects.
Only after they are constructed can the sub-objects search for the appropriate key in the config file.
If mixing config file arguments and constructor arguments,
the constructor arguments override those in the config file.
If the user changes any of the attributes of the Parameters
objects,
after the objects are already constructed,
those new values will be used instead of the previous definitions.
Note that some classes may require some initialization after parameters are changed
(e.g., a Histogram
object must be initialized after its parameters are chosen).
It is recommended that the user inputs all parameters using one method: a config file, keyword arguments to the constructor, or setting attributes on the Parameters objects after construction.
The last method may require re-initialization of some objects.
The default analysis pipeline provided by AstroRetriever is rather limited.
It calculates the S/N and searches for single-epoch detections.
More complicated analysis can be used by changing the parameters or by adding custom code
as subclasses in which some methods are overridden.
As an example, initialize a project with an Analysis object, but change the default detection threshold:
proj = Project(
    name="default_test",  # must give a name to each project
    analysis_kwargs={"snr_threshold": 8.5},
)
This will create the default Analysis object from analysis.py.
The argument snr_threshold is stored in the Analysis object's pars object,
which is a ParsAnalysis object (a subclass of Parameters).
This, however, limits the analysis to code already in the Analysis class.
To subclass it, simply add a new file containing a new class (that inherits from Analysis
).
For example, create a new file my_analysis.py
with the following contents:
# extensions/my_analysis.py
from src.analysis import Analysis, ParsAnalysis  # ParsAnalysis is assumed to be importable from src.analysis


class ParsMyAnalysis(ParsAnalysis):
    """
    Parameters for the NewAnalysis class.
    Will have all the parameters already defined in ParsAnalysis,
    but adds another one as an example.
    """

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # unlock this object to allow new add_par calls
        self._enforce_no_new_attrs = False
        self.num_kernels = self.add_par("num_kernels", 10, int, "number of kernels")
        # re-lock this object so users cannot set attributes not already defined
        self._enforce_no_new_attrs = True


class NewAnalysis(Analysis):
    """A class to do matched-filter analysis using multiple kernels."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.pars = ParsMyAnalysis(**kwargs)

    # override this part of the analysis
    def analyze_photometry(self, source):
        """
        Run analysis on lightcurves, overriding
        the simple analysis code in the base class.
        This part will probably make use of the
        `self.pars.num_kernels` parameter.

        Because the rest of the Analysis class is the same,
        we don't need to worry about simulations,
        about making detection objects, etc.
        If the detection object needs to be modified,
        we should also override the `make_detection` method.
        """
        ...
The file in this example is put into the extensions
folder,
but it can be saved anywhere (including the src
folder).
Initialize the project with the new class:
proj = Project(
    name="default_test",  # must give a name to each project
    analysis_kwargs={"num_kernels": 20},
    analysis_module="extensions.my_analysis",
    analysis_class="NewAnalysis",
)
Use the default white dwarf (WD) catalog from https://ui.adsabs.harvard.edu/abs/2021MNRAS.508.3877G/abstract.
from src.catalog import Catalog
cat = Catalog(default="WD")
cat.load()
The load() method will download the catalog file if it is not already present,
and will read it from disk if it is not already loaded in memory.
To use one observatory to download the data (ZTF in this case):
proj = Project(name="default_test", obs_names=["ZTF"])
obs = proj.observatories["ZTF"] # can also use proj.observatories[0] or proj.ztf
obs.pars.num_threads_download = 0 # can use higher values for multi-threading
obs.fetch_all_sources(0, 1000, save=True) # grab first 1000 sources in catalog
# use save=False to only download the data and not save it (for debugging)
len(obs.sources) # should only contain 100 latest Source objects
len(obs.raw_data) # should only contain 100 latest RawPhotometry objects
This code will download all the data for the first 1000 sources in the catalog. If any of the raw data has already been downloaded, it will not be downloaded again.
Internally, the VirtualObservatory
superclass will
take care of database interaction and file creation,
possibly running multiple threads
(controlled by the pars.num_threads_download
parameter).
Inside this code is a function called download_from_observatory(),
an abstract method that is only implemented in observatory subclasses
(i.e., not on the base class).
This function gets a cat_row
dictionary with some
info on the specific source (name, RA/Dec, magnitude, etc.),
and additional keyword arguments,
and produces two values: data
(generally a pandas
dataframe)
and an altdata
dictionary with additional information about the data (metadata).
This function calls whatever specific web API is used by the
specific observatory to download the data.
Usually it is enough to only implement this function and the reduction function.
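For illustration, a minimal sketch of a new observatory subclass (the survey name, the placeholder data, and the exact method signature are assumptions; consult observatory.py for the real interface):
import pandas as pd
from src.observatory import VirtualObservatory

class VirtualMySurvey(VirtualObservatory):
    def download_from_observatory(self, cat_row, **kwargs):
        # cat_row is a dict with info like "name", "ra", "dec", "mag"
        # here you would call the survey's web API; placeholder data is returned instead
        data = pd.DataFrame({"mjd": [], "flux": [], "filter": []})
        altdata = {"source_name": cat_row["name"]}  # metadata about the download
        return data, altdata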
To reduce lightcurves, first define an observatory
with optional reducer
parameters.
from src.ztf import VirtualZTF
obs = VirtualZTF(
    name="ZTF",
    credentials={"username": "guy", "password": "12345"},
    reducer={"gap": 60},
)
Load the raw data and split it into lightcurves for different filters:
from src.dataset import RawPhotometry
data = RawPhotometry(filename="my_raw_data.h5")
data.load()
obs.pars.reducer = {'gap': 60}
lcs = obs.reduce(data, data_type='photometry')
This should produce a list of Lightcurve
objects, one for each filter/season.
In this case the gap
parameter is used to split the lightcurve into multiple
"seasons" if there is a gap of more than 60 days between epochs.
To override the methods that reduce the data,
it is usually enough to implement the reduce_photometry() method
(or the equivalent reduction method if the data is imaging or spectroscopy, etc.).
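A minimal sketch of such an override (the exact base-class signature of reduce_photometry() is an assumption; check the base class for the real one):
from src.ztf import VirtualZTF

class MyZTF(VirtualZTF):
    # note: the signature here is assumed, not taken from the code base
    def reduce_photometry(self, raw_data, **kwargs):
        lightcurves = []
        # split the raw dataframe into per-filter / per-season Lightcurve objects here
        ...
        return lightcurves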
Below are some examples for how to load and use the data stored in various ways across database and filesystem.
Database objects can be retrieved using SQLAlchemy queries.
Each database interaction is started using a session
object.
import sqlalchemy as sa
from src.database import Session
session = Session()
To get full objects (rather than tuples with specific columns), use session.scalars().
Inside the scalars
function, use the sa.select(Class)
method
to select from one of the tables of mapped objects
(mapped classes include Source
, RawPhotometry
, Lightcurve
, Detection
, etc).
Use the all()
or first()
methods to get all or the first object.
To filter the results use the where()
method on the select statement object.
sources = session.scalars(sa.select(Source)).all()
source = session.scalars(
    sa.select(Source).where(
        Source.name == 'J123.1-32.13'
    )
).first()
To add a new object to the database, or to delete an object from the database use:
session.add(new_object)
session.delete(object_to_delete)
Make sure that the data on disk is deleted first, e.g.,
data = session.scalars(
    sa.select(RawPhotometry).where(
        RawPhotometry.source_id == source.id
    )
).first()
data.delete_data_from_disk()
session.delete(data)
Finally, each transaction should end with a session.commit() to transfer the changes to the database.
If there are any errors, use session.rollback() to undo the changes.
When using the session interactively,
one should remember to call session.close()
when done.
When using the session in a script,
it is best to use the with
statement to ensure that the session is closed.
with Session() as session:
    source = session.scalars(
        sa.select(Source).where(
            Source.name == 'J123.1-32.13'
        )
    ).first()
    data = source.raw_photometry[0]
    lcs = obs.reduce(data, data_type='photometry')
    for lc in lcs:
        lc.save()
        session.add(lc)
    session.commit()
Use the src.database.SmartSession
inside a context manager
if you'd like the session to automatically close at the end.
The SmartSession can also receive another session or None
as an argument.
In the first case, it would leave the session open, so it can be closed by
the external scope. If None
, it will open and close within the context.
Also, pass False
to return a no-op session that is not connected to the database.
To disable all instances of database access that pass through the SmartSession
,
set the global src.database.NO_DB_SESSION
to True
.
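A short sketch of the modes described above (existing_session is a placeholder for a previously opened session; the query mirrors the examples above):
import sqlalchemy as sa
from src.database import SmartSession

# None: open a new session and close it automatically at the end of the block
with SmartSession(None) as session:
    sources = session.scalars(sa.select(Source)).all()

# an existing session is re-used and left open for the external scope to close
with SmartSession(existing_session) as session:
    ...

# False: a no-op session that never touches the database
with SmartSession(False) as session:
    ...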
Raw data and reduced data should always be associated
with a database object, e.g., a RawPhotometry
or Lightcurve
object.
Each of them has a get_fullname()
method that returns the full path
to the file on disk, and if relevant, also has a filekey
attribute
that keeps track of the in-file key for this dataset (e.g., in HDF5 files).
To load the data, simply use the load() method of the object,
or just refer to the data attribute of the object (it will lazy-load the data).
To save new data to disk use the save() method.
To save new data to disk use the save()
method.
If the data is associated with a source, the filename
will be generated using the source's right ascension,
and the file key will be the source name.
raw_data.save()
raw_data.filename # RA123/ZTF_photometry_J123.1-32.13.h5
raw_data.get_fullname() # /path/to/dataroot/ZTF/RA123/ZTF_photometry_J123.1-32.13.h5
raw_data.filekey # J123.1-32.13
lc.save()
lc.filename # RA123/ZTF_photometry_J123.1-32.13_reduced.h5
lc.get_fullname() # /path/to/dataroot/PROJ_NAME/RA123/ZTF_photometry_J123.1-32.13_reduced.h5
lc.filekey # J123.1-32.13_reduction_01_of_03
As long as the datasets are associated with a database object,
there is no need to know the file name, path or key.
Simply load the objects from the database and use the data attribute.
The dataset objects also have some plotting tools that can
be used to quickly visualize the data.
These will lazy load the data
attribute from disk.
with Session() as session:
    data = session.scalars(
        sa.select(RawPhotometry).where(
            RawPhotometry.source_name == "J123.1-32.13"
        )
    ).first()
    mag_array = data.data['mag']
    data.plot()
When running down the list of sources in a catalog, each one is first checked against the database to see if the data already exists. If the database object exists but the data is missing on disk, it will be re-downloaded or re-reduced and attached to the object. If the database object does not exist, the data is downloaded/reduced and a new object is added to the database.
The raw data is only downloaded once for all projects, assuming the different projects use the same name for the source. Other intermediate products are saved for each project separately.
In some cases, raw data remains on disk while the database is cleared.
If this happens, use the observatory's populate_sources method to
scan all the data files in DATA_ROOT and associate a source from the catalog
with each record on disk.
This may require some tweaking of the observatory parameters,
so that the file key matches the correct column in the catalog.
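For example (a sketch; populate_sources is assumed here to run without required arguments):
# re-associate files already on disk with catalog sources,
# e.g., after the database was cleared (call signature assumed)
obs.populate_sources()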
For each Source we use the Analysis object
to reduce the data to some summary statistics,
which are persisted and queryable for all sources.
The products of the Analysis include:
- Properties objects, one per source, that store the summary statistics
and let us know the source was already analyzed.
- Detection objects, which can be used to find real or simulated events.
- Histograms (using the Histogram class).
These are additive, to allow joining results from parallel sessions,
each outputting different Histogram objects.

A Simulator object can be used to inject simulated events into the data,
which produces simulated Detection objects. This allows testing of the
sensitivity of the analysis. These objects are saved along with the
real events but are marked as simulated using a hidden attribute.