aarchiba commented 1 year ago

This PR allows Enterprise to write Pulsar objects out to HDF5 files in a well-documented format, and to read them back into a new FilePulsar (name suggestions welcome) which should be a drop-in replacement for either PintPulsar or T2Pulsar. The format is flexible enough that downstream users can add their own information in the file (and have these extras included in the documentation); these files can be loaded without needing to understand these extras.

The HDF5 format includes compression for all large-ish entries (in particular the mostly-zero DMX derivatives); the example B1855 data set comes out to about 1.2 MB.
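As a rough illustration of why mostly-zero DMX derivative columns compress so well, here is a minimal sketch using only Python's standard library. It does not use the PR's code or HDF5 itself: it assumes a DEFLATE-style filter (HDF5's gzip filter and Python's `zlib` both use DEFLATE), and the array sizes and values are made up.

```python
import array
import zlib

N_TOAS = 5000   # hypothetical number of TOAs
N_DMX = 100     # hypothetical number of DMX bins
PER_BIN = N_TOAS // N_DMX

# Start from an all-zero float64 buffer holding N_DMX columns of N_TOAS values.
matrix = array.array("d", bytes(8 * N_TOAS * N_DMX))

# Each DMX column is nonzero only for the TOAs that fall inside its own bin
# (the nonzero values are invented for illustration).
for j in range(N_DMX):
    for i in range(j * PER_BIN, (j + 1) * PER_BIN):
        matrix[j * N_TOAS + i] = 1.0e-3 * (1 + i % 7)

raw = matrix.tobytes()
packed = zlib.compress(raw, 6)
ratio = len(raw) / len(packed)
print(f"{len(raw)} bytes -> {len(packed)} bytes (about {ratio:.0f}x smaller)")

# The compression is lossless: the original bytes come back exactly.
assert zlib.decompress(packed) == raw
```

Real design matrices are less regular than this toy one, so the ratio achieved in practice will differ.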

To do:

- [x] Test likelihood computation or other non-trivial use of FilePulsar objects
- [x] Add saving par and tim files for T2Pulsar
- [x] Determine whether any additional data from T2Pulsar should be included
- [x] Determine what to do with unrecognized entries (should they go into a dictionary or something so the user doesn't need to poke around in the HDF5 file themselves? what if they are huge? maybe we should accept open HDF5 files as well as filenames?)
- [x] Test the file format machinery by using it to produce derivative files for the NANOGrav 15yr data set
- [x] Determine whether there are additional things that should go in the file even if they aren't needed for Enterprise
- [x] Write appropriate documentation

Note: this PR depends on #340 ; many of the apparent changes here are drawn from that, and this PR may well merge in any changes that PR needs.

The file format allows extensions for project-specific information. Here is what the current description file for the NANOGrav 15-year v1.1 data set looks like (it is Markdown; GitHub uses a weird flavour of Markdown that preserves line breaks, whereas normal viewers reflow paragraphs):

NANOGrav 15-year data release derivative data

The NANOGrav project is releasing its 15-year data set; this is described in an upcoming paper, but it includes long-term timing data for 68 pulsars.

Pulsar timing begins with a set of pulse arrival times and fits a model to those arrival times. The usual output from this process is the best-fit model parameters and their uncertainties, and the residuals - the difference in time or phase between the predicted zero phase and the observed zero phase.

For some applications, for example searching for a gravitational-wave background, it is vital to include not just these residuals but their derivative with respect to each of the fit parameters. This allows construction of a linearized version of the timing model, which can often be analytically marginalized, resulting in tremendous speedups. Other applications for such linearized models include parameter searches in photon data.
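The linearization and marginalization can be sketched as follows. The symbols are notation assumed here rather than anything defined by the file format: $r$ is the residual vector at the best-fit parameters $p_0$, $M$ is the matrix of derivatives, $\delta p$ is a small parameter offset, and $C$ is a Gaussian noise covariance; a flat prior on $\delta p$ is also assumed.

```latex
% Linearize the residuals around the best-fit parameters p_0
r(p_0 + \delta p) \approx r + M\,\delta p,
\qquad
r \equiv r(p_0),
\qquad
M_{ij} = \left.\frac{\partial r_i}{\partial p_j}\right|_{p_0}

% Integrate the Gaussian likelihood over \delta p (flat prior)
\int \exp\!\left[-\tfrac{1}{2}\,(r + M\,\delta p)^{T} C^{-1} (r + M\,\delta p)\right] \mathrm{d}(\delta p)
\;\propto\;
\det\!\left(M^{T} C^{-1} M\right)^{-1/2}
\exp\!\left[-\tfrac{1}{2}\, r^{T}
\left(C^{-1} - C^{-1} M \left(M^{T} C^{-1} M\right)^{-1} M^{T} C^{-1}\right) r\right]
```

Because the integral has this closed form, the timing-model offsets never need to be sampled explicitly, which is where the speedup comes from.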

The purpose of this file is to provide the derivatives needed to construct this linear model, plus all other supporting data. It is stored in HDF5, a widely portable binary format that is extensible enough to permit project-specific information to be stored alongside standard values.

This text should accompany a collection of such files in plain-text form, and it should also be included in all such files as a dataset called "README".

This data

Timing results as of: 2022-03-14 21:56:13 +0000

Git hash: 78afc7978e267ae9d11ab5daf57e6438a56c528b

Generated: 2023-02-28 10:10:43

Generated by: Anne Archibald <Anne.Archibald@nanograv.org>

File contents

format_name (attribute, optional, constant value='derivative_file')
The name of this particular HDF5 format.
format_version (attribute, optional, constant value='0.6.0')
Version number indicating the compatibility of this file with other readers of this format.
Name (dataset)
Pulsar name.
RAJ (dataset, units=rad)
Right ascension in the Julian system. In radians.
DECJ (dataset, units=rad)
Declination in the Julian system. In radians.
DM (dataset, units=pc/cm3)
Best-fit dispersion measure, in pc/cm^3.
Estimated distance (dataset, units=kpc)
Estimated distance and uncertainty in kiloparsecs.
TOA integer part (dataset, optional, units=day)
This is the exact TOA, converted to TDB (barycentric dynamical time) but not corrected for travel time in any way. In order to retain nanosecond accuracy, this is split into two arrays: the integer and the fractional parts of the MJD. This dataset contains the integer part.
TOA fractional part (dataset, optional, units=day)
This is the exact TOA, converted to TDB (barycentric dynamical time) but not corrected for travel time in any way. In order to retain nanosecond accuracy, this is split into two arrays: the integer and the fractional parts of the MJD. This dataset contains the fractional part.
TOAs in seconds (dataset, units=s)
Pulse time-of-arrival data, in seconds (that is, the Modified Julian Date multiplied by 86400). These values are barycentered, that is, converted to times that the pulses would have reached the solar system barycenter. (This depends on the pulsar sky position.) Note that this array has only about microsecond resolution and so is insufficient to do precision timing.
Raw TOAs in seconds (dataset, units=s)
TOAs at the observatory; this is corrected for observatory clock drift but not converted to any other time system or adjusted to when the pulses would have reached the solar system barycenter. This has also been converted to seconds, that is, the Modified Julian Date has been multiplied by 86400. This array too has only about microsecond precision.
TOA uncertainties (dataset, units=s)
Uncertainties on pulse time-of-arrival data (and thus on residuals), in seconds.
Residuals (dataset, units=s)
Residuals (model minus data, in seconds).
Radio frequencies (dataset, units=MHz)
Radio frequency at which each TOA is observed, in MHz. This frequency is corrected for Doppler shift due to the observatory's motion around the Sun.
Telescope names (dataset)
The name of the telescope at which each TOA was observed. These names are PINT- (or TEMPO2-)style telescope names (for example arecibo).
Fit parameters (dataset)
Fitted parameters.
Design matrix (dataset)
Design matrix. This is an array that is (number of TOAs) by (number of fit parameters). Each column is the derivative of the residual (in seconds) with respect to the corresponding fit parameter. This dataset has an attribute labels that indicates the labels of the design matrix entries (which will be identical to the fit parameters) and units giving the units of the design matrix entries. These units are stored in Astropy's "generic" string format for units, which is based on that used in FITS files.
Set parameters (dataset)
Parameters of the timing model that were fixed during fitting. Not all of these even have numeric values.
Par file (dataset, optional)
A .par file describing the timing model, as a string. This can be quite long if the model has many DMX parameters. The value is stored as an array of UTF-8 byte strings, one per line.
Tim file (dataset, optional)
A .tim file recording the full TOA information. This is in the form of an array of strings (UTF-8 encoded), one per line. The file is in TEMPO2 format, so will normally contain more lines than there are TOAs.
Pulsar sky position (dataset)
Unit vector pointing to the pulsar's sky position, in equatorial coordinates.
Pulsar sky position as a function of time (dataset)
Unit vector pointing to the pulsar's sky position, in equatorial coordinates, as a function of time (three values per TOA).
Sun positions (dataset, units=ls)
Sun positions (and possibly velocities) relative to the solar system barycenter, in light-seconds. This array will be (number of TOAs) by 6. If the Sun velocities are unavailable they will be set to zero.
Planet positions (dataset, optional, units=ls)
Planet positions (and possibly velocities) relative to the solar system barycenter, in light-seconds. This array will be (number of TOAs) by 9 by 6. The planets are in order outward from the Sun, including Pluto. If not all planet positions or velocities are available, the unknown entries will contain NaNs. PINT generally computes only positions and only for the Earth, Jupiter, Saturn, Uranus, and Neptune.
DMX (dataset)
DMX information. This describes a time-variable dispersion measure to the pulsar using a piecewise-constant model. Each piece covers a specified range of TOA times and specifies a delta-DM that should be added to the pulsar's overall DM value within the corresponding time interval. This will be recorded in the HDF5 file as a group, with a sub-group for each DMX piece; the relevant values are recorded as attributes of this sub-group.
Flags (dataset)
Flags associated with TOAs. The tempo2 format allows a flexible list of flags to be associated with each TOA; these often record details like the observing frontend and backend. There is a list of flags recommended by the International Pulsar Timing Array. This entry is an HDF5 group, which contains an HDF5 dataset for each flag that occurs in the file; the dataset contains UTF-8-encoded string values for that flag for each TOA.
yaml (attribute, optional)
Name of configuration file (in yaml format) used to generate this data.
git_hash (attribute, constant value='78afc7978e267ae9d11ab5daf57e6438a56c528b')
Hash selecting specific version of the git repository (including configurations and par files) used to generate the data.
git_date (attribute, constant value='2022-03-14 21:56:13 +0000')
Last modification date of the git repository (including configurations and par files) used to generate the data.
generated_date (attribute, constant value='2023-02-28 10:10:43 ')
Date this file was generated.
generated_by (attribute, constant value='Anne Archibald <Anne.Archibald@nanograv.org>')
Person who generated this file (if not automatic).
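
A few of the entries above can be made concrete with a short sketch in plain Python (3.9 or later, for `math.ulp`). It does not read an HDF5 file. The function names are hypothetical, the numbers are made up, and the treatment of DMX range ends (inclusive, with a fallback to the base DM outside every piece) is an assumption rather than something this format specifies. It shows why the TOA is split into integer and fractional parts, how a piecewise-constant DMX model is evaluated, and how a sky-position unit vector follows from RAJ and DECJ.

```python
import math

SECONDS_PER_DAY = 86400.0


def resolution_seconds(value, unit_in_seconds):
    """Gap between adjacent float64 values near `value`, expressed in seconds."""
    return math.ulp(value) * unit_in_seconds


def dm_at(mjd, base_dm, dmx_pieces):
    """Piecewise-constant DM: base DM plus the delta-DM of the piece containing mjd.

    dmx_pieces is a list of (start_mjd, end_mjd, delta_dm) tuples. Inclusive
    range ends, and falling back to the base DM outside every piece, are
    assumptions of this sketch.
    """
    for start, end, delta_dm in dmx_pieces:
        if start <= mjd <= end:
            return base_dm + delta_dm
    return base_dm


def sky_unit_vector(raj, decj):
    """Equatorial unit vector from right ascension and declination in radians."""
    return (
        math.cos(decj) * math.cos(raj),
        math.cos(decj) * math.sin(raj),
        math.sin(decj),
    )


# A made-up TOA, split the way the two optional TOA datasets store it.
toa_int, toa_frac = 59000.0, 0.123456789

single_mjd = resolution_seconds(toa_int + toa_frac, SECONDS_PER_DAY)
in_seconds = resolution_seconds((toa_int + toa_frac) * SECONDS_PER_DAY, 1.0)
split_frac = resolution_seconds(toa_frac, SECONDS_PER_DAY)
print(f"one float64 MJD:        {single_mjd:.2e} s")
print(f"one float64 in seconds: {in_seconds:.2e} s")
print(f"fractional part alone:  {split_frac:.2e} s")

# Made-up DMX pieces: (start MJD, end MJD, delta-DM in pc/cm^3).
pieces = [(58990.0, 58995.5, -2.0e-4), (58996.0, 59002.0, 3.0e-4)]
print("DM at the TOA:", dm_at(toa_int + toa_frac, 13.3, pieces))

print("unit vector:", sky_unit_vector(math.radians(284.0), math.radians(9.7)))
```

With these numbers a single float64 MJD, or the same time in seconds, resolves only somewhat under a microsecond, while the fractional part on its own resolves far below a nanosecond.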

Referencing

If you do use this data for something, please reference both the NANOGrav 15-year data release paper and the DOI for this data set.

codecov[bot] commented 1 year ago

Codecov Report

Merging #341 (26b53b8) into master (5ef5ff4) will increase coverage by 0.48%. The diff coverage is 91.64%.

Additional details and impacted files

[![Impacted file tree graph](https://codecov.io/gh/nanograv/enterprise/pull/341/graphs/tree.svg?width=650&height=150&src=pr&token=7Sjk8cLA85&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nanograv)](https://codecov.io/gh/nanograv/enterprise/pull/341?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nanograv)

```diff
@@            Coverage Diff             @@
##           master     #341      +/-   ##
==========================================
+ Coverage   88.37%   88.86%   +0.48%
==========================================
  Files          13       15       +2
  Lines        3012     3350     +338
==========================================
+ Hits         2662     2977     +315
- Misses        350      373      +23
```

| [Impacted Files](https://codecov.io/gh/nanograv/enterprise/pull/341?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nanograv) | Coverage Δ | |
|---|---|---|
| [enterprise/pulsar.py](https://codecov.io/gh/nanograv/enterprise/pull/341?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nanograv#diff-ZW50ZXJwcmlzZS9wdWxzYXIucHk=) | `92.08% <85.88%> (+0.02%)` | :arrow_up: |
| [enterprise/derivative\_file.py](https://codecov.io/gh/nanograv/enterprise/pull/341?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nanograv#diff-ZW50ZXJwcmlzZS9kZXJpdmF0aXZlX2ZpbGUucHk=) | `88.75% <88.75%> (ø)` | |
| [enterprise/h5format.py](https://codecov.io/gh/nanograv/enterprise/pull/341?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nanograv#diff-ZW50ZXJwcmlzZS9oNWZvcm1hdC5weQ==) | `95.14% <95.14%> (ø)` | |

------

[Continue to review full report at Codecov](https://codecov.io/gh/nanograv/enterprise/pull/341?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nanograv).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nanograv)
> `Δ = absolute (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/nanograv/enterprise/pull/341?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nanograv). Last update [5ef5ff4...26b53b8](https://codecov.io/gh/nanograv/enterprise/pull/341?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nanograv). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nanograv).

aarchiba commented 1 year ago

I'm not quite sure what documentation is expected for Enterprise code.

paulthebaker commented 1 year ago

> I'm not quite sure what documentation is expected for Enterprise code.

Docstrings for all new things is the best way to do it. But there is a ton of existing code that doesn't yet have docstrings...

Certainly, any function/method that is user facing should have a docstring. For something that is more for internal uses and is pretty clear from the code, I wouldn't fret too much about it.

Something like write_dict_to_hdf5 probably doesn't need a docstring, but if you wanted to add one I wouldn't stop you.

vhaasteren commented 1 year ago

@aarchiba, with an HDF5 file format, it seems pretty straightforward to also allow creation of mock pulsar objects. The reason why you created the HDF5 pulsar class is that we sometimes need independence from PINT or Tempo2. This is even more so with simulations.

The other day I needed to generate an array of 10k pulsars. With tempo2 or PINT that would take up too much memory and would take ages. So I wrote this MockPulsar Enterprise class that is super efficient and still allows all of the many Enterprise models that I needed to test to be run: most Enterprise functions only require things defined in BasePulsar.

The HDF5 pulsar format seems like the best place to put this kind of functionality, and then it can be saved in a proper file format. Do you see any hiccups right away?

vhaasteren commented 1 year ago

The code in this PR has been converted to a separate package by @AaronDJohnson and myself, which can be found here. There will be a new Enterprise PR that needs to be merged in order for that package to work. Until that is ready, the branch can be found on my repo.

I am closing this PR.

nanograv / enterprise

Read and write Pulsar objects to HDF5 files #341
