nanograv / enterprise

ENTERPRISE (Enhanced Numerical Toolbox Enabling a Robust PulsaR Inference SuitE) is a pulsar timing analysis code, aimed at noise analysis, gravitational-wave searches, and timing model analysis.
https://enterprise.readthedocs.io
MIT License

Read and write Pulsar objects to HDF5 files #341

Closed aarchiba closed 1 year ago

aarchiba commented 1 year ago

This PR allows Enterprise to write Pulsar objects out to HDF5 files in a well-documented format, and to read them back into a new FilePulsar (name suggestions welcome) which should be a drop-in replacement for either PintPulsar or T2Pulsar. The format is flexible enough that downstream users can add their own information in the file (and have these extras included in the documentation); these files can be loaded without needing to understand these extras.

The HDF5 format includes compression for all large-ish entries (in particular the mostly-zero DMX derivatives); the example B1855 data set comes out to about 1.2 MB.
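As a rough illustration of why compression pays off for mostly-zero entries, here is a minimal sketch using h5py. The dataset name `design_matrix`, the helper `write_compressed`, and the toy matrix are assumptions made for this example; they are not the layout this PR writes.

```python
import h5py
import numpy as np


def write_compressed(h5file, name, matrix):
    """Store a 2-D array with HDF5's gzip and shuffle filters enabled."""
    return h5file.create_dataset(name, data=matrix, compression="gzip", shuffle=True)


# Toy stand-in for DMX derivatives: each TOA falls in exactly one DMX bin,
# so every row has a single nonzero entry.
n_toas, n_bins = 2000, 100
design = np.zeros((n_toas, n_bins))
bins = np.arange(n_toas) * n_bins // n_toas
design[np.arange(n_toas), bins] = 1.0

# An in-memory file (core driver, no backing store) keeps the sketch
# from touching the disk.
with h5py.File("sketch.hdf5", "w", driver="core", backing_store=False) as f:
    dset = write_compressed(f, "design_matrix", design)
    f.flush()
    stored_bytes = dset.id.get_storage_size()  # bytes allocated after compression
    restored = dset[...]

print(f"raw: {design.nbytes} bytes, stored: {stored_bytes} bytes")
```

The filters are lossless, so `restored` is identical to `design`; only the stored size changes.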

To do:

Note: this PR depends on #340; many of the apparent changes here are drawn from that PR, and this PR may well merge in any changes that PR needs.

The file format allows extensions for project-specific information. Here is what the current description file for the NANOGrav 15-year v1.1 data set looks like (it is Markdown; GitHub uses a weird flavour of Markdown that preserves line breaks, whereas normal viewers will reflow paragraphs):

NANOGrav 15-year data release derivative data

The NANOGrav project is releasing its 15-year data set, which is described in an upcoming paper and includes long-term timing data for 68 pulsars.

Pulsar timing begins with a set of pulse arrival times and fits a model to those arrival times. The usual output from this process is the best-fit model parameters and their uncertainties, and the residuals - the difference in time or phase between the predicted zero phase and the observed zero phase.

For some applications, for example searching for a gravitational-wave background, it is vital to include not just these residuals but their derivative with respect to each of the fit parameters. This allows construction of a linearized version of the timing model, which can often be analytically marginalized, resulting in tremendous speedups. Other applications for such linearized models include parameter searches in photon data.
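To make the analytic marginalization concrete, here is a minimal NumPy sketch under simplifying assumptions: white Gaussian noise with known per-TOA uncertainties, a flat prior on the linear parameter offsets, and a toy quadratic design matrix. The function name `marginalized_loglike` and all of the data are hypothetical and are not taken from Enterprise.

```python
import numpy as np


def marginalized_loglike(residuals, design, sigma):
    """Log-likelihood (up to an additive constant) of residuals r = M eps + n,
    n ~ N(0, diag(sigma^2)), with eps integrated out under a flat prior."""
    ninv = 1.0 / sigma**2
    mt_ninv = design.T * ninv              # M^T N^-1, shape (n_params, n_toas)
    sigma_inv = mt_ninv @ design           # M^T N^-1 M
    d = mt_ninv @ residuals                # M^T N^-1 r
    chol = np.linalg.cholesky(sigma_inv)   # sigma_inv = L L^T
    y = np.linalg.solve(chol, d)           # y @ y = d^T (M^T N^-1 M)^-1 d
    chi2 = residuals @ (ninv * residuals) - y @ y
    logdet_n = np.sum(np.log(sigma**2))
    logdet_s = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (chi2 + logdet_n + logdet_s)


rng = np.random.default_rng(0)
t = np.linspace(-1.0, 1.0, 200)                          # rescaled observing times
design = np.column_stack([np.ones_like(t), t, t**2])     # derivatives w.r.t. 3 parameters
sigma = np.full(t.size, 1.0)
noise = rng.normal(0.0, sigma)

# Any offset that lies in the span of the design matrix is absorbed by the
# marginalization, so it leaves the likelihood unchanged.
offset = design @ np.array([3.0, -2.0, 0.5])
ll_noise = marginalized_loglike(noise, design, sigma)
ll_shifted = marginalized_loglike(noise + offset, design, sigma)
```

Here `ll_noise` and `ll_shifted` agree to rounding error. The timing-model parameters never need to be sampled, which is where the speedup comes from.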

The purpose of this file is to provide the derivatives needed to construct this linear model, plus all other supporting data. It is stored in HDF5, a widely portable binary format that is extensible enough to permit project-specific information to be stored alongside standard values.

This text should accompany a collection of such files in plain-text form, and it should also be included in all such files as a dataset called "README".
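One way the text could be embedded and read back, sketched with h5py. Only the dataset name "README" comes from the description above; the helper names, the file name, and the placeholder text are assumptions made for this example.

```python
import h5py


def embed_readme(h5file, text):
    """Store the description as a scalar string dataset called "README"."""
    h5file.create_dataset("README", data=text)


def read_readme(h5file):
    """Return the README text; h5py 3 returns bytes for string datasets."""
    raw = h5file["README"][()]
    return raw.decode("utf-8") if isinstance(raw, bytes) else raw


readme_text = "# Example description\n\nPlaceholder text for this sketch.\n"

# In-memory file so the sketch leaves nothing on disk.
with h5py.File("readme_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    embed_readme(f, readme_text)
    recovered = read_readme(f)
```

A reader who only has the HDF5 file can then recover the documentation without the accompanying plain-text copy.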

This data

Timing results as of: 2022-03-14 21:56:13 +0000

Git hash: 78afc7978e267ae9d11ab5daf57e6438a56c528b

Generated: 2023-02-28 10:10:43

Generated by: Anne Archibald <Anne.Archibald@nanograv.org>

File contents

Referencing

If you do use this data for something, please reference both the NANOGrav 15-year data release paper and the DOI for this data set.

codecov[bot] commented 1 year ago

Codecov Report

Merging #341 (26b53b8) into master (5ef5ff4) will increase coverage by 0.48%. The diff coverage is 91.64%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #341      +/-   ##
==========================================
+ Coverage   88.37%   88.86%   +0.48%
==========================================
  Files          13       15       +2
  Lines        3012     3350     +338
==========================================
+ Hits         2662     2977     +315
- Misses        350      373      +23
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| enterprise/pulsar.py | `92.08% <85.88%> (+0.02%)` | :arrow_up: |
| enterprise/derivative_file.py | `88.75% <88.75%> (ø)` | |
| enterprise/h5format.py | `95.14% <95.14%> (ø)` | |

> **Legend**
> `Δ = absolute (impact)`, `ø = not affected`, `? = missing data`
> Last update 5ef5ff4...26b53b8.

aarchiba commented 1 year ago

I'm not quite sure what documentation is expected for Enterprise code.

paulthebaker commented 1 year ago

> I'm not quite sure what documentation is expected for Enterprise code.

Docstrings for all new things are the best way to do it. But there is a ton of existing code that doesn't yet have docstrings...

Certainly, any function/method that is user facing should have a docstring. For something that is more for internal uses and is pretty clear from the code, I wouldn't fret too much about it.

Something like write_dict_to_hdf5 probably doesn't need a docstring, but if you wanted to add one I wouldn't stop you.

vhaasteren commented 1 year ago

@aarchiba, with an HDF5 file format, it seems pretty straightforward to also allow creation of mock pulsar objects. The reason why you created the HDF5 pulsar class is that we sometimes need independence from PINT or Tempo2. This is even more so with simulations.

The other day I needed to generate an array of 10k pulsars. With Tempo2 or PINT that would take up too much memory and would take ages. So I wrote this MockPulsar Enterprise class that is super efficient and still allows all of the Enterprise models that I needed to test to be run: most Enterprise functions only require things defined in BasePulsar.

The HDF5 pulsar format seems like the best place to put this kind of functionality, and then it can be saved in a proper file format. Do you see any hiccups right away?
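For concreteness, a lightweight stand-in along these lines might look like the sketch below. It is not the MockPulsar class mentioned above: the class name `SketchPulsar`, its attribute names, and the factory `make_sketch_pulsar` are all hypothetical, and whether Enterprise's models would accept such an object is untested here.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SketchPulsar:
    """Hypothetical minimal pulsar container (not an Enterprise class)."""
    name: str
    toas: np.ndarray       # arrival times in seconds
    toaerrs: np.ndarray    # TOA uncertainties in seconds
    residuals: np.ndarray  # timing residuals in seconds
    Mmat: np.ndarray       # design matrix, shape (n_toas, n_params)


def make_sketch_pulsar(name, n_toas=50, span_s=10 * 365.25 * 86400.0,
                       toaerr=1e-7, seed=None):
    """Build a simulated pulsar from plain arrays, with no timing package."""
    rng = np.random.default_rng(seed)
    toas = np.sort(rng.uniform(0.0, span_s, n_toas))
    toaerrs = np.full(n_toas, toaerr)
    residuals = rng.normal(0.0, toaerrs)           # white noise only
    t = (toas - toas.mean()) / span_s              # rescaled for conditioning
    Mmat = np.column_stack([np.ones(n_toas), t, t**2])  # toy quadratic model
    return SketchPulsar(name, toas, toaerrs, residuals, Mmat)


# A small array of simulated pulsars; larger counts only change the range.
pulsars = [make_sketch_pulsar(f"SKETCH{i:04d}", seed=i) for i in range(100)]
```

Because each object holds only a handful of NumPy arrays, memory use scales with the number of TOAs rather than with the footprint of a full timing-model object.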

vhaasteren commented 1 year ago

The code in this PR has been converted to a separate package by @AaronDJohnson and myself, which can be found here. There will be a new Enterprise PR that needs to be merged in order for that package to work. Until that is ready, the branch can be found on my repo.

I am closing this PR.