rmjarvis / Piff

PSFs In the Full FOV. A software package for modeling the point-spread function (PSF) across the full field of view (FOV). Documentation:
http://rmjarvis.github.io/Piff/

Serialization for embedding in larger (Rubin) files #168

Open TallJimbo opened 1 month ago

TallJimbo commented 1 month ago

Rubin's been storing its Piff-generated PSFs via pickle thus far, and we all know that's a terrible idea. We've had a few vague thoughts about what we might do instead, but some of those are finally firming up and it's past time we did something.

We'd like to be able to store arbitrary Piff PSFs in a backwards-compatible way (future Piff versions can read files written by older versions) inside larger files. Most of those files will be FITS, but we don't want to assume they all will be. And while we could have Rubin-side code that introspects Piff objects of the types we actually use and handles the serialization, it would be better to do this directly in Piff, which would more naturally get support for all types of models, interpolants, etc.

Here are some possibilities:

  1. One option is Piff's existing FITS serialization, which we probably should have tried to use instead of pickle from the beginning. In many cases we could give Piff a fitsio.FITS object and let it append some additional HDUs. We'd need to make sure it didn't assume it was writing a full file, but I assume that's either already the case or not hard. And we'd probably need it to report something about what it did write, so that we can have pointers to those HDUs elsewhere in the file. The big downside of this approach is that it does require FITS, and that means in other contexts (and maybe in some FITS contexts where appending HDUs might be problematic) we'd have to write to FITS and then consider that a binary blob to be stuffed into some other file.

  2. If we're going to write a binary blob sometimes (and consider Piff-appended HDUs to be totally opaque in terms of how we document our files for science users in other places), maybe we should just always store Piff PSFs as binary blobs, but make it a Piff method that does that. The first version could even delegate to the FITS serialization, run the result through gzip or something, and prepend a serialization version number for backwards compatibility. If we later wanted to rewrite that without delegating to FITS, it'd be a relatively transparent change (as long as we could still read the FITS version in the future).

  3. I've been putting together a new pure-Python library ShoeFits for serialization on top of pydantic. On its own, pydantic lets you define dataclass-like structs and get them mapped to and from JSON automatically (with nice support for JSON schemas and hence format documentation), and what ShoeFits adds is the ability to pull numpy arrays out of those structs and save them elsewhere, with pointers replacing them in the JSON tree that are automatically dereferenced on read. It's not on PyPI yet, but it will be (pending formal Rubin DM adoption; this is still something we're discussing on an RFC), so I'm hoping adding it as an (optional) dependency of Piff might be possible. I think this is the best approach to take if you want the Piff representation in Rubin files to be human-readable or self-describing (this is not something DM requires, though it might be nice), and it's probably easier to implement and maintain than a bespoke non-FITS binary blob format.
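A minimal sketch of the versioned-blob idea in option 2, using only the stdlib. The helper names (`to_blob` / `from_blob`) and the magic bytes are hypothetical, not existing Piff API; the payload here stands in for whatever serialized bytes Piff would produce (e.g. an in-memory FITS file).

```python
import gzip
import struct

BLOB_MAGIC = b"PIFF"  # hypothetical magic bytes, not part of Piff


def to_blob(payload: bytes, version: int = 1) -> bytes:
    """Wrap serialized PSF bytes in a compressed blob with a
    4-byte magic string and a little-endian version number."""
    header = BLOB_MAGIC + struct.pack("<I", version)
    return header + gzip.compress(payload)


def from_blob(blob: bytes) -> tuple[int, bytes]:
    """Recover (version, payload) from a blob written by to_blob."""
    if blob[:4] != BLOB_MAGIC:
        raise ValueError("not a Piff blob")
    (version,) = struct.unpack("<I", blob[4:8])
    return version, gzip.decompress(blob[8:])
```

A reader would switch on `version` to pick the matching deserializer, which is what would let a delegate-to-FITS first version be swapped out later without breaking old files.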

Rubin DM can definitely contribute effort to any of these (probably mostly @PFLeget, maybe some from @jmeyers314 and me), and we're open to other options too if you have ideas.
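To illustrate the kind of transform option 3 describes (this is not ShoeFits itself, just a stdlib sketch of the array-extraction idea, with plain lists standing in for numpy arrays and a dict standing in for the side store):

```python
def extract_arrays(tree, store, threshold=4):
    """Replace long numeric lists in a JSON-ready tree with pointer
    strings, saving the data in `store`. Stand-in for the ShoeFits
    array-extraction step described above."""
    if isinstance(tree, dict):
        return {k: extract_arrays(v, store, threshold) for k, v in tree.items()}
    if isinstance(tree, list):
        if len(tree) >= threshold and all(isinstance(x, (int, float)) for x in tree):
            key = f"array/{len(store)}"
            store[key] = tree
            return f"ptr:{key}"
        return [extract_arrays(v, store, threshold) for v in tree]
    return tree


def resolve_pointers(tree, store):
    """Inverse transform: dereference pointer strings on read."""
    if isinstance(tree, dict):
        return {k: resolve_pointers(v, store) for k, v in tree.items()}
    if isinstance(tree, list):
        return [resolve_pointers(v, store) for v in tree]
    if isinstance(tree, str) and tree.startswith("ptr:"):
        return store[tree[4:]]
    return tree
```

The slimmed-down tree stays human-readable JSON while the bulk data lives elsewhere (in real ShoeFits, in separate FITS extensions).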

rmjarvis commented 1 month ago

Case 1 is already enabled. The UI is

        psf._write(fits, extname, logger)

Indeed the normal front end writer, which creates the FITS file, is just implemented as

        with fitsio.FITS(file_name,'rw',clobber=True) as f:
            self._write(f, 'psf', logger)

If 'psf' is too generic, you can change that to something longer, like 'piff_psf' or whatever when you call the underscore version.

rmjarvis commented 1 month ago

Relevant links: _write: https://github.com/rmjarvis/Piff/blob/releases/1.4/piff/psf.py#L731

The corresponding _read on an existing open FITS file: https://github.com/rmjarvis/Piff/blob/releases/1.4/piff/psf.py#L771

rmjarvis commented 1 month ago

Case 3 sounds reasonable. I'd definitely consider any PRs towards that end.

We can also do something like what I do in TreeCorr where I have several writers and readers, so the first step is to create one of these according to the desired file format. And then the actual I/O all works through methods of these classes.

cf. https://github.com/rmjarvis/TreeCorr/blob/releases/5.0/treecorr/writer.py https://github.com/rmjarvis/TreeCorr/blob/releases/5.0/treecorr/reader.py

The front end for this looks like (https://github.com/rmjarvis/TreeCorr/blob/releases/5.0/treecorr/zzcorrelation.py#L191)

        with make_writer(file_name, precision, file_type, self.logger) as writer:
            self._write(writer, None, write_patch_results, write_cov=write_cov)

Then the underscore write functions work with methods of the writer object. cf. https://github.com/rmjarvis/TreeCorr/blob/releases/5.0/treecorr/corr2base.py#L1700

So we could maintain a FITS writer/reader, and something else that Rubin DM could maintain. We could even get fancy and let you maintain that external to the Piff repo if that makes more sense and use a registration step to let you register your writer class to be used by Piff.
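The registration step could look something like this sketch (the names `register_writer` and `make_writer` and the class layout are illustrative, not existing Piff or TreeCorr API):

```python
# Registry mapping file_type strings to writer classes. External packages
# (e.g. a Rubin DM ShoeFits writer) would add themselves at import time.
_WRITERS = {}


def register_writer(file_type):
    """Class decorator that registers a writer under a file_type name."""
    def decorator(cls):
        _WRITERS[file_type.lower()] = cls
        return cls
    return decorator


def make_writer(file_name, file_type="fits"):
    """Look up and construct the writer for the requested format."""
    try:
        cls = _WRITERS[file_type.lower()]
    except KeyError:
        raise ValueError(f"No writer registered for {file_type!r}")
    return cls(file_name)


@register_writer("fits")
class FitsWriter:
    """Stand-in for a FITS-backed writer; a real one would wrap fitsio."""
    def __init__(self, file_name):
        self.file_name = file_name
        self.tables = {}

    def write_table(self, extname, data):
        self.tables[extname] = data
```

The underscore `_write` methods would then only ever call methods like `write_table` on whatever writer object they are handed, never touching fitsio directly, so an externally maintained writer class plugs in without Piff knowing about it.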

TallJimbo commented 1 month ago

It'd be really easy to map case 3 to reader / writer interfaces like the TreeCorr ones, and it does look like that would help a new serialization format share code with the existing FITS one. And while I don't think Rubin DM needs to have something external to the Piff repo, that might help sidestep dependency ordering problems in the future.

So I think I'll start by trying to put together a PR that refactors the FITS I/O into reader / writer classes like the TreeCorr ones, while making sure the methods on those are compatible with what I'd want to implement them with ShoeFits.

If that all goes well, I'll put together a ShoeFits implementation in a Rubin DM repo, and we can discuss whether to upstream it to Piff later.