scexao-org / vampires_dpp

Tools for processing VAMPIRES data
MIT License

Skip intermediate data products by default #4

Closed mileslucas closed 1 year ago

mileslucas commented 1 year ago

Right now every step of the pipeline takes a path to a FITS file, writes a new FITS file, and returns the new path. This leads to a lot of data copying and can balloon the working directory size, because each step essentially produces a 1:1 copy of the raw data for the steps that follow.

This can easily take a ~200 GB raw dataset and make it take >1 TB of space.
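As a rough back-of-the-envelope check on those numbers, here is a small sketch that assumes every step saves one full-size copy of its input. The step names are hypothetical placeholders, not the pipeline's actual stage list.

```python
# Rough estimate of working-directory size when every step writes a full copy.
# The step names below are hypothetical placeholders, not the real pipeline stages.
RAW_SIZE_GB = 200


def working_dir_size_gb(raw_size_gb, steps, keep_intermediates=True):
    """Return total disk usage: raw data plus one full-size copy per saved step."""
    copies = len(steps) if keep_intermediates else 0
    return raw_size_gb * (1 + copies)


steps = ["calibrate", "frame_select", "register", "split_beams"]

print(working_dir_size_gb(RAW_SIZE_GB, steps))         # every step saved to disk
print(working_dir_size_gb(RAW_SIZE_GB, steps, False))  # intermediates skipped
```

Under this assumption, four saved steps on top of a 200 GB raw dataset already reach about 1 TB, while skipping the intermediates leaves only the raw data plus the much smaller metric files and collapsed frames.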

The frame selection and registration steps already have a somewhat decoupled interface between measuring the metrics/offsets and modifying the data. My path forward is to look at how to modify the pipeline so that it only uses the metric and offset files, unless a user specifically requests the intermediate FITS files.
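To illustrate what "measuring without modifying" could look like, here is a minimal sketch. It is not the pipeline's actual code: the function names, the CSV columns, and the use of the peak pixel as both the quality metric and the position estimate are all assumptions made for the example.

```python
import csv
import io

import numpy as np


def measure_cube(cube):
    """Return one row per frame with a quality metric and the peak position.

    The cube itself is never modified; only small per-frame numbers are kept.
    """
    rows = []
    for index, frame in enumerate(cube):
        # Stand-in measurement: brightest pixel value and its (row, column).
        cy, cx = np.unravel_index(np.argmax(frame), frame.shape)
        rows.append(
            {"frame": index, "metric": float(frame.max()), "cy": int(cy), "cx": int(cx)}
        )
    return rows


def write_measurements(rows, handle):
    """Write the measurements as CSV to any text file handle."""
    writer = csv.DictWriter(handle, fieldnames=["frame", "metric", "cy", "cx"])
    writer.writeheader()
    writer.writerows(rows)


# Toy cube: 3 frames of 5x5 pixels, one bright pixel each.
cube = np.zeros((3, 5, 5))
cube[0, 2, 2] = 10.0
cube[1, 3, 2] = 8.0
cube[2, 2, 1] = 6.0

rows = measure_cube(cube)
buffer = io.StringIO()  # stands in for a CSV file on disk
write_measurements(rows, buffer)
print(buffer.getvalue())
```

The only product here is a few numbers per frame, so the file is tiny compared to a copy of the cube.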

Initially I was hesitant to do this by default because it would potentially slow down repeated runs of the pipeline. Let's look at that case, though. If we already have metrics and offsets measured, we only need to "remake" frame-selected or registered data if we want to change our collapse method or something earlier in the pipeline. For most reruns the changes come after this step, so we don't really need the registered or selected data on disk. For cases where the selection or registration needs to be redone, the files would have to be remade no matter what, so it's time lost either way.

mileslucas commented 1 year ago

This change would also be a good opportunity for the FastPDI workflow to change so that Wollaston states aren't split into separate files. That was kind of a mental idea to begin with in terms of data volume. In essence, all of the frame selection and registration is going to get moved into the collapse part of the pipeline. So, as long as I have a CSV for the metrics and the offsets, I can take an input cube and create a collapsed frame. For the Wollaston prism, this means I can measure the offsets for each beam and then from the same calibrated file produce two collapsed files.
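Here is a rough sketch of that idea: one calibrated cube plus a CSV of per-beam measurements yields two collapsed frames, with no intermediate cubes written. Everything in it is an assumption for illustration: the column names, the beam labels "A" and "B", the quantile-based frame selection, the whole-pixel cutout used in place of real sub-pixel registration, and the mean collapse.

```python
import csv
import io

import numpy as np

# Hypothetical per-frame measurements, one row per frame and beam.
# Column names are illustrative, not the pipeline's real CSV schema.
MEASUREMENTS = """\
frame,beam,metric,cy,cx
0,A,10,4,4
0,B,5,4,12
1,A,8,5,4
1,B,4,5,12
2,A,6,4,3
2,B,3,4,11
3,A,4,3,5
3,B,2,3,13
"""


def collapse_beam(cube, rows, beam, keep_fraction=0.5, half_width=2):
    """Frame-select, register, and collapse one beam in a single pass."""
    rows = [r for r in rows if r["beam"] == beam]
    metrics = np.array([float(r["metric"]) for r in rows])
    cutoff = np.quantile(metrics, 1 - keep_fraction)
    cutouts = []
    for row, metric in zip(rows, metrics):
        if metric < cutoff:
            continue  # frame selection: drop frames below the cut-off
        cy, cx = int(row["cy"]), int(row["cx"])
        frame = cube[int(row["frame"])]
        # Registration stand-in: whole-pixel cutout centred on the measured position.
        cutouts.append(
            frame[cy - half_width:cy + half_width + 1, cx - half_width:cx + half_width + 1]
        )
    return np.mean(cutouts, axis=0)


rows = list(csv.DictReader(io.StringIO(MEASUREMENTS)))

# Toy "calibrated" cube: 4 frames of 8x16 pixels with one bright pixel per beam.
cube = np.zeros((4, 8, 16))
for row in rows:
    cube[int(row["frame"]), int(row["cy"]), int(row["cx"])] = float(row["metric"])

# Two collapsed frames from the same input cube, one per Wollaston beam.
collapsed = {beam: collapse_beam(cube, rows, beam) for beam in ("A", "B")}
print(collapsed["A"].shape, collapsed["A"][2, 2], collapsed["B"][2, 2])
```

Because selection and registration happen inside the collapse loop, changing the collapse method or the selection cut-off only requires re-reading the calibrated cube and the CSV.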