jluastro / PopSyCLE

Population Synthesis for Compact-object Lensing Events
https://popsycle.readthedocs.io/en/latest/

Converting h5 file data format from float to compound datatypes #18

Closed MichaelMedford closed 4 years ago

MichaelMedford commented 4 years ago


The introduction of additional photometric systems and new columns of data motivates a new file format for our h5 files. This datatype needs to be flexible: it must work when the number of data columns varies depending on whether the user has chosen to add additional photometric systems. The addition of new columns to the h5 files also makes it difficult to keep track of which files contain which columns. We therefore need a file format that is self-describing, carrying the information about the file's contents within the file itself.

This pull request converts our h5 files to hdf5 compound datatypes, covering both the creation of these files in perform_pop_syn and the reading of these files in calc_events. All changes are hidden from the user; the functions and arguments are the same as in master. The comments within the file should explain what is going on, but I will call out a few key changes below. Line numbers are approximate.

_bin_lb_hdf5 : creating a comp_dtype_arr (line 1100) The key to this flexible file format is that, at binning time, the code does not know in advance which columns are in the data. This snippet of code loops over the obj_arr dictionary and, by asking for the names of the columns, assigns each one a tuple of the key's name and a variable type. The list of tuples is then wrapped with np.dtype, which hf.create_dataset accepts as a valid datatype. This datatype is compound, meaning that it is made up not of a single variable type but of a concatenation of key names and variable types.
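For illustration, here is a minimal sketch of that pattern; the column names, the `'f8'` type, and the dataset key are assumptions, not PopSyCLE's actual schema:

```python
import numpy as np
import h5py

# Sketch of building a compound dtype from a dictionary of column arrays.
# The dictionary name obj_arr matches the PR; the columns are illustrative.
obj_arr = {'mass': np.random.rand(100),
           'px': np.random.rand(100),
           'ubv_I': np.random.rand(100)}

# One (name, type) tuple per column, wrapped in np.dtype to produce a
# compound datatype that hf.create_dataset accepts.
comp_dtype_arr = [(name, 'f8') for name in obj_arr]
comp_dtype = np.dtype(comp_dtype_arr)

# Pack the columns into a single structured array of shape (n_stars,).
n_stars = len(obj_arr['mass'])
data = np.empty(n_stars, dtype=comp_dtype)
for name in obj_arr:
    data[name] = obj_arr[name]

with h5py.File('example.h5', 'w') as hf:
    hf.create_dataset('l0.0_b0.0', data=data)  # dataset key is illustrative
```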

Datatype and size of hdf5 datasets (line 1125) In the previous format, an hdf5 dataset had size (number of columns, number of stars). The new format has size (number of stars,). Note that there is no second entry in the tuple: a compound datatype has no length in the sense of a "number of columns". Instead, numpy and h5py concatenate all of the columns into a single compound element per row, leaving a one-dimensional dataset. All further references to the hdf5 datasets therefore find the number of stars not with dataset.shape[1] but with len(dataset) or dataset.shape[0].
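A short sketch of the shape difference (the array contents and column names here are placeholders):

```python
import numpy as np
import h5py

# Old format: a plain 2-d float array, with columns along the first axis.
old_data = np.zeros((3, 1000))          # (n_columns, n_stars)
print(old_data.shape[1])                # 1000 stars

# New format: a 1-d structured array; the columns live in the dtype.
comp_dtype = np.dtype([('mass', 'f8'), ('px', 'f8'), ('ubv_I', 'f8')])
new_data = np.zeros(1000, dtype=comp_dtype)

with h5py.File('example.h5', 'w') as hf:
    dset = hf.create_dataset('l0.0_b0.0', data=new_data)
    print(dset.shape)        # (1000,) -- one dimension only
    print(len(dset))         # 1000, the number of stars
    print(dset.dtype.names)  # ('mass', 'px', 'ubv_I')
```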

Accessing the hdf5 datasets Once opened from the file, the hdf5 datasets are now loaded into python as numpy recarrays and accessed just like astropy tables. They can be indexed by column name, going from dataset[col_idx['mass']] to dataset['mass']. Renaming columns is done with rfn.rename_fields (line 1669), where rfn comes from import numpy.lib.recfunctions as rfn. Combining slices of these tables (such as joining the source and lens rows to create the events table) is done with rfn.merge_arrays (line 1672). Adding columns to these datasets is done with rfn.append_fields (line 1676). Saving these datasets to an astropy table at the end of calc_events is done simply by wrapping the object in Table, since astropy's Table automatically converts numpy recarrays.
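A minimal sketch of those access patterns; the column names and the source/lens arrays are illustrative stand-ins, not PopSyCLE's actual data:

```python
import numpy as np
import numpy.lib.recfunctions as rfn
from astropy.table import Table

dtype = np.dtype([('mass', 'f8'), ('px', 'f8')])
sources = np.zeros(5, dtype=dtype)   # stand-in for the source rows
lenses = np.zeros(5, dtype=dtype)    # stand-in for the lens rows

# Index by column name, just like an astropy table.
source_masses = sources['mass']

# Rename columns so the source and lens fields do not collide.
sources = rfn.rename_fields(sources, {'mass': 'mass_S', 'px': 'px_S'})
lenses = rfn.rename_fields(lenses, {'mass': 'mass_L', 'px': 'px_L'})

# Merge the source and lens rows side by side into an events-style array.
events = rfn.merge_arrays((sources, lenses), flatten=True)

# Append a new column.
events = rfn.append_fields(events, 'theta_E', np.zeros(5), usemask=False)

# Wrap in Table at the end; astropy converts recarrays automatically.
events_table = Table(events)
```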

Conversion function: convert_h5_array_dtype_to_compound_dtype (converter.py) Updating to this new file format runs the risk of making h5 files generated with the original format difficult to work with. I have created a conversion function, convert_h5_array_dtype_to_compound_dtype, that takes the filename of an h5 file written in the original format (base.h5) and produces a new h5 file written in the new format (base.compound_dtype.h5). You may find it useful to rename the old file as well, specifying its datatype in the name, but I leave that out of the function as a matter of personal preference. I considered writing a function to convert back to the old data format but struggled to imagine a use case. If this would be helpful, I can write such a function as well.
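For reference, a hedged sketch of what such a converter looks like; the real implementation lives in converter.py, and the col_names argument here is an assumption about how column order was tracked in the old format:

```python
import numpy as np
import h5py

def convert_to_compound_sketch(filename, col_names):
    """Illustrative sketch only -- the actual converter is
    convert_h5_array_dtype_to_compound_dtype in converter.py.
    Assumes the old format stores columns along the first axis."""
    base = filename.replace('.h5', '')
    with h5py.File(filename, 'r') as old_hf, \
            h5py.File(base + '.compound_dtype.h5', 'w') as new_hf:
        for key in old_hf:
            arr = old_hf[key][:]                      # (n_cols, n_stars)
            comp_dtype = np.dtype([(name, 'f8') for name in col_names])
            data = np.empty(arr.shape[1], dtype=comp_dtype)
            for i, name in enumerate(col_names):
                data[name] = arr[i]                   # copy column by column
            new_hf.create_dataset(key, data=data)
```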


I want to leave a note about how this new data format affects speed. Speed testing was performed with the new datatype to make sure that the improvements to readability and usability were not causing significant increases in compute time. That work measured the effects of implementing the new datatype in catalogs with 1e6 and 1e7 stars and n_proc=4:

The net effect is a slight performance hit from introducing the new datatype. Fortunately the total slowdown is small, and it falls on the part of the code that is currently the easiest to parallelize. Ramping up n_proc in calc_events helps to counter it, which can now be done more effectively at scale on the NERSC supercomputer using the run.py script.

MichaelMedford commented 4 years ago

If all of the concerns have been addressed, can we move forward with merging this branch into master? I will then take care of tagging and generating a release. @caseylam @jluastro

MichaelMedford commented 4 years ago

I just reran the test field with master:v1.0.1 and with hdf5_compound, executing perform_pop_syn, calc_events, and refine_events with seed=0. Both produced the same 22 events with identical parameters for all fields. Converting the old results to the new format with the converter also produces identical results! I also reran with seed=1 just in case; same results. @caseylam @jluastro