Closed: MichaelMedford closed this pull request 4 years ago.
If all of the concerns have been addressed, can we move forward with merging this branch into master? I will then take care of tagging and generating a release. @caseylam @jluastro
I just reran the test field with `master:v1.0.1` and with `hdf5_compound`, executing `perform_pop_syn`, `calc_events`, and `refine_events` with `seed=0`. Both produced the same 22 events with identical parameters for all fields. Converting the old results to the new results with the converter also produces identical results! I also reran with `seed=1` just in case; same results. @caseylam @jluastro
Converting h5 file data format from float to compound datatypes
The introduction of additional photometric systems and new columns of data motivates a new file format for our h5 files. This datatype needs to be flexible: the number of data columns varies depending on whether the user has selected additional photometric systems. The addition of new columns to the h5 files also makes it difficult to keep track of which files contain which columns. We therefore need a file format that is self-describing, carrying the information about its contents inside the file itself.
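For context, here is a minimal sketch of how a compound datatype carries its column names inside the data itself. The column names below are made up for illustration and are not the code's actual columns:

```python
import numpy as np

# Hypothetical columns; the real data dictionary has many more,
# and the exact set depends on which photometric systems are selected.
obj_arr = {
    'mass': np.array([0.3, 1.1, 2.4]),
    'ubv_I': np.array([21.2, 19.8, 18.5]),
}

# Build a compound dtype from whatever columns happen to be present.
comp_dtype = np.dtype([(name, col.dtype) for name, col in obj_arr.items()])

# Pack the columns into a single structured array of shape (n_stars,).
n_stars = len(obj_arr['mass'])
data = np.empty(n_stars, dtype=comp_dtype)
for name, col in obj_arr.items():
    data[name] = col

# The array is 1-dimensional: "columns" live inside the compound dtype,
# so the number of stars is len(data) or data.shape[0], never data.shape[1].
print(data.shape)        # (3,)
print(data.dtype.names)  # ('mass', 'ubv_I')
print(data['mass'])
```

An array like this can be handed directly to h5py's `create_dataset`, so the resulting h5 file records its own column names.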
This pull request converts our h5 files to hdf5 compound datatypes, including the creation of these files in `perform_pop_syn` and the reading of these files in `calc_events`. All changes are hidden from the user, maintaining the same functions and arguments as are currently in master. The comments within the file hopefully explain what is going on, but I will call out a few key changes. Line numbers are approximate.

**`_bin_lb_hdf5`: creating a `comp_dtype_arr` (line 1100)**

The key to this flexible file format is that, at the time the data are binned, it is not known which columns the data contain. This snippet of code loops over the `obj_arr` dictionary and, by asking for the names of its columns, assigns each one a tuple of the key's name and a variable type. The result is then wrapped with `np.dtype`, which `hf.create_dataset` accepts as a valid datatype. This datatype is compound, meaning that it is made up not of one variable type but of a concatenation of key names and variable types.

**Datatype and size of hdf5 datasets (line 1125)**

In the previous format, the size of an hdf5 dataset was (number of stars, number of columns). The new format has a size of (number of stars,). Note the blank in the tuple, where the second size is undefined: a compound datatype doesn't have a length in the sense of "number of columns". Instead, numpy and h5py concatenate all of the columns into a single 0-dimensional object per star. All further references to the hdf5 datasets therefore ask not for `dataset.shape[1]`, but instead use `len(dataset)` or `dataset.shape[0]` to find the number of stars.

**Accessing the hdf5 datasets**

Once opened from the file, the hdf5 datasets are now loaded into python as numpy recarrays and accessed just the same as astropy tables. They can be indexed by column name, going from `dataset[col_idx['mass']]` to `dataset['mass']`. Changing column names is done with `rfn.rename_fields` (line 1669), using `import numpy.lib.recfunctions as rfn`. Combining slices of these tables (such as combining the source and lens rows to create the events table) is done using `rfn.merge_arrays` (line 1672). Adding columns to these datasets is done with `rfn.append_fields` (line 1676). Saving these datasets to an astropy table at the end of `calc_events` is done simply by wrapping the object in `Table`, as the astropy `Table` object automatically converts numpy recarrays.

**Conversion function:**
`convert_h5_array_dtype_to_compound_dtype` (converter.py)

Updating to this new file format runs the risk of making h5 files generated with the original format difficult to work with. I have created a conversion function `convert_h5_array_dtype_to_compound_dtype` that takes in the filename of an h5 file written in the original format (`base.h5`) and produces a new h5 file written in the new format (`base.compound_dtype.h5`). You may find it useful to rename the old file as well, specifying its datatype in the filename, but I leave that out of the function as personal preference. I considered writing a function to convert back to the old data format, but struggled to imagine a use case. If that would be helpful, I can write such a function as well.

I want to leave a note about how this new data format affects speed. Speed testing was performed with the new datatype to make sure that the improvements to readability and usability were not causing significant increases in compute time. That work measured the effect of the new datatype on catalogs with 1e6 and 1e7 stars and `n_proc=4`, timing `perform_pop_syn`, `calc_events`, and the combined `perform_pop_syn` and `calc_events` pipeline execution. The net effect is a slight performance hit from introducing the new datatype. Fortunately the total slowdown is small, and it falls on the part of the code that is currently the easiest to parallelize. Ramping up `n_proc` in `calc_events` helps to counter this, which can now be done more effectively at scale on the NERSC supercomputer using the `run.py` script.
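As an illustration of the `numpy.lib.recfunctions` operations described above, here is a sketch with made-up source and lens columns (the real tables have many more fields, and the real column names differ):

```python
import numpy as np
import numpy.lib.recfunctions as rfn

# Hypothetical source and lens rows with identical (made-up) columns.
dt = np.dtype([('mass', 'f8'), ('ubv_I', 'f8')])
sources = np.array([(0.8, 20.1), (1.2, 19.3)], dtype=dt)
lenses = np.array([(0.5, 22.4), (0.9, 21.7)], dtype=dt)

# Rename columns so the source and lens fields stay distinct.
sources = rfn.rename_fields(sources, {'mass': 'mass_S', 'ubv_I': 'ubv_I_S'})
lenses = rfn.rename_fields(lenses, {'mass': 'mass_L', 'ubv_I': 'ubv_I_L'})

# Merge source and lens rows side by side into an events table.
events = rfn.merge_arrays((sources, lenses), flatten=True)

# Append a derived column; usemask=False keeps a plain structured array.
events = rfn.append_fields(events, 'mass_ratio',
                           events['mass_L'] / events['mass_S'],
                           usemask=False)

print(events.dtype.names)
# Wrapping in astropy's Table converts the structured array automatically:
#   from astropy.table import Table; Table(events)
```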
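And a sketch of the core of the array-to-compound conversion, again with hypothetical column names; the actual `convert_h5_array_dtype_to_compound_dtype` in converter.py additionally handles opening the original file and writing the new one:

```python
import numpy as np

# Old format: a plain 2-D float array of shape (n_stars, n_cols),
# with column meaning tracked externally in a col_idx dictionary.
col_idx = {'mass': 0, 'ubv_I': 1}          # hypothetical columns
old = np.array([[0.3, 21.2],
                [1.1, 19.8],
                [2.4, 18.5]])

# New format: a compound-dtype array of shape (n_stars,) that carries
# its own column names, so col_idx is no longer needed.
comp_dtype = np.dtype([(name, old.dtype) for name in col_idx])
new = np.empty(old.shape[0], dtype=comp_dtype)
for name, i in col_idx.items():
    new[name] = old[:, i]

# The converted columns match the originals exactly.
assert np.all(new['mass'] == old[:, col_idx['mass']])
# In the real converter this array would then be written out with h5py.
```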