Implemented a number of key changes to the way DataFrames are stored and loaded.
Problems that this pull request fixes:
Previously, the metadata DataFrame was merged with the main DataFrame prior to being saved to disk. This wasted a lot of storage space on information that is redundant or unnecessary for most purposes, e.g. cluster membership, the "good" flag, etc.
Meanwhile, input parameters to load_df() were being inferred at runtime from the filename, which is not ideal because filenames can be modified.
Additionally, make_df() has many input parameters that can be modified without changing the output DataFrame filename, meaning that successive runs of make_df() in which e.g. flux S/N cuts were toggled on/off would overwrite the previous outputs.
Aside from checking the filename, load_df() performed no checks to verify that the loaded file actually matched the input parameters to make_df(). The only workaround was to use the df_fname_tag argument, which is not ideal. It would technically be possible to check the values in the DataFrame itself, but loading a full-sized DataFrame takes far too long for this to be feasible.
Implemented changes:
make_df() has been heavily modified:
The metadata DataFrame is no longer merged with the main DataFrame before being saved to file. The metadata DataFrame is merged with the full DataFrame prior to the call to add_columns() (because this function requires some of the metadata columns, e.g., distances), but these columns are removed before being saved to file.
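The merge-then-drop flow could be sketched as follows. Note that add_columns() is stood in for by a hypothetical function, and the column names ("D_L (Mpc)", "Flux", etc.) are illustrative, not the package's actual schema:

```python
import pandas as pd

# Hypothetical stand-in for add_columns(), which requires metadata
# columns such as distances to compute derived quantities.
def add_columns(df):
    df = df.copy()
    df["Quantity"] = df["Flux"] * df["D_L (Mpc)"] ** 2  # uses a metadata column
    return df

df = pd.DataFrame({"ID": [1, 1, 2], "Flux": [0.5, 0.7, 1.2]})
df_metadata = pd.DataFrame({"ID": [1, 2], "D_L (Mpc)": [10.0, 25.0]})

# Merge the metadata in only for the add_columns() step...
df_full = df.merge(df_metadata, on="ID", how="left")
df_full = add_columns(df_full)

# ...then remove the metadata columns again before saving to file.
metadata_cols = [c for c in df_metadata.columns if c != "ID"]
df_to_save = df_full.drop(columns=metadata_cols)
```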
pandas.HDFStore is now used to save the files rather than df.to_hdf(). The full DataFrame, the metadata DataFrame, plus a "spaxelsleuth parameters" Series are now saved as separate entries within the file. Crucially, this allows us to record all input arguments to make_df() without significantly increasing the filesize.
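The new storage layout could be sketched like this (pandas.HDFStore requires the PyTables package; the key names and parameter names here are illustrative, not necessarily those used in the code):

```python
import tempfile
from pathlib import Path
import pandas as pd

df = pd.DataFrame({"ID": [1, 2], "Flux": [0.5, 1.2]})
df_metadata = pd.DataFrame({"ID": [1, 2], "z": [0.01, 0.05]})
# Record of the make_df() input arguments (illustrative names/values)
ss_params = pd.Series({"survey": "sami", "bin_type": "default", "eline_SNR_min": 5})

fname = Path(tempfile.mkdtemp()) / "example_df.hd5"
# Save all three objects as separate entries within a single HDF5 file
with pd.HDFStore(fname) as store:
    store["df"] = df                    # the full DataFrame
    store["df_metadata"] = df_metadata  # the metadata DataFrame
    store["ss_params"] = ss_params      # the "spaxelsleuth parameters" Series

# Reading back just the parameters is cheap: there is no need to load
# the full DataFrame to inspect which arguments produced the file.
with pd.HDFStore(fname, mode="r") as store:
    params = store["ss_params"]
```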
A timestamp is now appended to the filenames. This makes it more straightforward to access a specific DataFrame file in load_df() if needed.
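A minimal sketch of the timestamped naming, assuming a strftime-style stamp (the filename stem shown here is purely illustrative):

```python
from datetime import datetime

# Append a timestamp so successive runs of make_df() produce distinct
# files rather than overwriting one another (hypothetical naming scheme).
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
fname = f"sami_default_recom_{timestamp}.hd5"
```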
load_df() has been made much more flexible:
In particular, you can now directly pass in kwargs that were used in make_df() to streamline the process of loading the files.
Additionally, load_df() iterates through all valid .hd5 files in output_path and opens the ss_params Series in each to check for matches with the specified parameters. If multiple matching files are found, the user is prompted to select one interactively.