Investigate ways to make DataFrame storage more efficient

hzovaro commented 1 year ago

For example:

Using a different library like h5py to store the output rather than pandas
Computing "simple" things, e.g. logs of various quantities, at runtime
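As a rough illustration of the second idea, the sketch below stores only the linear quantity and derives its log when the DataFrame is loaded. The helper `add_log_columns` and the column names `HALPHA` and `log HALPHA` are made up for this example and are not taken from spaxelsleuth.

```python
import numpy as np
import pandas as pd


def add_log_columns(df, cols):
    """Return a copy of df with a 'log <col>' column for each name in cols."""
    df = df.copy()
    for col in cols:
        values = df[col].to_numpy(dtype=float)
        # Non-positive values have no real logarithm, so they become NaN.
        with np.errstate(divide="ignore", invalid="ignore"):
            df[f"log {col}"] = np.where(values > 0, np.log10(values), np.nan)
    return df


# Only the linear column would be written to disk (hypothetical column name).
df_stored = pd.DataFrame({"HALPHA": [100.0, 10.0, 0.0]})

# The log column is recomputed after loading instead of being stored.
df_loaded = add_log_columns(df_stored, ["HALPHA"])
```

The trade-off is a small amount of compute on every load in exchange for roughly one fewer stored column per derived quantity.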

hzovaro commented 1 year ago

Try:

PyTables
SQLite
Parquet
Apache Arrow

hzovaro commented 1 year ago

A good solution may be to store the data using

import pandas as pd

with pd.HDFStore(fname) as store:
    store.put("data", df_data)
    # Attach the metadata to the stored node so it stays paired with the data.
    store.get_storer("data").attrs.metadata = metadata

and read it back in using

with pd.HDFStore(fname) as store:
    df_data = store["data"]
    metadata = store.get_storer("data").attrs.metadata

With this, the metadata are stored on disk directly alongside the data, which is much better than assuming the metadata (e.g. settings) at runtime and then manually adding them back in. We can still add the metadata as columns at runtime, as we have been doing up until now.

hzovaro commented 1 year ago

To do:

[ ] run the Storage methods notebook on the server using the full SAMI DataFrame to see how much disk space would be saved by dropping metadata columns.
[ ] try implementing this in make_sami_df() and load_sami_df().
[ ] in load_sami_df(), add checks to read in the metadata to make sure the file matches the requested inputs, e.g. eline_SNR_min etc.
[x] As it currently stands, which columns could be moved to a metadata dict, e.g. flags? If these are removed from the DataFrame, how would they be passed between the functions that use them, e.g. those in dqcut? Just pass them as kwargs? Currently, flags etc. are passed to add_columns() as kwargs and are manually added as columns to the DataFrame in its last few lines, so this won't be a problem.
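A minimal sketch of the check proposed for load_sami_df(), assuming the stored metadata can be treated as a dict-like object. The function name `check_metadata` and the `correct_extinction` key are invented for illustration; only `eline_SNR_min` comes from the list above.

```python
def check_metadata(stored, requested):
    """Raise ValueError if any requested parameter disagrees with the stored metadata."""
    missing = object()  # sentinel, so a stored value of None is still compared
    mismatches = {}
    for key, wanted in requested.items():
        found = stored.get(key, missing)
        if found is missing or found != wanted:
            mismatches[key] = ("<not stored>" if found is missing else found, wanted)
    if mismatches:
        details = ", ".join(
            f"{key}: file has {found!r}, requested {wanted!r}"
            for key, (found, wanted) in mismatches.items()
        )
        raise ValueError(f"Stored metadata does not match the request ({details})")


# Metadata as it might be read back from the file (illustrative values).
stored = {"eline_SNR_min": 5, "correct_extinction": True}

# Matching request: returns without raising.
check_metadata(stored, {"eline_SNR_min": 5})
```

Raising an error here, rather than silently loading the file, makes it harder to analyse a DataFrame that was built with different settings from the ones requested.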

hzovaro commented 9 months ago

Updated to-do list Feb 2024:

[x] Store metadata (e.g. spaxelsleuth settings) in a Series and save it as a separate attribute in the HDF file (this will require changes to io.make_df).
- [x] Remove lines in add_columns() that add these params to the DataFrame.
[x] Update io.read_df() to read the metadata and double-check that the input parameters match what is in the file
- [x] To check that this works with kwargs, make 2 DataFrames with different settings that are not recorded in the DataFrame
[x] Instead of merging the metadata df with the main DataFrame, store a copy in the HDF file and merge at run time to save disk space
[x] Make sure to remove _storage_improvements from the filename
[x] Once tests run without errors, update the references.
[x] Should we return the ss_params Series from load_df()? Perhaps return it only when an optional argument is set?
[x] Change name of 'Spaxelsleuth params' to a simpler string to avoid warnings
[x] Make sure integration tests run
[x] Run examples to be safe
[x] Remove output_path arg from load_df()
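To illustrate the "merge at run time" item, here is a small sketch of a left join between a per-spaxel DataFrame and a per-galaxy metadata DataFrame. It assumes the two are keyed on "ID"; the `HALPHA` and `z` columns and the variable names are placeholders, and the actual merge in spaxelsleuth may differ.

```python
import pandas as pd

# One row per spaxel: this is the large table.
df_spaxels = pd.DataFrame({"ID": [1, 1, 2], "HALPHA": [3.0, 4.0, 5.0]})

# One row per galaxy: this is the small table kept separately in the HDF file.
df_metadata = pd.DataFrame({"ID": [1, 2], "z": [0.01, 0.02]})

# Left join at load time: every spaxel row picks up its galaxy's metadata.
# validate="many_to_one" raises if an ID appears more than once in df_metadata.
df_merged = df_spaxels.merge(df_metadata, on="ID", how="left", validate="many_to_one")
```

Storing the two tables separately avoids repeating each galaxy's metadata on every one of its spaxel rows on disk.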

Incidental changes:

I am no longer going to save the fields __lzifu_ncomponents, __use_lzifu_fits or debug in the DataFrame output by load_df() as they are not really necessary.
Upon load, DataFrames will be sorted on "ID", "x (projected, arcsec)", "y (projected, arcsec)" to make it easier to parse contents.

New philosophy for reading/writing HDF files:

Writing:

All input parameters are saved as a pandas Series within the HDF file. Only a subset of the parameters is encoded in the filename.
Always append a timestamp to the filename.
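One possible way to build such a filename is sketched below. The naming scheme, the `.hd5` extension, the timestamp format and the helper `make_filename` are all assumptions made for this example, not the scheme used in spaxelsleuth.

```python
from datetime import datetime, timezone


def make_filename(prefix, filename_params, timestamp=None):
    """Build '<prefix>_<key>=<value>_..._<YYYYmmddTHHMMSS>.hd5' (hypothetical scheme)."""
    if timestamp is None:
        timestamp = datetime.now(timezone.utc)
    parts = [prefix] + [f"{key}={value}" for key, value in filename_params.items()]
    # The timestamp always goes last so that files with the same parameters stay distinct.
    parts.append(timestamp.strftime("%Y%m%dT%H%M%S"))
    return "_".join(parts) + ".hd5"


# Only a subset of the parameters goes into the name; the full set lives in the file.
fname = make_filename("sami", {"eline_SNR_min": 5}, datetime(2024, 2, 1, 12, 0, 0))
```

With a fixed-width timestamp at the end, files that share the same parameters sort chronologically by name.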

Reading:

If the user doesn't specify an input arg, the loading code assumes that the user "doesn't care" about that particular parameter and will not check for that keyword when trying to locate the correct DataFrame.
If multiple "valid" DataFrames are found, allow the user to interactively select which one they want, printing the parameters of each (including the timestamp) to the screen.
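The reading rules above could be sketched roughly as follows. The functions `find_matching` and `choose`, the filenames and the `timestamp` key are hypothetical. The sketch also assumes the stored parameters of every candidate file have already been read into a dict.

```python
def find_matching(candidates, **requested):
    """Return the filenames whose stored parameters agree with every requested value.

    candidates maps filename -> dict of stored parameters. Parameters the user
    did not pass are simply not checked ("doesn't care").
    """
    return sorted(
        fname
        for fname, params in candidates.items()
        if all(key in params and params[key] == value for key, value in requested.items())
    )


def choose(matches, candidates, ask=input):
    """Return the single match, or let the user pick one when there are several."""
    if not matches:
        raise FileNotFoundError("No stored DataFrame matches the requested parameters")
    if len(matches) == 1:
        return matches[0]
    for index, fname in enumerate(matches):
        # Print the stored parameters (including the timestamp) for each candidate.
        print(f"[{index}] {fname}: {candidates[fname]}")
    return matches[int(ask("Select a file by number: "))]


candidates = {
    "a.hd5": {"eline_SNR_min": 5, "timestamp": "20240201T120000"},
    "b.hd5": {"eline_SNR_min": 5, "timestamp": "20240202T090000"},
    "c.hd5": {"eline_SNR_min": 3, "timestamp": "20240203T090000"},
}
matches = find_matching(candidates, eline_SNR_min=5)
```

Passing the prompt function as `ask` keeps the interactive step easy to replace in unit tests.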

New tasks (do this all in the testing environment to avoid accidentally overwriting stuff we might need):

[x] Write a separate function for locating all matching HDF files.
[x] Add timestamps to all DataFrames on save.
[x] ~~Implement code in make_df to overwrite existing DataFrames with precisely matching parameters.~~
[x] Implement code in load_df to interactively list parameters for all matching files in case of multiple hits.
[x] Write unit test(s) for this new functionality.
[x] Tidy up implementation of ss_params in make_df().
[x] Make sure integration/regression tests still work.
[x] Before regression testing: overwrite reference DataFrames.
[x] Make tests work on GitHub.

hzovaro / spaxelsleuth

Investigate ways to make DataFrame storage more efficient #14