hzovaro / spaxelsleuth

A package for analysing data from large integral field unit surveys such as the SAMI and Hector Galaxy Surveys.
MIT License

Investigate ways to make DataFrame storage more efficient #14

Closed hzovaro closed 9 months ago

hzovaro commented 1 year ago

E.g. -

hzovaro commented 1 year ago

Try:

hzovaro commented 1 year ago

A good solution may be to store the data using

import pandas as pd

with pd.HDFStore(fname) as store:
    store.put("data", df_data)
    # Attach the metadata to the stored node so it stays paired with the data
    store.get_storer("data").attrs.metadata = metadata

and read it back in using

with pd.HDFStore(fname) as store:
    df_data = store["data"]
    metadata = store.get_storer("data").attrs.metadata

With this, the metadata is stored on disk alongside the data. That is much better than assuming the metadata (e.g. settings) at runtime and then manually adding it back in. We can still add the metadata as columns at runtime, as we have been doing up until now.

hzovaro commented 1 year ago

To do:

hzovaro commented 9 months ago

Updated to-do list Feb 2024:

Incidental changes:

  1. I am no longer going to save the fields __lzifu_ncomponents, __use_lzifu_fits or debug in the DataFrame output by load_df() as they are not really necessary.
  2. Upon load, DataFrames will be sorted on "ID", "x (projected, arcsec)", "y (projected, arcsec)" to make it easier to parse contents.
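
The sort in item 2 could look something like the sketch below. The three column names are the ones listed above; the helper name `sort_on_load` and the toy DataFrame are made up for illustration, and the real `load_df()` may well do this differently.

```python
import pandas as pd

SORT_COLUMNS = ["ID", "x (projected, arcsec)", "y (projected, arcsec)"]


def sort_on_load(df):
    """Return a copy of df sorted by ID, then x, then y (hypothetical helper)."""
    return df.sort_values(by=SORT_COLUMNS).reset_index(drop=True)


# Toy data for illustration only, not real survey values
df = pd.DataFrame({
    "ID": [2, 1, 1, 1],
    "x (projected, arcsec)": [0.0, 0.5, 0.0, 0.0],
    "y (projected, arcsec)": [0.0, 0.0, 0.5, 0.0],
})
df_sorted = sort_on_load(df)
```

Resetting the index is a choice made here so that row order and index agree; if the original index carries meaning it should be kept instead.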

New philosophy for reading/writing HDF files:

Writing:

  1. All input parameters are saved in the form of a pandas Series within the HDF file. Only a subset of the parameters is saved in the filename.
  2. Always append a timestamp to the filename.
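
A rough sketch of the writing side, assuming the DataFrame goes under the key "data" as in the earlier comment and the parameters go under a separate key. The function names, the key "params", the filename layout, the ".hd5" extension and the parameter names in the usage lines are all placeholders rather than the package's actual API. Writing to HDF needs PyTables installed.

```python
from datetime import datetime

import pandas as pd


def make_filename(prefix, filename_params, timestamp=None):
    """Build '<prefix>_<key>=<value>_..._<timestamp>.hd5' (hypothetical layout)."""
    timestamp = timestamp or datetime.now()
    parts = [prefix] + [f"{k}={v}" for k, v in filename_params.items()]
    parts.append(timestamp.strftime("%Y%m%d%H%M%S"))
    return "_".join(parts) + ".hd5"


def write_df(df, params, fname):
    """Store the DataFrame and all input parameters in one HDF file."""
    with pd.HDFStore(fname, mode="w") as store:
        store.put("data", df)
        # Mixed-type parameters go into an object-dtype Series;
        # pandas may warn that such values are pickled by PyTables.
        store.put("params", pd.Series(params, dtype=object))


# Hypothetical parameters: only a subset goes into the filename
params = {"bin_type": "default", "ncomponents": "recom", "correct_extinction": True}
fname = make_filename(
    "sami", {"bin_type": params["bin_type"]}, datetime(2024, 2, 1, 12, 0, 0)
)
```

Keeping the full parameter set inside the file means the filename only has to be readable, not complete.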

Reading:

  1. If a user doesn't specify an input arg, the loading code assumes that the user "doesn't care" about that particular parameter and will not check for that keyword when trying to locate the correct DataFrame.
  2. If multiple "valid" DataFrames are found, allow the user to interactively select which one they want: print the parameters of each (inc. timestamp) to the screen.
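
The reading rules above could be sketched roughly as follows, assuming each candidate file's stored parameters have already been read into a dict and that an unspecified arg is represented by `None`. The function names, the `choose` hook and the example parameters are hypothetical.

```python
def matches(stored_params, requested):
    """True if every requested (non-None) parameter equals the stored value."""
    return all(
        key in stored_params and stored_params[key] == value
        for key, value in requested.items()
        if value is not None  # None means the user "doesn't care"
    )


def select_candidate(candidates, requested, choose=input):
    """Pick one stored parameter set; ask the user if several are valid."""
    valid = [c for c in candidates if matches(c, requested)]
    if not valid:
        raise FileNotFoundError("No stored DataFrame matches the requested parameters")
    if len(valid) == 1:
        return valid[0]
    for i, params in enumerate(valid):
        print(f"[{i}] {params}")  # printed parameters include the timestamp
    return valid[int(choose("Select a DataFrame: "))]


# Made-up parameter sets standing in for what would be read from each file
candidates = [
    {"bin_type": "default", "ncomponents": "recom", "timestamp": "20240201120000"},
    {"bin_type": "default", "ncomponents": "1", "timestamp": "20240202090000"},
    {"bin_type": "adaptive", "ncomponents": "recom", "timestamp": "20240203150000"},
]
picked = select_candidate(candidates, {"bin_type": "adaptive", "ncomponents": None})
```

Passing `choose` as an argument is only there so the interactive step can be replaced in tests; by default it falls back to `input`.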

New tasks (do this all in the testing environment to avoid accidentally overwriting stuff we might need):