larsgeb / hmclab

BSD 3-Clause "New" or "Revised" License

Example discussion: How should we store samples during sampling? #1

Closed larsgeb closed 5 years ago

larsgeb commented 5 years ago

This is an example of how I think we could use GitHub to keep our email inboxes empty.

Should we write out samples sequentially or all at the same time?

  1. Writing out sequentially requires us to write to a text file (expensive, slow).
  2. Writing samples out at the end of sampling requires us to keep all samples in RAM, which is infeasible for a high-dimensional model space or a large number of samples. It does, however, allow for the use of the .npy file format.
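To put a rough number on the RAM concern in option 2, here is a back-of-the-envelope sketch. It assumes 8-byte floats and a dense array of shape (samples, dimensions); the function name and the problem sizes are made up for illustration and are not taken from the code base.

```python
def samples_memory_gb(n_samples, n_dim, bytes_per_value=8):
    """Approximate memory (in GB) of a dense float array holding all samples."""
    return n_samples * n_dim * bytes_per_value / 1e9


# Hypothetical problem sizes, chosen only to show the scaling.
small = samples_memory_gb(n_samples=10_000, n_dim=100)
large = samples_memory_gb(n_samples=1_000_000, n_dim=10_000)

print(f"small problem: {small} GB")
print(f"large problem: {large} GB")
```

The small case fits in memory trivially (about 0.008 GB), while the large case needs about 80 GB, which is where keeping everything in RAM stops being an option.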

Do we even want to store all samples, or should we thin on-line (keep only every n-th sample)?

I propose writing text files during sampling and converting them to .npy at the end.
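A minimal sketch of what I have in mind, assuming NumPy. The helper names `append_sample` and `convert_to_npy` are hypothetical, and random draws stand in for the actual sampler:

```python
import os
import tempfile

import numpy as np


def append_sample(path, sample):
    """Append one sample as a single row of a plain-text file."""
    with open(path, "a") as f:
        np.savetxt(f, np.atleast_2d(sample))


def convert_to_npy(text_path, npy_path):
    """Load the whole text file and store it as a binary .npy array."""
    samples = np.loadtxt(text_path, ndmin=2)
    np.save(npy_path, samples)
    return samples.shape


# Stand-in for a sampler: 100 random draws from a 5-dimensional model space.
rng = np.random.default_rng(0)
workdir = tempfile.mkdtemp()
text_path = os.path.join(workdir, "samples.txt")
npy_path = os.path.join(workdir, "samples.npy")

for _ in range(100):
    append_sample(text_path, rng.normal(size=5))

shape = convert_to_npy(text_path, npy_path)
print(shape)
```

Note that the conversion step written this way loads every sample into memory at once, so it runs into the same RAM limit as option 2 for very large runs.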

inverseproblem commented 5 years ago

I'd like to think fairly big ;), i.e., to make this code suitable for high-dimensional problems, at least to some extent. So I'd suggest a third strategy: writing binary files sequentially.

This can be achieved, for instance, using the HDF5 format, which I've been using extensively. The h5py module is easy to use and works pretty well, and would allow us to save binary files with a filesystem-like structure inside: each sample is a "dataset" inside a single "big" file, along with all the other information we want to save, for instance, acceptance rate, potential and kinetic energy values, etc. We would end up with a single file containing everything, but where it's easy to access a specific piece of information without loading the entire file.
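Roughly, the layout could look like the sketch below. The group name `samples`, the zero-padded dataset names, and the attribute names are assumptions for illustration; the random draws, the quadratic "potential energy", and the acceptance count are placeholders for whatever the sampler actually produces.

```python
import os
import tempfile

import h5py
import numpy as np

n_dim = 5
n_samples = 100
rng = np.random.default_rng(0)
h5_path = os.path.join(tempfile.mkdtemp(), "samples.h5")

# Write each sample as its own dataset while "sampling" is running.
with h5py.File(h5_path, "w") as f:
    group = f.create_group("samples")
    accepted = 0
    for i in range(n_samples):
        sample = rng.normal(size=n_dim)  # placeholder for a real HMC sample
        dset = group.create_dataset(f"{i:06d}", data=sample)
        # Per-sample diagnostics stored as attributes of the dataset.
        dset.attrs["potential_energy"] = 0.5 * float(sample @ sample)
        accepted += 1  # placeholder: every proposal counts as accepted
    # Run-level information stored as attributes of the file itself.
    f.attrs["acceptance_rate"] = accepted / n_samples

# Read back a single sample without loading the rest of the file.
with h5py.File(h5_path, "r") as f:
    one_sample = f["samples/000042"][...]
    acceptance_rate = f.attrs["acceptance_rate"]
    n_stored = len(f["samples"])

print(one_sample.shape, acceptance_rate, n_stored)
```

A single resizable dataset of shape (samples, dimensions) would be another way to organise the same file, at the cost of less natural per-sample metadata.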

Usually I "thin" online a little bit, both to save space and for speed (I/O is slow...). Maybe we could have a user-defined parameter "save every N iterations"?
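As a sketch of how such a parameter might behave (the name `save_every` and the helper below are hypothetical, not existing options in the code):

```python
def thinned_iterations(n_iterations, save_every):
    """Return the iteration indices whose samples would be written to disk."""
    if save_every < 1:
        raise ValueError("save_every must be at least 1")
    return [i for i in range(n_iterations) if i % save_every == 0]


# With 1000 iterations and save_every=10, only every tenth sample is kept.
kept = thinned_iterations(n_iterations=1000, save_every=10)
print(len(kept), kept[:3])
```

Setting `save_every=1` keeps every sample, so the default behaviour would not need a separate code path.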

larsgeb commented 5 years ago

Both the file format and the tuning option seem like great ideas!