usnistgov / mass

Microcalorimeter pulse-analysis software
MIT License

HDF5 file needs to be less fragile #64

Closed joefowler closed 4 years ago

joefowler commented 8 years ago

Original report by Joseph Fowler (Bitbucket: joe_fowler).


Picking up on discussion thread started in issue #62: we need to ensure that our MASS HDF5 scratch space and stored-analysis file is a lot more robust.

To answer the question by @oneilg: I'm using the mass.TESGroupHDF5("filename.hdf5") constructor for your new object, using the latest code in the develop branch.

Yes, this is an enormous--possibly fatal--hit to usability. I'm also unsure what it means, though. I am not actively adding to the HDF5 file when these problems happen. I'm doing a fit that fails for reasons I cannot yet figure out, but it fails by using all allowed system memory. I assume it's the fit's fault, but I don't know. I suppose it could be the fault of the h5py library. I cannot Ctrl-C out of it, but have to do a hard kill on the python process. After that kill, the HDF5 is corrupted.

Here's an idea. Don't know how useful it is. Suppose the mass objects TESGroup and TESGroupHDF5 open the HDF5 file as read-only in normal operations. We use a special interface for updating operations:

#!python

import h5py

def update_dataset(self, dset, new_data):
    assert dset.name in self.hdf5_file   # or whatever verifies that dset is a dataset under self.hdf5_file
    fname = self.hdf5_file.filename
    dsetname = dset.name                 # h5py datasets know their full path inside the file
    self.hdf5_file.close()
    with h5py.File(fname, "a") as f:
        d = f[dsetname]
        d[:] = new_data[:]  # obviously requires that new and old data have the same type, length, etc.
    self.hdf5_file = h5py.File(fname, "r")

This could be elaborated on for operations that add new datasets, or resize existing ones, and so on. Of course, the initial setup of the HDF5 file would mean opening for append, creating all the needed groups and default dataset objects, closing, and reopening for read-only.
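For instance, creating a brand-new dataset could reuse the same close/write/reopen pattern. A rough sketch only; the helper and its arguments are made up, following the names in the snippet above:

#!python

import h5py

def create_dataset_protected(self, groupname, dsetname, data):
    """Hypothetical companion to update_dataset: add a dataset while the file
    is normally held open read-only."""
    fname = self.hdf5_file.filename
    self.hdf5_file.close()
    with h5py.File(fname, "a") as f:
        grp = f.require_group(groupname)   # create the group if it doesn't already exist
        if dsetname in grp:
            del grp[dsetname]              # replace any stale dataset of the same name
        grp.create_dataset(dsetname, data=data)
    self.hdf5_file = h5py.File(fname, "r")  # back to read-only for normal operations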

Related idea: create our own object ProtectedHDF5 that guards the underlying HDF5 file in a similar way. Thoughts? Let me add @jojo978 to this discussion, even though he is on vacation.
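Something like this, maybe (just a sketch; the class and method names are invented, not existing MASS code):

#!python

import contextlib
import h5py

class ProtectedHDF5:
    """Sketch of a guard object: the file stays open read-only except inside
    an explicit writable() block."""

    def __init__(self, filename):
        self.filename = filename
        self.file = h5py.File(filename, "r")

    def __getitem__(self, name):
        return self.file[name]

    @contextlib.contextmanager
    def writable(self):
        """Temporarily reopen for appending, then drop back to read-only."""
        self.file.close()
        f = h5py.File(self.filename, "a")
        try:
            yield f
        finally:
            f.close()
            self.file = h5py.File(self.filename, "r")

Usage would be something like: with guard.writable() as f: f["chan1/p_filt_value"][:] = new_values. Anything bound to the old read-only handle would then have to be re-fetched afterwards.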

joefowler commented 8 years ago

Original comment by Joseph Fowler (Bitbucket: joe_fowler).


See this interesting argument for a contrarian view on HDF5. The author comes shockingly close to saying that his solution is dirfiles without actually saying it. Probably he hasn't heard of it.

joefowler commented 8 years ago

Original comment by Charles James Titus (Bitbucket: cjtitus, GitHub: cjtitus).


Seconding this sentiment. This happens for us when iPython crashes (and I think sometimes semi-randomly), even if it was not doing an operation on the HDF5 file at the time. I tend to do a lot of work over SSH right now, because it's easier than moving data, so crashes happen whenever the SSH connection hiccoughs. We have a toggle in our scripts to just delete all the HDF5 files and start again, because of the frequency with which that is necessary. I wouldn't cry if you moved away from HDF5 files.

Following some of the threads in #62, I will also say that it's a bit silly how long it can take with forceNew=False just to go through the motions of checking if everything is filtered/calibrated/etc. Don't know if this is HDF5's fault. Just my two cents from the SSRL side.

Actually, probably the most important remark from us: we have so much data that eventually we would like to store it exclusively in its reduced form. That reduced form should also be the only thing that beamline users take with them. Right now, the reduced data is the HDF5 file. Currently, I would never ever delete the raw data and trust that it will be fine to store things in an HDF5 file, and I would never just hand a user an HDF5 file with their reduced data in it. It just gets corrupted with an alarming frequency, and biologists/chemists don't know what an HDF5 file is. This is something that we solve right now by essentially just binning our data in some useful way and writing that to disk in a text format, but it would be really really nice to have a stable format that we can hand to users, and use ourselves for long term storage.

joefowler commented 8 years ago

Original comment by Joseph Fowler (Bitbucket: joe_fowler).


Jamie: thanks very much for the input. I don't care if chemists know what an HDF5 file is, but I very much care if they (the files, not the chemists) are readily corrupted during normal analysis use!

I have no idea where to go with this. Our experience with HDF5 has been a combination of wonderful and bad. Our corruption problem seems to go like this: we open a file filled with useful stuff, and we want to alter what's there or add to it. If Python crashes, then the previously existing, useful stuff is ruined. HDF5 1.10 is excited to announce single-writer/multiple-reader (SWMR) capability, but that doesn't solve this problem: our problem is the one writer crashing.

Here are 4 ideas that I have, all of which totally suck for the reasons listed below. These are listed from bad to worse, in my opinion.

Idea 1: a few analysis passes, never going backwards

One approach might be to declare a very short set of analysis "passes". For example:

  1. Data summarization (e.g., rise time, or pulse RMS) and optimal filtering
  2. Data correction (drift, phase, and possibly time-drift)
  3. Calibration (peak fits, cal curves, conversion to energy)

Then the hard and fast rule can be that code in POPE ("Pass One Pulse Examination") treats the pass-1 HDF5 file as read/write but all earlier files as read-only. But PTUI ("Pass Two User Interface") treats the pass-1 file as read-only, and so on. This should act as a firewall to keep corruption from spreading too far backwards in the analysis. However, it has the annoying consequence that, say, if you're doing Pass 3, then you don't have access to pass 1 results. This would mean having an object to represent an analysis in pass 1, another to represent pass 2, and another for pass 3. And if you have all three of them open, then you are still susceptible to the same Achilles heel of corruption. I think this could be made to work, but it would be complicated and would only reduce the amount of data corruption, not eliminate it.

Idea 2: wrap the native h5py.File for protection

Above, I mentioned a different idea, for a ProtectedHDF5 object, but I think that's no good. The problem is that we have thousands of HDF5 data sets already bound to python names like data.channel[315].p_filt_value. If I want to alter any of them, then it's too ponderous to close the read-only file, open it for writing, write the alterations, close it, and open again as read-only. Well, that part might not be so bad, but what about the 999 other data sets? They are now invalid references to objects inside a file that's been closed. And no, you can't separately open a file as read-write if it's already open as read-only (I just tried).

Idea 3: use np.ndarray internally and only store into HDF5 as needed

We could stop having references to HDF5 data sets in our massive MASS objects. Instead, they go back to being plain np.ndarray objects. This is how MASS was a couple of years ago. We could use a special call like data.stash() to store everything as HDF5 data sets. I'm afraid that this is totally non-scalable, because now HDF5 would no longer be able to handle the lovely, transparent work of freeing up memory and mapping memory to disk as needed.
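For what it's worth, the stash() call could be as simple as this sketch (the attribute names are invented stand-ins, not real MASS code):

#!python

import h5py
import numpy as np

class ChannelData:
    def __init__(self, hdf5_filename, channum):
        self.hdf5_filename = hdf5_filename
        self.channum = channum
        self.p_filt_value = np.zeros(0)   # ordinary in-memory arrays
        self.p_pulse_rms = np.zeros(0)

    def stash(self):
        """Write all in-memory arrays to the HDF5 file in one short-lived open."""
        with h5py.File(self.hdf5_filename, "a") as f:
            grp = f.require_group("chan%d" % self.channum)
            for name in ("p_filt_value", "p_pulse_rms"):
                if name in grp:
                    del grp[name]
                grp.create_dataset(name, data=getattr(self, name))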

Idea 4: use getdata in place of HDF5

I love the simplicity of the dirfile format and the getdata library used in CMB work. It has the same problem as the previous idea: that it doesn't do all the sweet memory management that the HDF5 library does. Also, it makes a crap-load of files and has the core assumption that all fields are filled at a multiple of some "heartbeat". So while it's easy to store the pulse height and rise time for a channel, because there are N of each field, it's a challenge to also store the calibration curves in the same hierarchy, because they might have only 8 entries and 8 != N.

joefowler commented 8 years ago

Original comment by Joseph Fowler (Bitbucket: joe_fowler).


Actually, a variation on Ideas 1+2, above, might be workable.

Suppose we treat the TESGroup as having an immutable, underlying read-only HDF5 file. That keeps the memory-management advantages we have now: our huge data sets stay in the file and are read into memory only as needed.

But the data sets would not be true h5py.Dataset objects. Instead, they would be a subclass that we write. If we alter the contents, they get marked "dirty" somehow. At various points (after looping some analysis step over all TESs, say), we would issue a command to flush all the dirty Dataset objects. This would not be super fast, as it involves closing the file, opening it for writing, updating the data, closing it again, and reopening read-only. Then we'd have to hook up every Dataset object with the right data set in the reopened file.
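Roughly like this, say (just a sketch: the wrapper here holds an in-memory copy rather than being a true h5py.Dataset subclass, and every name is invented):

#!python

import h5py

class DirtyDataset:
    """Thin stand-in for an h5py.Dataset: an in-memory copy plus a dirty flag."""

    def __init__(self, hdf5_path, data):
        self.hdf5_path = hdf5_path   # full path of the dataset inside the file
        self.data = data[:]          # in-memory copy of the current contents
        self.dirty = False

    def __getitem__(self, index):
        return self.data[index]

    def __setitem__(self, index, value):
        self.data[index] = value
        self.dirty = True

def flush_dirty(group):
    """Write every dirty dataset back, then return to read-only.  `group` is
    assumed to have .hdf5_file (an open h5py.File) and .datasets (a list of
    DirtyDataset objects)."""
    fname = group.hdf5_file.filename
    group.hdf5_file.close()
    with h5py.File(fname, "a") as f:
        for ds in group.datasets:
            if ds.dirty:
                f[ds.hdf5_path][:] = ds.data
                ds.dirty = False
    group.hdf5_file = h5py.File(fname, "r")   # re-hook readers to the reopened file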

joefowler commented 8 years ago

Original comment by Charles James Titus (Bitbucket: cjtitus, GitHub: cjtitus).


I agree, all these ideas suck for various reasons.

Maybe I should make one more comment about the SSRL needs, so that you can decide whether or not to take our opinions into consideration. Corruption is annoying, but not a gigantic problem while we're getting from step 1 to step 3. As long as we have the data, we can always redo things; it just takes re-running the script. Our problem is that after we have gotten to step 3, we need to use that data to do science, and there it must be eternal and incorruptible, so that we can declare ourselves done with filtering/cuts/calibration and focus on analysis.

Once we have deleted the original data, what we really need is read-only data that is compact, doesn't include extraneous information, and will never corrupt.

Looking at all these various sucky options, I have started to consider that maybe our SSRL needs in this regard could be decoupled - we could keep the HDF5 for analysis, and then once we are done with step 3 we transfer all the data into our own read-only archival format. If that would make sense, feel free to disregard all of my needs/complaints, except of course for the general observation that HDF5 corrupts often, and that this sucks for everyone.

joefowler commented 8 years ago

Original comment by Galen O'Neil (Bitbucket: oneilg, GitHub: oneilg).


Jamie,

My impression is that the HDF5 format is a good choice for your final file for users, independent of the problems we have with it for MASS. The corruption of HDF5 files results from holding references to open files (probably with pending writes) and then letting python crash. As long as your users just treat it as read-only, they shouldn't have any problems. You probably want to make a new "final user file" after your analysis which is independent of the mass HDF5 file and contains less info (e.g., no p_filt_phase). You should probably keep an independent copy as well.
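Producing that user file could be as simple as something like this (a sketch only; the group and dataset names are invented, not the actual MASS layout):

#!python

import h5py

def export_user_file(mass_filename, user_filename, keep=("p_energy", "p_timestamp")):
    """Copy only the end-user quantities out of the big MASS file into a fresh,
    much smaller HDF5 file that can be handed out read-only."""
    with h5py.File(mass_filename, "r") as src, h5py.File(user_filename, "w") as dst:
        for channame, changroup in src.items():
            outgroup = dst.create_group(channame)
            for field in keep:
                if field in changroup:
                    changroup.copy(field, outgroup)   # Group.copy duplicates the dataset and its attributes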

FWIW I think atmospheric chemists are familiar with HDF5.

Joe, of your ideas, I think 1, 2, and 4 seem fairly reasonable. As you say, 3 is not viable given the memory use.

It appears we are not the first people to run into problems with HDF5; this was the largest discussion I found by googling "hdf5 corruption": https://news.ycombinator.com/item?id=10858189

joefowler commented 8 years ago

Original comment by Galen O'Neil (Bitbucket: oneilg, GitHub: oneilg).


Given that I have a viable POPE in Mass2.jl, maybe we should try pushing in that direction and see how it feels?

joefowler commented 8 years ago

Original comment by Galen O'Neil (Bitbucket: oneilg, GitHub: oneilg).


YANFF (shouldn't be too hard to figure out) http://www.sciencedirect.com/science/article/pii/S2213133715000645

This sounds a lot like dirfiles, and like what I would have made up if I were replacing LJH files. It's basically YAML plus binary data, plus references and a few more features. The binary data can be in the same file or in separate files. It appears to support transformations, though I'm not totally sure (so p_filt_value_dc could just be a transformation of p_filt_value and other values, or calibrations could perhaps be represented as polynomial transformations, reducing file size).
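For a feel of it, storing a couple of fields with the asdf Python package would look roughly like this (I haven't tried it, and the field names are made up):

#!python

import asdf
import numpy as np

# The "tree" is ordinary YAML-able structure; large arrays land in binary blocks.
tree = {
    "p_filt_value": np.random.standard_normal(1000),
    "calibration": {"poly_coeffs": [0.0, 1.7, -2.3e-5]},  # small metadata stays in the YAML header
}
asdf.AsdfFile(tree).write_to("example.asdf")

with asdf.open("example.asdf") as af:
    print(af.tree["calibration"]["poly_coeffs"])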

I'm not sure we should jump on it, as it's pretty new and seems to only have one implementation (though HDF5 is old with only one). But it seemed worth sharing.

joefowler commented 8 years ago

Original comment by Joseph Fowler (Bitbucket: joe_fowler).


Jamie: I agree 100% with Galen. If you wish to create a data product that will be used in read-only mode by end users, I think HDF5 is a great choice. I agree that you'll want to create some new "final user file" that might have nothing at all to do with the MASS file and would surely contain far less information (all the intermediate computation results are jettisoned). HDF5 offers a bunch of nice features that are probably useful for you and your end users.

Our bad experience, as Galen says, consists wholly of problems when an HDF5 file is open for writing/appending and the entire Python process crashes. That would not be an issue for the use case you're describing.

I won't comment on the ASDF format that Galen pointed out in the previous comment until I can read the paper fully, except to say that it's exciting! I'm a former astrophysicist, and I know that FITS (the thing that ASDF replaces) is a few decades overdue for a complete replacement.

joefowler commented 8 years ago

Original comment by Charles James Titus (Bitbucket: cjtitus, GitHub: cjtitus).


It sounds like a read-only HDF5 would be the correct option for Sang Jun and me. It looks like MASS always opens HDF5 with append, so we'll write our own code to dump the useful data from NIST's HDF5 files into our own HDF5 which is packaged for end users, and decouple the science analysis from MASS. This is probably the best thing to do in any case.

Thanks for the advice.

joefowler commented 8 years ago

Original comment by Joseph Fowler (Bitbucket: joe_fowler).


Galen: the article you linked to about the discussion of HDF5 corruption is a good one. I know, because I found it and pointed it out already in this thread's first comment as "this interesting argument".

The new astrophysics format (ASDF) sounds great. I read the paper carefully and was quite impressed. People thinking of designing a new data format from scratch should be required to read that paper! Like you say, I think it is especially appealing if/when we decide that LJH has to go. However, it doesn't seem to be designed exactly as a "scratch pad" the way we are now trying to use HDF5, but it might work well for that purpose, too, for all I know.

joefowler commented 6 years ago

Original comment by Charles James Titus (Bitbucket: cjtitus, GitHub: cjtitus).


Not sure if you guys still hate HDF5, but I've been getting more aggressive about trying to stop file corruption. A friend who does long-running MCMCs mentioned that the emcee python package uses h5py to store chains to disk, which involves tons and tons of writes. Apparently there's never any corruption. I checked out the emcee h5py backend code, https://github.com/dfm/emcee/blob/master/emcee/backends/hdf.py to see what they're doing, and it turns out to be pretty simple. They never keep the hdf5 file open, and instead aggressively use python's "with" statement. For example, here is the code that writes one step to disk:

#!python

    def save_step(self, coords, log_prob, blobs, accepted, random_state):
        """Save a step to the file
        Args:
            coords (ndarray): The coordinates of the walkers in the ensemble.
            log_prob (ndarray): The log probability for each walker.
            blobs (ndarray or None): The blobs for each walker or ``None`` if
                there are no blobs.
            accepted (ndarray): An array of boolean flags indicating whether
                or not the proposal for each walker was accepted.
            random_state: The current state of the random number generator.
        """
        self._check(coords, log_prob, blobs, accepted)

        with self.open("a") as f:
            g = f[self.name]
            iteration = g.attrs["iteration"]

            g["chain"][iteration, :, :] = coords
            g["log_prob"][iteration, :] = log_prob
            if blobs is not None:
                g["blobs"][iteration, :] = blobs
            g["accepted"][:] += accepted

            for i, v in enumerate(random_state):
                g.attrs["random_state_{0}".format(i)] = v

            g.attrs["iteration"] = iteration + 1

Their "open" method is just

#!python

    def open(self, mode="r"):
        if self.read_only and mode != "r":
            raise RuntimeError("The backend has been loaded in read-only "
                               "mode. Set `read_only = False` to make "
                               "changes.")
        return h5py.File(self.filename, mode)

As you can see, they just open the file anew every time they want to read/write to it. This is way way safer, and the overhead on opening an HDF5 file is pretty low, because an open doesn't actually have to read much into memory (probably just the overall file header?).

I think this could probably be done for MASS. It would be a lot of work, but still a lot less work than switching away from HDF5, and anecdotally this approach works pretty well at preventing all forms of corruption.
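For concreteness, the pattern applied to a MASS-like object might look roughly like this (every name here is invented; the real change would obviously touch many more places):

#!python

import contextlib
import h5py

class ChannelStore:
    """Sketch: no h5py.File handle ever outlives a single read or write."""

    def __init__(self, filename, read_only=False):
        self.filename = filename
        self.read_only = read_only

    @contextlib.contextmanager
    def open(self, mode="r"):
        if self.read_only and mode != "r":
            raise RuntimeError("store was loaded read-only")
        f = h5py.File(self.filename, mode)
        try:
            yield f
        finally:
            f.close()

    def write_field(self, channum, name, values):
        with self.open("a") as f:
            g = f.require_group("chan%d" % channum)
            if name in g:
                del g[name]               # a real version might update in place instead
            g.create_dataset(name, data=values)

    def read_field(self, channum, name):
        with self.open("r") as f:
            return f["chan%d/%s" % (channum, name)][:]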

Just something to consider, if you are still thinking about ways to use HDF5 more stably.

joefowler commented 4 years ago

Original comment by Galen O'Neil (Bitbucket: oneilg, GitHub: oneilg).


I think we can close this, as new work is aimed primarily at OFF files, which should be less fragile because we don't use them as scratch space.

joefowler commented 4 years ago

Original comment by Joseph Fowler (Bitbucket: joe_fowler).


I agree with Galen. This has rarely been an issue for us in Boulder, and the OFF analysis procedure should make it come up even less (or not at all).