I guess this is done by now, but I would be in favour of the robust solution of just using unfinished files until the last complete event, if that is an easy fix.
I am not aware of the implementation details, but

> sometimes the premature ending of the simulation might lead to an output that does not bring the reading program to a crash, but it returns wrong data

is IMHO a scenario to be avoided! 👀 So I definitely agree with @elfnerhannah.
I do not know how an event is printed out or what qualifies it as complete, but this might be a label to look for at the beginning, or some property like the sum mentioned by @gabriele-inghirami, although I would try to avoid the latter, since it increases inter-dependency and complexity.
> I would be in favour of the robust solution of just using unfinished files until the last complete event, if that is an easy fix
The script `read_binary_output.py` reads blocks, not events, and it is used in a couple of analysis tests, `test/energy_scan/mult_and_spectra.py` and `test/densities/CalcDensityEvolution.py`, which should be applied only to correctly and fully executed runs. I don't know how many people use it; it's part of the analysis suite, not of the main SMASH repository. I personally never used it.
However:
@AxelKrypton
SMASH already has a clean system to signal when the output is complete: the output files carry the `.unfinished` suffix until all the events are run, and they are renamed just before the program exits. Especially on a cluster, SMASH might end prematurely because not enough time was allocated for the job or because of a node failure, leaving the user with unfinished output files that might still contain information about a lot of events (e.g. 998 out of 1000).

Here we are talking about a Python utility script that reads the binary output files. Before this fix, depending on how cleanly or dirtily SMASH was forced to end or crashed, the Python script might have crashed when attempting to read an unfinished binary file; now, if it finds a badly formatted block that it cannot interpret, it handles the exception and continues without losing the information already read. A block contains the data of the hadrons at a certain output time, and it is read by something like:
```python
npart = np.fromfile(bfile, dtype='i4', count=1)[0]   # number of particles in the block
particles = np.fromfile(bfile, dtype=particle_data_type, count=npart)
```
Now, if SMASH has been stopped by a SIGTERM signal, there should not be any problem with the data consistency within a block. The main issue I can think of is that one might have data for fewer events at late times (e.g. one might have 500 unfinished runs stopped at t = 12 fm because they ran out of time), and one should be a bit careful when normalizing by the number of events, but in most cases this number is so large that it should not matter much.

The situation might be more problematic in the case of a SIGKILL signal, because the writing of a block might be interrupted before it is finished, leading to an inconsistent data structure. This is the situation that Dima's fix corrects: if a data block is unreadable, it is simply discarded and the script continues instead of crashing, thus avoiding the loss of the good data read so far.

However, if we are dealing with a hardware failure, then I am not sure we can be confident that atomic operations, like writing a double to the hard drive, are correctly executed. I think this is a rather extreme and rare case: to reproduce it, I did not just interrupt SMASH, but also simulated a data corruption by manually deleting a few final bytes. Nevertheless, in this case the numpy routines did not detect the problem and just misinterpreted the data.
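For illustration, here is a minimal sketch of the kind of tolerant block reading the fix implements; `read_blocks_tolerant` is a hypothetical name, the block-type byte that precedes each block is ignored for brevity, and this is not the actual code of `read_binary_output.py`:

```python
import numpy as np

def read_blocks_tolerant(bfile, particle_data_type):
    """Collect particle blocks, stopping gracefully at the first broken one."""
    blocks = []
    while True:
        try:
            header = np.fromfile(bfile, dtype='i4', count=1)
            if header.size == 0:             # clean end of file
                break
            npart = header[0]
            if npart < 0:                    # garbage header: stop here
                break
            particles = np.fromfile(bfile, dtype=particle_data_type, count=npart)
            if particles.size < npart:       # truncated block: discard it
                break
            blocks.append(particles)
        except (ValueError, IndexError):     # any other malformed-block error:
            break                            # stop and keep what we have
    return blocks
```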
Now, the binary output files have a final block, marked by `'f'`: one might consider using it as an additional consistency check to accept a block as valid, instead of just reading it as another one-line block type (see https://theory.gsi.de/~smash/userguide/2.1/format_binary_.html). The current code:
```python
elif (block_type == 'f'):  # got file end block
    n_event = np.fromfile(bfile, dtype='i4', count=1)[0]
    return {'type': block_type,
            'nevent': n_event}
```
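A rough sketch of such a check could look like the following; `is_end_block_valid` is a hypothetical helper, and the `'f'` block layout is assumed from the format page linked above:

```python
import numpy as np

def is_end_block_valid(bfile):
    """Accept the file only if the final 'f' block carries a readable event count."""
    n_event = np.fromfile(bfile, dtype='i4', count=1)
    return n_event.size == 1  # event counter present => block not truncated
```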
However, given the context we are dealing with (a secondary script that is used only by expert users or, for the analysis, only under safe conditions, i.e. finished runs), I don't think that this is an urgent or even a necessary improvement...
> The script `read_binary_output.py` reads blocks, not events, and it is used in a couple of analysis tests, `test/energy_scan/mult_and_spectra.py` and `test/densities/CalcDensityEvolution.py`, which should be applied only to correctly and fully executed runs. I don't know how many people use it; it's part of the analysis suite, not of the main SMASH repository. I personally never used it.
Just as a side comment: I basically always use it when analyzing the binary output. It's very convenient. Not sure about everyone else, though...
@gabriele-inghirami Thanks for taking the time to give me a more detailed explanation. To be honest, I am still a bit puzzled. If you manually delete a few bytes and the Python script reads the file without complaining and returns wrong data, I guess something ought to be fixed; it is definitely a suspicious situation to my taste. Now, you have a valid point that this happens in a scenario in which the user has given an unfinished file as input, so basically one could say that this is by default done at one's own risk. But if this is (as it seems) a common, if not even sensible, thing to do to avoid losing many usable events, then one should avoid the behaviour you found. If there is already a way to mark the end of the event (that line marked with `'f'`), one could read it, validate it (to be sure, e.g., that no field on the line is missing) and, if valid, declare the event read as valid and usable.
In any case, this is my personal opinion from the outside, basically based on the experience that there is only one way to succeed and definitely a lot more ways to fail... 😅
Just to second Anna here: I also use this script for almost all my analysis.
It often happens that SMASH produces unfinished runs, for example because the allotted time on the cluster ran out, or because SMASH crashed for some reason. The information from these unfinished runs is often very useful, so it makes sense for the read-out script to consume all the useful information and stop at the right moment without crashing.
This little modification allows the read-in script to read the input file up to the first broken block without crashing.
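For example, with the hypothetical `read_blocks_tolerant` sketch from the discussion above, reading an interrupted file would look roughly like this (the file name and `particle_data_type` are made up for illustration):

```python
# particle_data_type would be the numpy dtype describing one particle record.
with open('particles_binary.bin.unfinished', 'rb') as bfile:
    blocks = read_blocks_tolerant(bfile, particle_data_type)
print(f'Recovered {len(blocks)} intact blocks from the interrupted run')
```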
Signed-off-by: Dmytro Oliinychenko <doliinychenko@lbl.gov>