pace-neutrons / Horace

Horace is a suite of programs for the visualization and analysis of large datasets from time-of-flight neutron inelastic scattering spectrometers.
https://pace-neutrons.github.io/Horace/stable/
GNU General Public License v3.0

Resolve mix of float and double data in SQW #218

Closed · nickbattam-tessella closed this issue 4 years ago

nickbattam-tessella commented 4 years ago

1a) The SQW file stores pixel data as single-precision float.
1b) The SQW file stores the image and all other data as double precision.
2) The C++ code holds and processes pixel data in a float vector.
3) MATLAB reads this data into a double array, which doubles the size of the object in memory (see the fread sketch below).
4) Modelling and fitting codes use double-precision coefficients.
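As an illustration of point 3, a minimal sketch of how the read precision chosen in MATLAB determines the in-memory size of a pixel block; the file name and layout here are hypothetical, not the actual sqw reader code:

```matlab
% Write a small synthetic pixel block (9 values per pixel, single precision),
% then read it back two ways to compare memory use. Illustrative only.
npix = 1e5;
fid = fopen('pix_demo.bin', 'w');
fwrite(fid, rand(9, npix), 'float32');
fclose(fid);

fid = fopen('pix_demo.bin', 'r');
pix_dbl = fread(fid, [9, npix], 'float32=>double');  % fread's default: promote to double (8 bytes/value)
frewind(fid);
pix_sgl = fread(fid, [9, npix], 'float32=>single');  % keep single precision (4 bytes/value)
fclose(fid);

whos pix_dbl pix_sgl    % pix_dbl occupies twice the memory of pix_sgl
```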

Using float data reduces the memory and storage requirements and reflects the reality of machine resolution (~3 significant figures). It is possible, and will become increasingly likely, that data sets are too large to fit in memory. Custom paging is already in use to load only the needed pieces of the pixel array, so there is no need to hold the full array in memory.

Interfaces with external systems (Brille, Euphonic, SpinW) will have expectations about the data types we pass them, which we must meet (what are those expectations?).

Mixing float and double in a calculation silently truncates the double data to a single-precision float value (even if that reduces it to float infinity).
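A minimal MATLAB illustration of this behaviour (illustrative values only):

```matlab
x = single(2);
y = 1e40;          % double; well within double range but above realmax('single') (~3.4e38)

class(x + y)       % 'single': the double operand is demoted before the operation
x + y              % Inf, because 1e40 overflows single precision
single(1) + 0.1    % also single; the double 0.1 is rounded to the nearest single value
```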

Questions:

mducle commented 4 years ago

I think that storing the pixel and image data as single precision is enough. We don't often operate recursively on data that has already been operated on many times (I can imagine making a cut of a cut in memory, which had in turn been cut from a data file, i.e. two levels of operations, but I can't imagine making a cut of a cut of a cut of a cut, etc.), so round-off errors should not be that bad for us.

Modelling codes should probably use double precision, however. For example, eigensolver algorithms tend to operate many times on the same matrix elements as they iterate to a solution, so round-off errors are a real issue there. This is only required internally in those calculations; the output can be either double or single (it doesn't really matter once the eigenvalues have been obtained, because operations after that are not so sensitive to round-off errors), but it tends to be double since it is easier to keep everything as double. Still, there is no reason why Horace cannot just truncate it to single when it receives it.

abuts commented 4 years ago

Image data are double because summation often needs to be done in double to deliver single-precision accuracy in the result when adding many small values on the scale of a large one, and introducing double accumulators just to obtain a single-precision result would be too awkward in MATLAB. As images are not that big, they are currently double. We have not had a problem accessing image data, so it is stored as double. If large images become routine (e.g. for compatibility with NeXus applications), we may reconsider this.
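A small illustration of why the accumulator precision matters when adding many small values to a large one (illustrative numbers, not Horace code):

```matlab
acc_single = single(1e8);
acc_double = 1e8;
for k = 1:1e6
    acc_single = acc_single + single(0.5);   % 0.5 < eps(single(1e8))/2, so the increment is lost every time
    acc_double = acc_double + 0.5;
end
acc_single           % 1.0000e+08  (unchanged)
acc_double           % 1.0050e+08  (correct)
single(acc_double)   % accumulating in double and truncating at the end preserves the answer
```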

tgperring commented 4 years ago

Memory isn't really the point, because moving between float and double only makes a difference of a factor of two. The largest datasets already exceed typical memory availability by a factor significantly larger than that, even before allowing headroom to do even simple calculations. I think that we should operate in memory with double everywhere, because the opportunity for accumulating rounding errors in algorithms is always a concern: even if we exhaustively examined all current Horace methods and workflows and concluded that, at the moment, we would be OK with single precision, rounding would always loom as a problem for the future, either in our own developments or in users writing their own methods and functions. If memory is a problem, then we must deal with it by other means, e.g. file-backed operations.

Horace stores the pixel array as float because, at the time it was written, disk storage was simply too expensive; on the machines Horace was being run on in 2008, I/O was not a hugely dominant factor, and only became so as operations became multithreaded and were recast in C++. With the current setup of IDAaaS it looks like there is an I/O speed problem, of course. Storing as float has caused problems, however, in particular pixels moving into different bins. If it weren't for the current IDAaaS problems I would store the pixel array as double without hesitating, and even now that is my inclination: double precision everywhere, and we never have any consistency problems.
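To make the "pixels moving into different bins" point concrete, a sketch (hypothetical bin width and coordinate, not Horace's actual binning code) of how the single-precision rounding of a coordinate can push it across a bin boundary:

```matlab
w = 0.1;                  % hypothetical bin width
x = 0.3 - 1e-9;           % true (double) pixel coordinate, just below the edge at 0.3

bin_from_double = floor(x / w) + 1                    % 3
bin_from_single = floor(double(single(x)) / w) + 1    % 4: single(x) rounds up past the edge
```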

Some thoughts:

Just a couple of notes on integers:

mducle commented 4 years ago

Just to note, with regard to integers, that MATLAB supports unsigned 8-byte integers, uint64 (C++ uint64_t, C99 unsigned long long), which hold values up to 2^64 ≈ 1.8 × 10^19. That is a larger exact-integer range than a double offers (doubles are exact only up to 2^53), while taking the same 8 bytes per element, and integer arithmetic cannot pick up rounding errors.
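A quick check of those limits in MATLAB:

```matlab
flintmax('double')   % 9007199254740992 = 2^53: above this, not every integer is representable in a double
intmax('uint64')     % 18446744073709551615 = 2^64 - 1; same 8 bytes per element as a double
```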

tgperring commented 4 years ago

My understanding is that any operations on IEEE doubles which have been initialised as integers perform exact integer arithmetic so long as no intermediate result exceeds 2^53. There may be reasons why we would want to retain the pix array internally in pixelData as one array rather than several arrays. Overall, I think that 2^53, as opposed to 2^64, is good enough for the integers Horace encounters: an sqw object with a single bin containing 2^53 pixels would be a 650 petabyte sqw object in memory (roughly 2^53 pixels × 9 values per pixel × 8 bytes), or 325 petabyte on disk as single precision.
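A minimal check of the 2^53 limit in MATLAB:

```matlab
n = 2^53;          % 9007199254740992, stored exactly as a double
n == n + 1         % true: 2^53 + 1 cannot be represented, so the sum rounds back to 2^53
(n - 1) + 1 == n   % true: below 2^53 the integer arithmetic is exact
```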

In the case of npix,

nickbattam-tessella commented 4 years ago

Meeting notes captured in ADRs