Closed: nickbattam-tessella closed this issue 4 years ago
I think that storing the pixel and image data as single precision is enough. We don't often operate recursively on data which has already been operated on that much (I can imagine making a cut of a cut in memory, which had in turn been cut from a data file, i.e. two levels of operations, but I can't imagine making a cut of a cut of a cut of a cut, etc.), so round-off errors should not be that bad for us.
Modelling codes should probably use double precision, however. For example, eigensolver algorithms tend to operate many, many times on the same matrix elements as they iterate to a solution, so round-off errors are a real issue there. This is only required internally in those calculations; the output can be either double or single (it doesn't really matter once the eigenvalues have been obtained, because operations after that point are not so sensitive to round-off errors), but it tends to be double since it's easier to keep everything as double. Still, there is no reason why Horace cannot simply truncate it to single when it receives it.
Image data are double because summation often needs to be done in double to preserve single precision when adding many small values on the scale of a large one, and introducing double accumulators just to produce a single-precision result would be too awkward in Matlab. As images are not that big, they are currently double. We have not had a problem accessing image data, so we store it as double. If large images become routine (compatibility with NeXus applications), we may reconsider this.
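The accumulation effect described above is easy to reproduce. A minimal Python sketch (illustrative only, not Horace code; the `f32` helper simulates Matlab's `single` by rounding through a C float):

```python
import ctypes

def f32(x):
    """Round a Python float (an IEEE double) to single precision."""
    return ctypes.c_float(float(x)).value

# Add 100,000 small values (1e-4) onto a large base (1e4).
# Exact answer: 1e4 + 1e5 * 1e-4 = 10010.0
base, small, n = 1e4, 1e-4, 100_000

acc_single = f32(base)
for _ in range(n):
    acc_single = f32(acc_single + small)  # every partial sum rounded to single

acc_double = base
for _ in range(n):
    acc_double += small                   # accumulate in double ...
acc_double = f32(acc_double)              # ... round to single once at the end

print(acc_single)  # 10000.0: each small addend is swallowed by the large base
print(acc_double)  # 10010.0: the double accumulator preserves the sum
```

The single-precision spacing near 1e4 is about 1e-3, so each 1e-4 addend rounds away to nothing; the double accumulator keeps them all and only loses precision once, at the final truncation.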
Memory isn't really the point, because moving between float and double only makes a difference of a factor of two. The largest datasets already exceed typical memory availability by a factor significantly larger than that, even before allowing headroom for even simple calculations. I think we should operate in memory with double everywhere, because the opportunity for accumulating rounding errors in algorithms is always a concern. Even if we exhaustively examined all current Horace methods and workflows and concluded that we would be OK with single precision at the moment, rounding would always loom as a problem for the future, whether in our own developments or in users writing their own methods and functions. If memory is a problem then we must deal with it by other means, e.g. file-backed operations.
Horace stores the pixel array as float because at the time it was written disk storage was simply too expensive; on the machines Horace was being run on in 2008, I/O was not a hugely dominant factor, and it only became so as operations became multithreaded and were recast in C++. With the current setup of IDAaaS there clearly is an I/O speed problem. Storing as float has caused problems, however, in particular pixels moving into different bins. If it were not for the current IDAaaS problems I'd store the pixel array as double without hesitating, and even now that is my inclination: double precision everywhere, and we never have any consistency problems.
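The "pixels moving into different bins" problem arises when truncation to single precision carries a coordinate across a bin boundary. A hypothetical Python sketch (bin width and values invented for illustration):

```python
import ctypes, math

def f32(x):
    """Round a Python float (an IEEE double) to single precision."""
    return ctypes.c_float(float(x)).value

bin_width = 0.25
# A coordinate that in double precision sits just below the boundary at 2.0
q_double = 2.0 - 1e-9
q_single = f32(q_double)   # the nearest single-precision value is exactly 2.0

print(math.floor(q_double / bin_width))  # 7 -> bin [1.75, 2.0)
print(math.floor(q_single / bin_width))  # 8 -> bin [2.0, 2.25)
```

The same pixel lands in different bins depending on which precision the binning was done in, which is exactly the consistency problem double precision everywhere would remove.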
Some thoughts:
It may be that we can compress on the fly, as for a typical inelastic data set the number of non-zero pixels is an order of magnitude smaller. I've not resurrected my old prototype to see how far I got and whether I hit any show-stoppers on the way (it was ~6 years ago). This doesn't reduce the size of simulated datasets, where pixels will in general be non-zero.
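For reference, the compression idea might look something like the following toy Python sketch (hypothetical, not the old prototype): store only the non-zero signal entries together with their positions.

```python
def compress(signal):
    """Keep only the non-zero entries together with their positions."""
    return [(i, v) for i, v in enumerate(signal) if v != 0.0]

def decompress(pairs, n):
    """Rebuild the full signal array of length n from (index, value) pairs."""
    out = [0.0] * n
    for i, v in pairs:
        out[i] = v
    return out

sig = [0.0, 0.0, 3.5, 0.0, 1.2, 0.0, 0.0, 0.0]
packed = compress(sig)
print(packed)                               # [(2, 3.5), (4, 1.2)]
print(decompress(packed, len(sig)) == sig)  # True: lossless round trip
```

With ~90% of pixels zero this wins roughly an order of magnitude, but as noted it buys nothing for simulated datasets, where the pixels are dense.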
If we do have to stay with float for the pix array, then the problem of single precision is on the whole contained: the pixel coordinates are transformed into each other but not mixed with other quantities. The signal and error for each individual pixel are unchanged, so summing is not a process that accumulates rounding errors, so long as we go back to the original signal. (Note: as Alex says, manipulation of dnd objects can accumulate errors, but as these use double arrays this is generally not a problem.)
We can recompute the pixel coordinates from the run, detector and energy-bin indices (this is what calculate_qw_pixels2 does); Tobyfit uses it, for example, to get back to unsymmetrised coordinates. This recomputation is done in double. For consistency, we might have to do double => float => double, and make sure we use the same algorithm in calculate_qw_pixels2 as is used in gen_sqw.
Just a couple of notes on integers:
npix is stored as a double because the largest consecutive integer that can be stored in an IEEE single-precision float is 2^24 = 16.8M. Where bins are large this could easily be exceeded: a single nxspe file on MERLIN or LET has ~10^7 pixels, so stored as an sqw file with a single bin (the default, at least at one stage) it is on the limit of exceeding the maximum. A single bin for a current state-of-the-art sqw data file (> 10^10 pixels) exceeds the largest 4-byte integer (~2*10^9). An IEEE double holds integers up to 2^53 = 9*10^15, which gives five orders of magnitude of headroom over the current state-of-the-art file size.
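These limits can be checked directly. A Python sketch (using a C float round-trip to mimic single precision):

```python
import ctypes

def f32(x):
    """Round a Python float (an IEEE double) to single precision."""
    return ctypes.c_float(float(x)).value

# Largest consecutive integer in an IEEE single is 2^24 = 16777216.
print(f32(2**24) == 2**24)          # True
print(f32(2**24 + 1) == 2**24 + 1)  # False: 16777217 rounds back to 16777216

# An IEEE double represents every integer up to 2^53 exactly.
a = float(2**53)
print(a + 1 == a)                   # True: 2^53 + 1 is not representable
print(float(2**53 - 1) + 1 == a)    # True: arithmetic below 2^53 stays exact
```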
The detector index idet in the pix array has a looming problem for future instruments. CSPEC at the ESS will have a similar number of detectors to MERLIN, ~3*10^5, but I believe the design has something like 20 or 30 depth layers, giving ~10^7 detector elements. Not so far off our current limit of 1.68*10^7! This suggests at least splitting the pix array internally, or just going to double.
Just to note, with regard to integers, that Matlab supports unsigned 8-byte integers (uint64; C++ uint64_t, C99 unsigned long long), which hold numbers up to 2^64 = 1.8*10^19 and so have a larger range than double whilst taking the same memory (and cannot suffer rounding errors).
My understanding is that any operations on IEEE doubles which have been initialised as integers perform exact integer arithmetic so long as no intermediate result exceeds 2^53. There may be reasons why we would want to retain the pix array internally in pixelData as one array rather than several. Overall, I think that 2^53, as opposed to 2^64, is good enough for the integers Horace encounters: an sqw object with a single bin containing 2^53 pixels would be a 650-petabyte object in memory (325 petabytes on disk).
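The range comparison is easy to verify in Python (`ctypes.c_uint64` standing in for Matlab's `uint64`):

```python
import ctypes

n = 2**64 - 1                         # largest value a uint64 can hold
print(ctypes.c_uint64(n).value == n)  # True: uint64 stores it exactly
print(float(n) == n)                  # False: the nearest double is 2^64
```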
In the case of npix,
Meeting notes captured in ADRs
1a) The SQW file stores pixel data as single-precision float
1b) The SQW file stores image and all other data as double precision
2) The C++ code holds and processes pixel data in a float vector
3) MATLAB reads this data into a double array (which doubles the size of the object in memory)
4) Modelling and fitting codes use double-precision coefficients
Using float data reduces the memory and storage requirements and reflects the reality of machine resolution (~3 significant figures). It is possible, and will become increasingly likely, that data sets will be too large to fit in memory. Custom paging is already in use to load only the needed pieces of the data array (i.e. there is no need to hold the full array in memory).
Interfaces with external systems (Brille, Euphonic, SpinW) will have expectations on the data types, which we must meet (what are those expectations?).
Mixing float and double in a calculation will dynamically truncate the double data to a single-precision float value (even if that reduces it to float(infinity)).
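The float(infinity) case can be demonstrated with a Python sketch (rounding a double through a C float, mimicking the truncation of double to single; note that Python/NumPy themselves promote to double instead):

```python
import ctypes, math

big = 1e39   # comfortably within double range (max ~1.8e308),
             # but beyond single-precision range (max ~3.4e38)
as_single = ctypes.c_float(big).value

print(math.isinf(as_single))  # True: the value overflows to inf
```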
Questions: