momentoscope / hextof-processor

Code for preprocessing data from the HEXTOF instrument at FLASH, DESY in Hamburg (DE)
https://hextof-processor.readthedocs.io/en/latest/
GNU General Public License v3.0

possible improvement of laziness in createDataframePerElectron #65

Closed by balerion 3 years ago

balerion commented 3 years ago

In the electron dataframe creation, there is a line that causes kernel crashes for very large datasets. I suspect this is because the whole dataframe is computed instead of being kept lazy:

self.daListResult = dask.compute(*daList)
a = np.concatenate(self.daListResult, axis=1)
da = dask.array.from_array(a.T, chunks=self.CHUNK_SIZE)
cols = self.ddColNames
self.dd = dask.dataframe.from_array(da, columns=cols)

Is there a way to do the same using map_partitions or similar? Is there a reason why this is not kept lazy?

balerion commented 3 years ago

By the way, the line mentioned above is line 634 in processor/DldFlashDataframeCreator.py:

self.daListResult = dask.compute(*daList)

steinnymir commented 3 years ago

You are right, this should stay lazy until the end. These are all operations that dask should be able to handle, and we should switch to those. I will take a look at it when I am back home.

steinnymir commented 3 years ago

I looked into this and found it to be more complicated than expected to change. It would require computing at least the shape of each element in daList, and the easiest way to do that seems to be what is already implemented.
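For context on why the shapes are the sticking point: if the elements of daList come from delayed loaders, dask.array.from_delayed needs each block's shape declared up front, which is presumably the computation referred to above. A minimal sketch, where load and the block sizes are hypothetical:

```python
import numpy as np
import dask
import dask.array

@dask.delayed
def load(n):
    # Hypothetical loader returning one (2, n) block of channel data.
    return np.ones((2, n))

# from_delayed requires the shape of each block to be known in advance,
# so the per-element shapes must be computed (or read from metadata) first.
blocks = [
    dask.array.from_delayed(load(n), shape=(2, n), dtype=float)
    for n in (3, 4)
]
stacked = dask.array.concatenate(blocks, axis=1)
print(stacked.shape)  # (2, 7)
```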

In any case, the cost of computing here is small compared to the fact that we are already loading everything into memory. It might actually be faster to skip dask completely in this step.

Like many other issues, this one will disappear once we implement the new readout method, which should be backward compatible with the old data structure (if that is feasible...).

zain-sohail commented 3 years ago

Considering we are now using the new data structure, any updates on this, Steinn?

steinnymir commented 3 years ago

The new readout no longer suffers from these issues, and writing a new readout for the old data seems like a waste of time: even if not optimal, the old system works.

Should someone request it specifically, we can work on it and reopen this issue. Until then, we can close this.