Closed balerion closed 3 years ago
btw, the line mentioned above is 634 in processor/DldFlashDataframeCreator.py:
self.daListResult = dask.compute(*daList)
You are right, this should stay lazy till the end. These are all operations which dask should be able to handle, and we should switch to those. I will take a look at it when I am back home.
I looked into this and found it to be more complicated than expected to change. It would require computing at least the shape of each element in daList, and the easiest way seems to be the way it is already implemented.
Anyway, the cost of this compute is small compared to the fact that we are already loading everything into memory. It might actually be faster to skip dask completely in this step.
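To illustrate why the shapes come up: building a lazy dask array from delayed objects (e.g. with `dask.array.from_delayed`) requires each chunk's shape up front, which is exactly what we don't know without computing. A minimal sketch, with a made-up `load_chunk` standing in for reading one piece of detector data:

```python
import numpy as np
import dask
import dask.array as da

@dask.delayed
def load_chunk(n):
    # stand-in for lazily reading one chunk of detector data
    return np.arange(n, dtype=float)

# in the real code these per-chunk shapes are NOT known up front,
# which is why staying lazy here is harder than it looks
sizes = [3, 5, 2]
chunks = [da.from_delayed(load_chunk(n), shape=(n,), dtype=float)
          for n in sizes]

arr = da.concatenate(chunks)  # still lazy: nothing has been loaded yet
result = arr.compute()        # single compute at the very end
```

If the shapes were known (or cheap to get), this pattern would keep everything lazy until the final `compute()`; since they aren't, the current eager `dask.compute(*daList)` is hard to avoid.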
Like many other issues, this will no longer be there once we make the new readout method, which should be backward-compatible with the old data structure (if that is feasible...).
Considering we are now using the new data structure, any updates on this, Steinn?
The new readout no longer suffers from these issues, and making a new readout for the old data seems like a waste of time: even if not optimal, the old system works.
Should someone request it specifically, we can work on it and reopen this issue. Until then, we can close this.
In the electron dataframe creation, there is a line which causes kernel crashes for very large datasets. I suspect it is due to computing the whole dataframe instead of keeping it lazy:
Is there a way to do the same but using map_partitions or similar? Is there a reason why this is not kept lazy?