alternatives to stackEddy

lanl / NEONiso

R package for calibrating NEON atmospheric isotope data

https://lanl.github.io/NEONiso/

GNU General Public License v3.0


alternatives to stackEddy #86

Closed rfiorella closed 1 year ago

rfiorella commented 1 year ago

Performance of neonUtilities::stackEddy really suffers (at least for me!) when it is called repeatedly on large datasets, as in NEONiso.

I suspect this is due to robust error handling or difficulties in preallocating arrays in stackEddy. That overhead may not be necessary for the specific, well-defined use case in NEONiso if there are faster alternatives. It may also be a fundamental limitation of h5read, which is slow.

Two potential paths forward: 1) replace stackEddy with a faster (more dangerous?) method, or 2) find ways to optimize stackEddy when it is used on really large datasets.

Any suggestions @cklunch @ddurden @ndurden @cflorian?

cklunch commented 1 year ago

@rfiorella Thanks for the heads up about this! This is very timely for me. A couple of thoughts:

I'm currently working on a fairly broad update to neonUtilities (aiming for release in early July), so now is a great time to be thinking about improvements to stackEddy().
I'm also currently working on untangling an unrelated internal problem connected to use of high-efficiency data-stacking tools - you're very right that those methods can be dangerous! So I've been educating myself on the tradeoffs involved.

Part of what's tricky with stackEddy() is that it involves joining as well as stacking, but I can definitely take a look at where it might be made more efficient. I'll keep you posted.

rfiorella commented 1 year ago

Thanks @cklunch!

Not sure if this is useful or new information, but I did some memory profiling on a workflow where I noticed this issue, and it seems like the base::merge call is a good target here.
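For anyone who wants to check a join hotspot like this on their own machine, here is a minimal base R sketch. It is not the profiling run described above and not how stackEddy is implemented: the data are synthetic, the column names are only illustrative, and the `match()` lookup is a swapped-in alternative that is valid only when the join key is unique in the second table.

```r
# Synthetic stand-ins for two stacked variables sharing a timestamp key.
# Sizes and column names are assumptions made for illustration.
set.seed(42)
n <- 50000
times <- as.POSIXct("2020-01-01", tz = "UTC") + seq_len(n) * 1800
a <- data.frame(timeBgn = times, dlta13CCo2 = rnorm(n))
b <- data.frame(timeBgn = sample(times), rtioMoleDryCo2 = rnorm(n))

# Join 1: base::merge on the shared key (result is sorted by timeBgn).
t_merge <- system.time(
  merged <- merge(a, b, by = "timeBgn")
)

# Join 2: index lookup with match() on the numeric form of the key.
# This assumes every key appears exactly once in b.
t_match <- system.time({
  idx <- match(as.numeric(a$timeBgn), as.numeric(b$timeBgn))
  matched <- cbind(a, rtioMoleDryCo2 = b$rtioMoleDryCo2[idx])
})

# Sanity check: both joins should pair the same values with each timestamp.
stopifnot(
  nrow(merged) == n,
  nrow(matched) == n,
  isTRUE(all.equal(merged$rtioMoleDryCo2, matched$rtioMoleDryCo2))
)

# Elapsed seconds for each approach; results will vary by machine and data.
print(c(merge = t_merge[["elapsed"]], match = t_match[["elapsed"]]))
```

The equality check matters more than the timings here: a faster join is only worth considering if it reproduces what merge returns, which is the "more dangerous" tradeoff raised earlier in this thread.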

rfiorella commented 1 year ago

This issue will be resolved with the release of neonUtilities=2.3.0, though a few changes will need to be made to NEONiso functions to make use of the new capabilities (e.g., add use_fasttime arguments).

Files that need to be updated:

[x] restructure_data
[x] calibrate_water

Thanks @cklunch for working on stackEddy performance with me!