m2lines / gz21_ocean_momentum

Stochastic-Deep Learning Parameterization of Ocean Momentum Forcing
MIT License

Forcing generation scales memory usage unexpectedly #113

Open raehik opened 9 months ago

raehik commented 9 months ago

Forcing generation (see lib.data.compute_forcings_and_coarsen_cm2_6()) is done per time point, independently of any other time point. We operate on "lazy" Dask arrays, which download their backing data only when scheduled, and which can stream outputs out to file.

Since we don't need to hold forcings in memory after calculation (we can just write them to file), we should be able to increase --ntimes (the number of time points we compute forcings for) without largely impacting memory usage. But that doesn't appear to be the case. When testing with a single Dask worker, peak memory usage roughly doubled between --ntimes 50 and --ntimes 100.

Note that this "should" relies on Dask scheduling operations efficiently, which is not guaranteed. A user can guide the scheduler in a few ways. See #107, where this cropped up.
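To illustrate the memory profile we expect: with one chunk per time point and outputs streamed to disk, each time point's data should be computed and released before the next is touched, so peak memory should stay roughly flat as the number of time points grows. A minimal sketch of that pattern (forcing() here is a hypothetical stand-in for compute_forcings_and_coarsen_cm2_6(), not the repo's actual code):

```python
import numpy as np
import dask
import dask.array as da

# Hypothetical stand-in for the per-timepoint forcing computation:
# any embarrassingly-parallel transform along the time axis will do.
def forcing(block):
    return block - block.mean()

ntimes = 50
# One chunk per time point, mirroring the per-time-point independence.
u = da.random.random((ntimes, 128, 128), chunks=(1, 128, 128))

# One way to guide scheduling: the synchronous scheduler executes the
# graph depth-first in a single thread, so each time point's chunk is
# computed, written, and released before the next one is loaded.
out = np.lib.format.open_memmap(
    "forcings.npy", mode="w+", dtype="float64", shape=u.shape
)
with dask.config.set(scheduler="synchronous"):
    da.store(u.map_blocks(forcing), out)
```

Streaming via da.store() into a memory-mapped file avoids ever materialising the full result in RAM; the open question in this issue is why peak usage still grows with --ntimes in the real pipeline.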

dorchard commented 8 months ago

@CemGultekin1 have you come across any memory issues with the gz code as well in your own explorations?

dorchard commented 8 months ago

@raehik can you isolate this and make an MWE that we could ask the Dask developers about?
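One possible shape for such an MWE: run the same lazy pipeline at two sizes and compare peak memory, which reproduces the issue if the peak grows with the number of time points. This is a sketch, not the repo's code; forcing() and the array sizes are illustrative placeholders.

```python
import tracemalloc

import numpy as np
import dask
import dask.array as da

# Illustrative per-timepoint transform (placeholder for the real forcing).
def forcing(block):
    return 2.0 * block

def peak_memory(ntimes):
    """Run the lazy pipeline for `ntimes` time points; return peak bytes."""
    x = da.random.random((ntimes, 64, 64), chunks=(1, 64, 64))
    out = np.lib.format.open_memmap(
        f"out_{ntimes}.npy", mode="w+", dtype="float64", shape=x.shape
    )
    tracemalloc.start()
    with dask.config.set(scheduler="synchronous"):
        da.store(x.map_blocks(forcing), out)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

peak_50 = peak_memory(50)
peak_100 = peak_memory(100)
print(f"peak @ ntimes=50:  {peak_50} bytes")
print(f"peak @ ntimes=100: {peak_100} bytes")
```

If the doubling reproduces here, this is small enough to attach to a Dask issue; if it doesn't, the growth likely comes from something pipeline-specific (e.g. the remote CM2.6 data loading) rather than Dask's scheduler.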