See the memory usage plot from the gist (updated to only profile resource usage)
In my experience with GK-2A data, setting PYTROLL_CHUNK_SIZE to 1024 limits maximum memory usage to around 16GB, and processing is also faster. I hope this also works for other geostationary satellites. Besides, I remember that a long time ago satpy didn't have this issue; you could get results quickly with acceptable memory usage.
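For reference, a minimal sketch of setting this from Python; the environment variable needs to be set before satpy reads it, so it comes before the satpy import here, and 1024 is just the value suggested above:

```python
import os

# Set the chunk size before importing satpy so the readers pick it up.
# 1024 is the value suggested above; segment-sized values may also work.
os.environ["PYTROLL_CHUNK_SIZE"] = "1024"

from satpy import Scene  # imported after the environment variable is set
```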
@Plantain Can you try passing tiled=True to save_datasets and see if/how that improves your memory?
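For reference, roughly what that call looks like; this assumes `scn` is an already-loaded (and resampled) Scene, as in the gist:

```python
# Assuming `scn` is an already-loaded (and resampled) Scene, as in the gist;
# tiled=True is forwarded to satpy's geotiff writer (rasterio/GDAL tiling).
scn.save_datasets(writer="geotiff", base_dir="/tmp", tiled=True)
```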
Regarding the PYTROLL_CHUNK_SIZE suggestion: if you look at the attached script, that is already set.
Regarding passing tiled=True to save_datasets: the right graph is tiled=False. You can see it uses ~5GB less memory in the final writing phase, but no difference to the peak.
@Plantain That is a very good data point to have. This shows very clearly that for some reason dask (or something in this processing) is holding on to the memory before writing. I think I'm seeing the same thing in my own testing, but am still trying to narrow down what is allocating what and trying to understand everything that the dask diagnostics are telling me.
What I'm currently working on and what I have figured out are:

- CacheProfiler: I see that one task is continuously holding on to (caching) tasks as processing continues. This could be a misunderstanding of mine or a bug in the way I set up my test script, like dask needing to hold on to all the individual tasks until it can generate a final result. However (see below) I think I've gotten around that.
- Passing mask_space=False to the reader cuts off about 1.5GB of memory when loading a single channel (see the sketch after this list). I think there are some optimizations I could do that would help reduce this, but I don't think it is the main issue.
- Generating overview takes almost 4 minutes, while generating true_color_nocorr takes a little less than 45 seconds. overview should be incredibly simple compared to the true_color_nocorr composite, especially since I'm skipping the enhancing and writing step in my script.
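For reference, a minimal sketch of how that reader option is passed; the file paths are hypothetical, and mask_space is forwarded to the ahi_hsd file handlers via reader_kwargs:

```python
from glob import glob
from satpy import Scene

# Hypothetical paths to decompressed full-disk segments.
files = glob("/data/himawari8/HS_H08_*_FLDK_*.DAT")

# mask_space=False skips masking the off-disk (space) pixels, which the
# measurement above put at roughly 1.5GB saved per loaded channel.
scn = Scene(reader="ahi_hsd", filenames=files, reader_kwargs={"mask_space": False})
scn.load(["B03"])
```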
Just a small addition: the IR segments are 550 lines; only band 3 has 2200-line chunks. So maybe setting a chunk size of 550 would be more optimal, as that can be shared by all bands?
I've spent a lot of today updating my test script so I can be a little more flexible with what I load, and so I can save various profile HTML files or performance reports. I've decided to switch to the full true_color plot to see if I can identify issues. I still haven't done the update to how the ahi_hsd reader uses memmap, but I'm getting there. I got distracted looking at the task graphs for AHI and ABI true_color generation. In one of @Plantain's plots on Slack it looked like the memory was increasing until it started saving to disk. Running my own tests, which don't include writing, I noticed that the reduction in memory usage seems to line up with when dask has finished computing all the chunks for the various angles:
The first 80% of that task stream (top plot) is the angle generation. After that it starts generating/loading the band data from the input files. It seems that once it does that, it can finally start dropping some of the angle information, as it has actually been applied to the data and is no longer needed. What's really weird, though, is that I then tried running the same thing but forced the input data chunks so there should have been 1 chunk per file (each file is a segment). With that, the profiles now look like this:
So it took much less time and used much less memory, and the memory doesn't continuously increase; it goes up and down. I have some guesses for why this is.
Still not completely sure, but I do want to point out that the situation in the first screenshot is the same thing that is happening with the ABI reader, just not as much memory is used.
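For anyone wanting to reproduce these kinds of profiles, this is roughly how the dask diagnostics mentioned above (including CacheProfiler) can be captured and written to an HTML file; the da.random computation is just a stand-in for the actual Scene loading/compositing:

```python
import dask.array as da
from dask.diagnostics import CacheProfiler, Profiler, ResourceProfiler, visualize

# Stand-in computation; replace with the actual satpy load/compute/save calls.
arr = da.random.random((5500, 5500), chunks=1100)

with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof, CacheProfiler() as cprof:
    arr.mean().compute()

# Produces an interactive HTML report similar to the profile plots shown here.
visualize([prof, rprof, cprof], filename="profile.html", show=False)
```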
So I think my local changes have improved this, and we have some other small things we could do. As mentioned above, I started noticing that dask was scheduling all the angle-related operations first and wasn't even getting to the other calculations (loading the data from the input files) until all (or almost all) of them were finished. So I played around a bit and tried hacking the modifiers so the rayleigh correction would just apply an adjustment of - 0 to the data and the sunz correction was just / 1. Since this cut out the use of any of the angles, the memory usage went down to ~3GB and the code finished in ~30 seconds. Then I re-enabled the sunz correction and memory usage went up to ~10GB peak and ~60 seconds to complete.
I then refactored the angle generation after I noticed that the cos(sza) generated for the sunz correction wasn't using the same angle generation that the rayleigh correction is. This didn't really change much (and it shouldn't) with my current case of only having the sunz correction and no rayleigh. So then I updated the rayleigh correction to not call pyspectral but still use the angles: refl_cor_band = sunz * satz * ssadiff * 0. This shot the memory usage up to ~30GB at peak, with the familiar pyramid of memory usage. Then I updated rayleigh again to call pyspectral's actual code and this brought the peak memory usage up to ~35GB in ~250s. This is pretty good considering that the last time I ran this I got ~50GB peak memory in ~300s.
The increased memory usage overall makes sense to me. You're including 3 additional angle arrays (sat azimuth, sat zenith, solar azimuth) and the rayleigh correction uses all the angles + the red band + the data being corrected in the same function. That's a lot.
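A rough back-of-the-envelope of why that adds up, assuming 1 km full-disk AHI arrays of 11000 x 11000 with float64 intermediates (the actual dtypes and resolutions involved vary):

```python
# One full-disk 1 km AHI array (11000 x 11000) as float64:
rows, cols = 11000, 11000
one_array_gb = rows * cols * 8 / 1024**3
print(f"one array:  {one_array_gb:.2f} GB")        # ~0.90 GB

# Reflectance + red band + four angle arrays held at once in one function:
print(f"six arrays: {6 * one_array_gb:.2f} GB")    # ~5.4 GB
```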
TODO:

- Look into map_blocks'ing parts of the rayleigh correction so it can maybe use less memory or depend on fewer input arrays at once (a rough sketch of the idea is below).

Edit: I'll try to make a pull request later tonight.
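A rough, self-contained sketch of that map_blocks idea; this is not the actual satpy/pyspectral code, and the per-chunk function is a placeholder:

```python
import dask.array as da

shape, chunks = (5500, 5500), 1100
refl, sunz, satz, ssadiff, red = (
    da.random.random(shape, chunks=chunks) for _ in range(5)
)

def _correct_chunk(refl, sunz, satz, ssadiff, red):
    # Placeholder correction; the real version would call pyspectral here
    # with only this chunk's reflectance and angle data.
    return refl - 0.0 * (sunz * satz * ssadiff) * red

# map_blocks applies the function chunk-by-chunk, so only the matching chunks
# of each input need to be in memory at the same time.
corrected = da.map_blocks(_correct_chunk, refl, sunz, satz, ssadiff, red, dtype=refl.dtype)
corrected.sum().compute()
```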
As github shows I made #1909 and #1910 which reduce the memory usage a bit. I've now made the biggest improvement by changing how lon/lats are generated in pyresample. It completes in about the same amount of time, but I went from ~30-35GB of memory peak to ~5.5GB.
The main idea with the fix (I'll be making a pull request this afternoon) is that pyresample generates lons/lats for an AreaDefinition by first generating the x/y coordinate vectors (so 1D), then using np.meshgrid to generate two 2D arrays from those, representing the X and Y coordinates for each pixel in the 2D AreaDefinition. It then passes these to pyproj to be transformed from x/y projection space to lon/lat space. The issue is/was that this makes dask think that all the chunks depend on the same np.arange that was used to generate the projection vectors. And since these arange results are then passed to meshgrid, you get a big series of connections in the dask task graph. In principle this doesn't create any connections that aren't accurate, but it does make it harder for dask to know whether it can continue down a specific thread of processing, because it thinks it needs to hold on to all these arange and x/y arrays and that they are somehow associated with all of the other ones.
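A rough illustration of the difference; this is just the idea, not pyresample's actual implementation, and the shapes and chunk sizes are made up:

```python
import dask.array as da
import numpy as np

# Old style: 1D coordinate vectors expanded with meshgrid. Every 2D chunk of
# xx/yy now traces back to the same shared arange tasks in the graph.
x = da.arange(5500, chunks=1100, dtype=np.float64)
y = da.arange(5500, chunks=1100, dtype=np.float64)
xx, yy = da.meshgrid(x, y)

# Per-chunk style: each chunk computes its own coordinates from its position
# in the array (block_info), so no chunk shares inputs with any other chunk.
def _x_coords(block, block_info=None):
    _row_loc, (c0, c1) = block_info[None]["array-location"]
    rows = block.shape[0]
    return np.tile(np.arange(c0, c1, dtype=block.dtype), (rows, 1))

template = da.zeros((5500, 5500), chunks=1100, dtype=np.float64)
xx_independent = da.map_blocks(_x_coords, template, dtype=np.float64)
```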
I'm going to eat some lunch, do some meetings, and then make a pull request on pyresample for this.
WARNING: I have not actually tried generating geotiffs with the changes I've made and have no idea if they are even still correct...but none of the tests fail.
Ooh, that's remarkable, let's hope it works all the way!
That last improvement is really nice, and will provide a boost to all area-based data!
Here's what I get with my newest update to the angles PR in satpy and my get_proj_coords PR in pyresample when generating an AHI true_color all the way to a tiled geotiff (260-290 seconds) with 4 workers on my laptop:
Here's ABI with the same situation (<140s):
@Plantain Do you have a plot of what it looked like with ABI before my changes?
Never mind, I switched my pyresample branch and here's what ABI used to look like: what previously used ~17GB now only uses ~6GB.
Pyresample 1.22.2 was released on Friday and includes the major improvements shown here. We'll plan to release Satpy this week, which will include the other smaller performance improvements. Closing this as those satpy improvements are now in the satpy main branch.
Using AHI/ABI readers with Himawari/GOES data and producing composites uses in excess of 24GB of RAM, even for a single worker/thread. I suspect this is more than is necessary and probably more than when it was originally written.
To Reproduce: Run https://gist.github.com/Plantain/18afecfc8f6c049aa8fbc7f92e7d8284 with decompressed Himawari-8 full-disk imagery.
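For reference, a minimal sketch of the kind of workflow the gist runs; the paths and settings here are hypothetical, and the gist itself is the authoritative reproducer:

```python
import os
os.environ.setdefault("PYTROLL_CHUNK_SIZE", "1024")  # chunk size discussed in the thread above

import dask
from glob import glob
from satpy import Scene

dask.config.set(num_workers=1)  # memory stays high even with a single worker

# Hypothetical path to decompressed Himawari-8 full-disk HSD segments.
scn = Scene(reader="ahi_hsd", filenames=glob("/data/himawari8/HS_H08_*_FLDK_*.DAT"))
scn.load(["true_color"])
scn = scn.resample(scn.coarsest_area(), resampler="native")
scn.save_datasets(writer="geotiff", base_dir="/tmp")
```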
Expected behavior: I don't know what an appropriate amount of memory usage is, but I suspect it is considerably less than 24GB for a single worker. I understand Dask is meant to enable chunking of tasks into smaller components, and something is not working correctly here.
Actual results: Observing memory usage with top shows the process consistently using ~24GB of RAM.
Environment Info: