See the memory usage plot from the gist (updated to only profile resource usage)
In my experience with GK-2A data, setting PYTROLL_CHUNK_SIZE to 1024 limits maximum memory usage to around 16GB, and processing is also faster. I hope this also works for other geostationary satellites. Besides, I remember that a long time ago satpy didn't have this issue; you could get results quickly with acceptable memory usage.
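For reference, a minimal sketch of setting this from Python; the environment variable needs to be set before satpy reads it, so it comes before the satpy import here, and 1024 is just the value suggested above:

```python
import os

# Set the chunk size before importing satpy so the readers pick it up.
# 1024 is the value suggested above; segment-sized values may also work.
os.environ["PYTROLL_CHUNK_SIZE"] = "1024"

from satpy import Scene  # imported after the environment variable is set
```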
@Plantain Can you try passing tiled=True to save_datasets and see if/how that improves your memory?
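For reference, roughly what that call looks like; this assumes `scn` is an already-loaded (and resampled) Scene, as in the gist:

```python
# Assuming `scn` is an already-loaded (and resampled) Scene, as in the gist;
# tiled=True is forwarded to satpy's geotiff writer (rasterio/GDAL tiling).
scn.save_datasets(writer="geotiff", base_dir="/tmp", tiled=True)
```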
Regarding the PYTROLL_CHUNK_SIZE suggestion: if you look at the attached script, that is already set.
Regarding passing tiled=True to save_datasets: the right graph is tiled=False. You can see it uses ~5GB less memory in the final writing phase, but no difference to the peak.
@Plantain That is a very good data point to have. This shows very clearly that for some reason dask (or something in this processing) is holding on to the memory before writing. I think I'm seeing the same thing in my own testing, but am still trying to narrow down what is allocating what and trying to understand everything that the dask diagnostics are telling me.
What I'm currently working on and what I have figured out are:

- CacheProfiler: I see that one task is continuously holding on to (caching) tasks as processing continues. This could be a misunderstanding of mine or a bug in the way I set up my test script, like dask needing to hold on to all the individual tasks until it can generate a final result. However (see below) I think I've gotten around that.
- Passing mask_space=False to the reader cuts off about 1.5GB of memory when loading a single channel (see the sketch after this list). I think there are some optimizations I could do that would help reduce this, but I don't think it is the main issue.
- Generating overview takes almost 4 minutes, while generating true_color_nocorr takes a little less than 45 seconds. overview should be incredibly simple compared to the true_color_nocorr composite, especially since I'm skipping the enhancing and writing step in my script.
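For reference, a minimal sketch of how that reader option is passed; the file paths are hypothetical, and mask_space is forwarded to the ahi_hsd file handlers via reader_kwargs:

```python
from glob import glob
from satpy import Scene

# Hypothetical paths to decompressed full-disk segments.
files = glob("/data/himawari8/HS_H08_*_FLDK_*.DAT")

# mask_space=False skips masking the off-disk (space) pixels, which the
# measurement above put at roughly 1.5GB saved per loaded channel.
scn = Scene(reader="ahi_hsd", filenames=files, reader_kwargs={"mask_space": False})
scn.load(["B03"])
```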
Just a small addition: the IR segments are 550 lines; only band 3 has 2200-line chunks. So maybe setting a chunk size of 550 would be more optimal, as that can be shared by all bands?
I've spent a lot of today updating my test script so I can be a little more flexible with what I load, and so I can save various profile HTML files or performance reports. I've decided to switch to the full true_color plot to see if I can identify issues. I still haven't done the update to how the ahi_hsd reader uses memmap, but I'm getting there. I got distracted looking at the task graphs for AHI and ABI true_color generation. In one of @Plantain's plots on Slack it looked like the memory was increasing until it started saving to disk. Running my own tests, which don't include writing, I noticed that the reduction in memory usage seems to line up with when dask has finished computing all the chunks for the various angles:
The first 80% of that task stream (top plot) is the angle generation. After that it starts generating/loading the band data from the input files. It seems that once it does that, it can finally start dropping some of the angle information, as it has actually been applied to the data and is no longer needed. What's really weird, though, is that I then tried running the same thing but forced the input data chunks so there should have been 1 chunk per file (each file is a segment). With that, the profiles now look like this:
So it took much less time and used much less memory, and the memory doesn't continuously increase; it goes up and down. I have some guesses for why this is.
Still not completely sure, but I do want to point out that the situation in the first screenshot is the same thing that is happening with the ABI reader, just not as much memory is used.
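For anyone wanting to reproduce these kinds of profiles, this is roughly how the dask diagnostics mentioned above (including CacheProfiler) can be captured and written to an HTML file; the da.random computation is just a stand-in for the actual Scene loading/compositing:

```python
import dask.array as da
from dask.diagnostics import CacheProfiler, Profiler, ResourceProfiler, visualize

# Stand-in computation; replace with the actual satpy load/compute/save calls.
arr = da.random.random((5500, 5500), chunks=1100)

with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof, CacheProfiler() as cprof:
    arr.mean().compute()

# Produces an interactive HTML report similar to the profile plots shown here.
visualize([prof, rprof, cprof], filename="profile.html", show=False)
```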
So I think my local changes have improved this, and we have some other small things we could do. As mentioned above, I started noticing that dask was scheduling all the angle-related operations first and wasn't even getting to the other calculations (loading the data from the input files) until all (or almost all) of them were finished. So I played around a bit and tried hacking the modifiers so the rayleigh correction would just apply an adjustment of - 0 to the data and the sunz correction was just / 1. Since this cut out the use of any of the angles, the memory usage went down to ~3GB and the code finished in ~30 seconds. Then I re-enabled the sunz correction and memory usage went up to ~10GB peak and ~60 seconds to complete.
I then refactored the angle generation after I noticed that the cos(sza) generated for the sunz correction wasn't using the same angle generation that the rayleigh correction is. This didn't really change much (and it shouldn't) with my current case of only having the sunz correction and no rayleigh. So then I updated the rayleigh correction to not call pyspectral but still use the angles: refl_cor_band = sunz * satz * ssadiff * 0. This shot the memory usage up to ~30GB at peak, with the familiar pyramid of memory usage. Then I updated rayleigh again to call pyspectral's actual code and this brought the peak memory usage up to ~35GB in ~250s. This is pretty good considering that the last time I ran this I got ~50GB peak memory in ~300s.
The increased memory usage overall makes sense to me. You're including 3 additional angle arrays (sat azimuth, sat zenith, solar azimuth) and the rayleigh correction uses all the angles + the red band + the data being corrected in the same function. That's a lot.
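A rough back-of-the-envelope of why that adds up, assuming 1 km full-disk AHI arrays of 11000 x 11000 with float64 intermediates (the actual dtypes and resolutions involved vary):

```python
# One full-disk 1 km AHI array (11000 x 11000) as float64:
rows, cols = 11000, 11000
one_array_gb = rows * cols * 8 / 1024**3
print(f"one array:  {one_array_gb:.2f} GB")        # ~0.90 GB

# Reflectance + red band + four angle arrays held at once in one function:
print(f"six arrays: {6 * one_array_gb:.2f} GB")    # ~5.4 GB
```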
TODO:

- Look into map_blocks'ing parts of the rayleigh correction so it can maybe use less memory or depend on fewer input arrays at once (a rough sketch of the idea is below).

Edit: I'll try to make a pull request later tonight.
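A rough, self-contained sketch of that map_blocks idea; this is not the actual satpy/pyspectral code, and the per-chunk function is a placeholder:

```python
import dask.array as da

shape, chunks = (5500, 5500), 1100
refl, sunz, satz, ssadiff, red = (
    da.random.random(shape, chunks=chunks) for _ in range(5)
)

def _correct_chunk(refl, sunz, satz, ssadiff, red):
    # Placeholder correction; the real version would call pyspectral here
    # with only this chunk's reflectance and angle data.
    return refl - 0.0 * (sunz * satz * ssadiff) * red

# map_blocks applies the function chunk-by-chunk, so only the matching chunks
# of each input need to be in memory at the same time.
corrected = da.map_blocks(_correct_chunk, refl, sunz, satz, ssadiff, red, dtype=refl.dtype)
corrected.sum().compute()
```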
As github shows I made #1909 and #1910 which reduce the memory usage a bit. I've now made the biggest improvement by changing how lon/lats are generated in pyresample. It completes in about the same amount of time, but I went from ~30-35GB of memory peak to ~5.5GB.
The main idea with the fix (I'll be making a pull request this afternoon) is that pyresample generates lons/lats for an AreaDefinition by first generating the x/y coordinate vectors (so 1D), then using np.meshgrid to generate two 2D arrays from those, representing the X and Y coordinates for each pixel in the 2D AreaDefinition. It then passes these to pyproj to be transformed from x/y projection space to lon/lat space. The issue is/was that this makes dask think that all the chunks depend on the same np.arange that was used to generate the projection vectors. And since these arange results are then passed to meshgrid, you get a big series of connections in the dask task graph. In principle this doesn't create any connections that aren't accurate, but it does make it harder for dask to know whether it can continue down a specific thread of processing, because it thinks it needs to hold on to all these arange and x/y arrays and that they are somehow associated with all of the other ones.
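A rough illustration of the difference; this is just the idea, not pyresample's actual implementation, and the shapes and chunk sizes are made up:

```python
import dask.array as da
import numpy as np

# Old style: 1D coordinate vectors expanded with meshgrid. Every 2D chunk of
# xx/yy now traces back to the same shared arange tasks in the graph.
x = da.arange(5500, chunks=1100, dtype=np.float64)
y = da.arange(5500, chunks=1100, dtype=np.float64)
xx, yy = da.meshgrid(x, y)

# Per-chunk style: each chunk computes its own coordinates from its position
# in the array (block_info), so no chunk shares inputs with any other chunk.
def _x_coords(block, block_info=None):
    _row_loc, (c0, c1) = block_info[None]["array-location"]
    rows = block.shape[0]
    return np.tile(np.arange(c0, c1, dtype=block.dtype), (rows, 1))

template = da.zeros((5500, 5500), chunks=1100, dtype=np.float64)
xx_independent = da.map_blocks(_x_coords, template, dtype=np.float64)
```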
I'm going to eat some lunch, do some meetings, and then make a pull request on pyresample for this.
WARNING: I have not actually tried generating geotiffs with the changes I've made and have no idea if they are even still correct...but none of the tests fail.
Ooh, that's remarkable, let's hope it works all the way!
That last improvement is really nice, and will provide a boost to all area-based data!
Here's what I get with my newest update to the angles PR in satpy and my get_proj_coords PR in pyresample when generating an AHI true_color all the way to a tiled geotiff (260-290 seconds) with 4 workers on my laptop:
Here's ABI with the same situation (<140s):
@Plantain Do you have a plot of what it looked like with ABI before my changes?
Never mind, I switched my pyresample branch and here's what ABI used to look like: what previously used ~17GB now only uses ~6GB.
Pyresample 1.22.2 was released on Friday and includes the major improvements shown here. We'll plan to release Satpy this week, which will include the other smaller performance improvements. Closing this as those satpy improvements are now in the satpy main branch.
Using AHI/ABI readers with Himawari/GOES data and producing composites uses in excess of 24GB of RAM, even for a single worker/thread. I suspect this is more than is necessary and probably more than when it was originally written.
To Reproduce: Run https://gist.github.com/Plantain/18afecfc8f6c049aa8fbc7f92e7d8284 with decompressed Himawari-8 full-disk imagery.
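For reference, a minimal sketch of the kind of workflow the gist runs; the paths and settings here are hypothetical, and the gist itself is the authoritative reproducer:

```python
import os
os.environ.setdefault("PYTROLL_CHUNK_SIZE", "1024")  # chunk size discussed in the thread above

import dask
from glob import glob
from satpy import Scene

dask.config.set(num_workers=1)  # memory stays high even with a single worker

# Hypothetical path to decompressed Himawari-8 full-disk HSD segments.
scn = Scene(reader="ahi_hsd", filenames=glob("/data/himawari8/HS_H08_*_FLDK_*.DAT"))
scn.load(["true_color"])
scn = scn.resample(scn.coarsest_area(), resampler="native")
scn.save_datasets(writer="geotiff", base_dir="/tmp")
```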
Expected behavior: I don't know what an appropriate amount of memory usage is, but I suspect it is considerably less than 24GB for a single worker. I understand Dask is meant to enable chunking of tasks into smaller components, and something is not working correctly here.
Actual results: Observing memory usage with top shows the process consistently using ~24GB of RAM.
Environment Info: