pytroll / satpy

Python package for earth-observing satellite data processing
http://satpy.readthedocs.org/en/latest/
GNU General Public License v3.0

Large overhead resampling multiple datasets to multiple areas with FCI #2620

Open gerritholl opened 10 months ago

gerritholl commented 10 months ago

Describe the bug

When saving multiple composites to multiple areas, there is a very large overhead before dask gets to work using all nodes.

To Reproduce

The following might seem convoluted, but is a pretty close approximation to what trollflow2 does in practice and therefore to what happens operationally. @pnuu may correct me if I'm wrong.

import hdf5plugin
from satpy import Scene
from glob import glob
from satpy.utils import debug_on; debug_on()
from satpy.writers import compute_writer_results
from dask.diagnostics import Profiler, ResourceProfiler, visualize
import dask.config
fci_files = glob("/media/x21308/MTG_test_data/2022_05_MTG_Testdata/RC0040/*BODY*.nc")
res500 = ["vis_06"]
res1000 = ["vis_08", "nir_16", "ir_38", "ir_105"]
res2000 = ["wv_63", "wv_73", "ir_87", "ir_97", "ir_123", "ir_133", "airmass",
           "convection", "dust", "ash"]
cd = "/data/gholl/cache/pytroll"
with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof:
    sc = Scene(filenames={"fci_l1c_nc": fci_files})
    sc.load(res500 + res1000 + res2000, generate=False)
    ls1 = sc.resample("nqeuro500m", cache_dir=cd)
    ls2 = sc.resample("nqeuro1km", cache_dir=cd)
    ls3 = sc.resample("nqeuro2km", cache_dir=cd)
    comps = []
    for c in res500:
        comps.append(ls1.save_dataset(c, filename=f"{c:s}.tif", compute=False))
    for c in res1000:
        comps.append(ls2.save_dataset(c, filename=f"{c:s}.tif", compute=False))
    for c in res2000:
        comps.append(ls3.save_dataset(c, filename=f"{c:s}.tif", compute=False))
    compute_writer_results(comps)
visualize([prof, rprof], show=False, save=True, filename="fci-dask-multi-dataset.html")

Expected behavior

I would expect a small overhead, followed by many dask workers chugging along nicely.

Actual results

The dask graph shows the clearest illustration of the problem:

(screenshot: dask profiling graph)

Once dask gets to work, it's fine. But most of the time is overhead.

The full text log output is 391 KB and cannot be included here. Some key moments:

[DEBUG: 2023-10-26 12:53:54 : satpy.readers.fci_l1c_nc] Reading: /media/x21308/MTG_test_data/2022_05_MTG_Testdata/RC0040/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY---NC4E_C_EUMT_20170920064325_GTT_DEV_20170920063743_20170920063825_N_JLS_T_0040_0033.nc
[DEBUG: 2023-10-26 12:53:54 : satpy.readers.fci_l1c_nc] Start: 2017-09-20 06:30:00
[DEBUG: 2023-10-26 12:53:54 : satpy.readers.fci_l1c_nc] End: 2017-09-20 06:40:00
[DEBUG: 2023-10-26 12:53:55 : satpy.readers.fci_l1c_nc] Reading: /media/x21308/MTG_test_data/2022_05_MTG_Testdata/RC0040/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY---NC4E_C_EUMT_20170920064407_GTT_DEV_20170920063835_20170920063907_N_JLS_T_0040_0037.nc
[DEBUG: 2023-10-26 12:53:55 : satpy.readers.fci_l1c_nc] Start: 2017-09-20 06:30:00
[DEBUG: 2023-10-26 12:53:55 : satpy.readers.fci_l1c_nc] End: 2017-09-20 06:40:00

[DEBUG: 2023-10-26 12:54:09 : satpy.readers.fci_l1c_nc] Reading ir_38_pixel_quality from /media/x21308/MTG_test_data/2022_05_MTG_Testdata/RC0040/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+FCI-1C-RRAD-FDHSI-FD--CHK-BODY---NC4E_C_EUMT_20170920064422_GTT_DEV_20170920063912_20170920063922_N_JLS_T_0040_0040.nc
[DEBUG: 2023-10-26 12:54:09 : satpy.readers.yaml_reader] Requested orientation for Dataset None is 'native' (default). No flipping is applied.
[DEBUG: 2023-10-26 12:54:09 : satpy.scene] Resampling DataID(name='ir_105', wavelength=WavelengthRange(min=9.8, central=10.5, max=11.2, unit='µm'), resolution=2000, calibration=<2>, modifiers=())
[INFO: 2023-10-26 12:54:29 : satpy.resample] Using default KDTree resampler
/opt/pytroll/pytroll_inst/mambaforge/envs/pytroll-py311/lib/python3.11/site-packages/numpy/lib/function_base.py:1448: RuntimeWarning: invalid value encountered in subtract
  a = op(a[slice1], a[slice2])
[DEBUG: 2023-10-26 12:54:29 : satpy.resample] Check if /data/gholl/cache/pytroll/resample_lut-4a8b992c2ba6ddc5bee53fd986a838236d873d25.npz exists
[DEBUG: 2023-10-26 12:54:38 : satpy.resample] Read pre-computed kd-tree parameters
[DEBUG: 2023-10-26 12:54:38 : satpy.resample] Resampling None
/opt/pytroll/pytroll_inst/mambaforge/envs/pytroll-py311/lib/python3.11/site-packages/pyresample/kd_tree.py:1131: PerformanceWarning: Increasing number of chunks by factor of 13
  res = blockwise(_my_index, dst_adims,
[DEBUG: 2023-10-26 12:54:38 : satpy.scene] Resampling DataID(name='ir_105_pixel_quality', resolution=2000, modifiers=())
[DEBUG: 2023-10-26 12:54:38 : satpy.resample] Check if /data/gholl/cache/pytroll/resample_lut-4a8b992c2ba6ddc5bee53fd986a838236d873d25.npz exists
[DEBUG: 2023-10-26 12:54:38 : satpy.resample] Read pre-computed kd-tree parameters
[DEBUG: 2023-10-26 12:54:38 : satpy.resample] Resampling pixel_quality
/opt/pytroll/pytroll_inst/mambaforge/envs/pytroll-py311/lib/python3.11/site-packages/pyresample/kd_tree.py:1131: PerformanceWarning: Increasing number of chunks by factor of 13
  res = blockwise(_my_index, dst_adims,
[DEBUG: 2023-10-26 12:54:38 : satpy.scene] Resampling DataID(name='ir_123', wavelength=WavelengthRange(min=11.8, central=12.3, max=12.8, unit='µm'), resolution=2000, calibration=<2>, modifiers=())
[DEBUG: 2023-10-26 12:54:38 : satpy.resample] Check if /data/gholl/cache/pytroll/resample_lut-4a8b992c2ba6ddc5bee53fd986a838236d873d25.npz exists
[DEBUG: 2023-10-26 12:54:38 : satpy.resample] Read pre-computed kd-tree parameters
[DEBUG: 2023-10-26 12:54:38 : satpy.resample] Resampling None

[DEBUG: 2023-10-26 12:55:07 : satpy.resample] Resampling pixel_quality
/opt/pytroll/pytroll_inst/mambaforge/envs/pytroll-py311/lib/python3.11/site-packages/pyresample/kd_tree.py:1131: PerformanceWarning: Increasing number of chunks by factor of 13
  res = blockwise(_my_index, dst_adims,
[DEBUG: 2023-10-26 12:55:07 : satpy.scene] Unloading dataset: DataID(name='_ash_dep_0', resolution=2000)
[DEBUG: 2023-10-26 12:55:07 : satpy.scene] Unloading dataset: DataID(name='_ash_dep_1', resolution=2000)
[DEBUG: 2023-10-26 12:55:07 : satpy.scene] Unloading dataset: DataID(name='_dust_dep_0', resolution=2000)
[DEBUG: 2023-10-26 12:55:07 : satpy.scene] Unloading dataset: DataID(name='_airmass_dep_0', resolution=2000)
[DEBUG: 2023-10-26 12:55:07 : satpy.scene] Unloading dataset: DataID(name='_dust_dep_1', resolution=2000)
[DEBUG: 2023-10-26 12:55:07 : satpy.scene] Unloading dataset: DataID(name='_convection_dep_2', resolution=1000)
[DEBUG: 2023-10-26 12:55:07 : satpy.scene] Unloading dataset: DataID(name='_convection_dep_0', resolution=2000)
[DEBUG: 2023-10-26 12:55:07 : satpy.scene] Unloading dataset: DataID(name='_convection_dep_1', resolution=2000)
[DEBUG: 2023-10-26 12:55:07 : satpy.scene] Unloading dataset: DataID(name='_airmass_dep_1', resolution=2000)
[DEBUG: 2023-10-26 12:55:07 : satpy.scene] Resampling DataID(name='ir_105', wavelength=WavelengthRange(min=9.8, central=10.5, max=11.2, unit='µm'), resolution=2000, calibration=<2>, modifiers=())
[INFO: 2023-10-26 12:55:17 : satpy.resample] Using default KDTree resampler
/opt/pytroll/pytroll_inst/mambaforge/envs/pytroll-py311/lib/python3.11/site-packages/numpy/lib/function_base.py:1448: RuntimeWarning: invalid value encountered in subtract
  a = op(a[slice1], a[slice2])
[DEBUG: 2023-10-26 12:55:17 : satpy.resample] Check if /data/gholl/cache/pytroll/resample_lut-a8e5f2ea895bec08e52280a834ffd6888f603025.npz exists
[DEBUG: 2023-10-26 12:55:19 : satpy.resample] Read pre-computed kd-tree parameters
[DEBUG: 2023-10-26 12:55:19 : satpy.resample] Resampling None

[DEBUG: 2023-10-26 12:55:44 : satpy.resample] Resampling None
[DEBUG: 2023-10-26 12:55:44 : satpy.scene] Resampling DataID(name='wv_73_pixel_quality', resolution=2000, modifiers=())
[DEBUG: 2023-10-26 12:55:44 : satpy.resample] Check if /data/gholl/cache/pytroll/resample_lut-9580df4ec1e01ac6df93728499d57a1aa2d6ffad.npz exists
[DEBUG: 2023-10-26 12:55:44 : satpy.resample] Read pre-computed kd-tree parameters
[DEBUG: 2023-10-26 12:55:44 : satpy.resample] Resampling pixel_quality
[DEBUG: 2023-10-26 12:55:44 : satpy.scene] Unloading dataset: DataID(name='_ash_dep_0', resolution=2000)
[DEBUG: 2023-10-26 12:55:44 : satpy.scene] Unloading dataset: DataID(name='_ash_dep_1', resolution=2000)
[DEBUG: 2023-10-26 12:55:44 : satpy.scene] Unloading dataset: DataID(name='_dust_dep_0', resolution=2000)
[DEBUG: 2023-10-26 12:55:44 : satpy.scene] Unloading dataset: DataID(name='_airmass_dep_0', resolution=2000)
[DEBUG: 2023-10-26 12:55:44 : satpy.scene] Unloading dataset: DataID(name='_dust_dep_1', resolution=2000)
[DEBUG: 2023-10-26 12:55:44 : satpy.scene] Unloading dataset: DataID(name='_convection_dep_2', resolution=1000)
[DEBUG: 2023-10-26 12:55:44 : satpy.scene] Unloading dataset: DataID(name='_convection_dep_0', resolution=2000)
[DEBUG: 2023-10-26 12:55:44 : satpy.scene] Unloading dataset: DataID(name='_convection_dep_1', resolution=2000)
[DEBUG: 2023-10-26 12:55:44 : satpy.scene] Unloading dataset: DataID(name='_airmass_dep_1', resolution=2000)
[DEBUG: 2023-10-26 12:55:44 : satpy.writers] Reading ['/opt/pytroll/pytroll_inst/mambaforge/envs/pytroll-py311/lib/python3.11/site-packages/satpy/etc/writers/geotiff.yaml']
[DEBUG: 2023-10-26 12:55:44 : rasterio.session] Could not import boto3, continuing with reduced functionality.
[DEBUG: 2023-10-26 12:55:44 : satpy.writers] Adding enhancement configuration from file: /opt/pytroll/pytroll_inst/mambaforge/envs/pytroll-py311/lib/python3.11/site-packages/satpy/etc/enhancements/generic.yaml
[DEBUG: 2023-10-26 12:55:44 : satpy.writers] Adding enhancement configuration from file: /opt/pytroll/pytroll_inst/config/enhancements/generic.yaml
[DEBUG: 2023-10-26 12:55:44 : satpy.writers] Data for DataID(name='vis_06', wavelength=WavelengthRange(min=0.59, central=0.64, max=0.69, unit='µm'), resolution=1000, calibration=<1>, modifiers=()) will be enhanced with options:
        [{'name': 'linear_stretch', 'method': <function stretch at 0x7f8182470180>, 'kwargs': {'stretch': 'crude', 'min_stretch': 0.0, 'max_stretch': 100.0}}, {'name': 'gamma', 'method': <function gamma at 0x7f81824709a0>, 'kwargs': {'gamma': 1.5}}]
[DEBUG: 2023-10-26 12:55:44 : trollimage.xrimage] Applying stretch crude with parameters {'min_stretch': 0.0, 'max_stretch': 100.0}
[DEBUG: 2023-10-26 12:55:44 : trollimage.xrimage] Applying gamma 1.5

[DEBUG: 2023-10-26 12:55:47 : satpy.writers] Data for DataID(name='ash', resolution=2000) will be enhanced with options:
        [{'name': 'stretch', 'method': <function stretch at 0x7f8182470180>, 'kwargs': {'stretch': 'crude', 'min_stretch': [-4, -4, 243], 'max_stretch': [2, 5, 303]}}]
[DEBUG: 2023-10-26 12:55:47 : trollimage.xrimage] Applying stretch crude with parameters {'min_stretch': [-4, -4, 243], 'max_stretch': [2, 5, 303]}
[DEBUG: 2023-10-26 12:55:47 : rasterio.env] Entering env context: <rasterio.env.Env object at 0x7f8181fd7650>
[DEBUG: 2023-10-26 12:55:47 : rasterio.env] Starting outermost env
[DEBUG: 2023-10-26 12:55:47 : rasterio.env] No GDAL environment exists
[DEBUG: 2023-10-26 12:55:47 : rasterio.env] New GDAL environment <rasterio._env.GDALEnv object at 0x7f81827a1210> created
[DEBUG: 2023-10-26 12:55:47 : rasterio._env] GDAL_DATA found in environment.
[DEBUG: 2023-10-26 12:55:47 : rasterio._env] PROJ_DATA found in environment.
[DEBUG: 2023-10-26 12:55:47 : rasterio._env] Started GDALEnv: self=<rasterio._env.GDALEnv object at 0x7f81827a1210>.
[DEBUG: 2023-10-26 12:55:47 : rasterio.env] Entered env context: <rasterio.env.Env object at 0x7f8181fd7650>
[DEBUG: 2023-10-26 12:55:47 : rasterio._io] Path: _UnparsedPath(path='ash.tif'), mode: w, driver: GTiff
[DEBUG: 2023-10-26 12:55:47 : rasterio._base] Nodata success: 0, Nodata value: 0.000000
[DEBUG: 2023-10-26 12:55:47 : rasterio._base] Nodata success: 0, Nodata value: 0.000000
[DEBUG: 2023-10-26 12:55:47 : rasterio._base] Nodata success: 0, Nodata value: 0.000000
[DEBUG: 2023-10-26 12:55:47 : rasterio._base] Nodata success: 0, Nodata value: 0.000000
[DEBUG: 2023-10-26 12:55:47 : rasterio.env] Exiting env context: <rasterio.env.Env object at 0x7f8181fd7650>
[DEBUG: 2023-10-26 12:55:47 : rasterio.env] Cleared existing <rasterio._env.GDALEnv object at 0x7f81827a1210> options
[DEBUG: 2023-10-26 12:55:47 : rasterio._env] Stopped GDALEnv <rasterio._env.GDALEnv object at 0x7f81827a1210>.
[DEBUG: 2023-10-26 12:55:47 : rasterio.env] Exiting outermost env
[DEBUG: 2023-10-26 12:55:47 : rasterio.env] Exited env context: <rasterio.env.Env object at 0x7f8181fd7650>
/opt/pytroll/pytroll_inst/mambaforge/envs/pytroll-py311/lib/python3.11/site-packages/dask/core.py:127: RuntimeWarning: invalid value encountered in log
  return func(*(_execute_task(a, cache) for a in args))
/opt/pytroll/pytroll_inst/mambaforge/envs/pytroll-py311/lib/python3.11/site-packages/dask/array/chunk.py:278: RuntimeWarning: invalid value encountered in cast
  return x.astype(astype_dtype, **kwargs)
[DEBUG: 2023-10-26 12:55:59 : trollimage.xrimage] Interval: left=<xarray.DataArray (bands: 1)>
array([79.94678704])
Coordinates:
    quantile  float64 0.005
Dimensions without coordinates: bands, right=<xarray.DataArray (bands: 1)>
array([252.2885688])
Coordinates:
    quantile  float64 0.995
Dimensions without coordinates: bands

This reveals that it spends about 7 seconds on a first pass reading the file headers, about 8 seconds reading the variables from the files, then more than 90 seconds emitting "resampling" messages, which probably means building dask graphs, and another 3 seconds doing the same for the enhancements. It seems to build those resampling dask graphs for all composites/channels and all pixel-quality datasets, even the ones we end up not using. That means the dask overhead penalty is disproportionately high in this case.
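
For anyone who wants to reproduce this breakdown, a minimal timing sketch around the stages of the script above (the timed() helper is mine, not a satpy API; sc, cd, and the channel lists come from the script):

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # print wall-clock time for the enclosed block
    t0 = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - t0:.1f} s")

with timed("load (read headers + build reading graph)"):
    sc.load(res500 + res1000 + res2000, generate=False)
with timed("resample to nqeuro2km (graph building only)"):
    ls3 = sc.resample("nqeuro2km", cache_dir=cd)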

Maybe this should in part be a trollflow2 issue, as it's trollflow2 that's asking satpy to build dask graphs that it ends up not computing.

The final computation is quick.

Environment Info:

Not sure how to label what is essentially a performance problem.

djhoese commented 10 months ago

From my experimenting with my ABI full disk -> "mercator subset" area yesterday, I have concluded that the reduce_data step (and specifically the AreaDefinition.get_area_slices step, which is specifically the polygon intersection calculations) takes about 10-12 seconds on my machine per area combination. How many unique areas does FCI have? I'm guessing at least 3 for the 3 resolutions involved here? So that's 3 source areas × 3 target areas = 9 polygon intersection checks. Not to mention, I don't think the polygon creation is cached in any way, so that is also repeated for each one of those. So 9 intersections × 10-12 seconds = 90 to 108 seconds before we even get out of the .resample block of the code.
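
A minimal sketch for measuring one of those intersection checks in isolation (get_area_slices is pyresample's public API; the source area name below is a hypothetical placeholder, the target comes from the script above):

import time
from satpy.resample import get_area_def

src_area = get_area_def("mtg_fci_fdss_2km")  # hypothetical FCI full-disk area name
dst_area = get_area_def("nqeuro2km")         # target area from the script above

t0 = time.perf_counter()
slices = src_area.get_area_slices(dst_area)  # the polygon-intersection step
print(f"get_area_slices: {time.perf_counter() - t0:.1f} s -> {slices}")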

I mentioned this yesterday, but I'd like to add JSON on-disk caching of the get_area_slices results while we wait for polygon operations to be made faster in pyresample.
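
A hypothetical sketch of what that on-disk cache could look like (cached_get_area_slices is not an existing API; the cache key covers the CRS, shape, and extent of both areas):

import hashlib
import json
import os

def cached_get_area_slices(src_area, dst_area, cache_dir):
    # hash everything that determines the result of get_area_slices
    fingerprint = "".join(
        area.crs.to_wkt() + repr(area.shape) + repr(area.area_extent)
        for area in (src_area, dst_area))
    key = hashlib.sha1(fingerprint.encode()).hexdigest()
    path = os.path.join(cache_dir, f"area_slices-{key}.json")
    if os.path.exists(path):
        with open(path) as fp:
            x0, x1, y0, y1 = json.load(fp)
        return slice(x0, x1), slice(y0, y1)
    slice_x, slice_y = src_area.get_area_slices(dst_area)  # the slow call
    with open(path, "w") as fp:
        json.dump([slice_x.start, slice_x.stop, slice_y.start, slice_y.stop], fp)
    return slice_x, slice_y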

> That means the dask overhead penalty is disproportionately high in this case.

I disagree with the idea that this is "dask overhead". This isn't building the dask graph. This is Satpy and Pyresample wasting time doing things that:

  1. Take longer than we assume they should (bad algorithms written in Python instead of C or another compiled language).
  2. Aren't cached.
gerritholl commented 10 months ago

I didn't use any HRFI data in this test, so the source data covers just 2 areas. If I understand you correctly, that means the overhead will get even larger once I include HRFI (which includes two channels at 500 metre resolution, unlike FDHSI, which has only 1 km and 2 km resolutions).

Sorry, I thought it was all dask overhead, but I see it's "our fault" in this case.

djhoese commented 10 months ago

You'd have to add prints inside Scene._resample_scene around the reduce data logic to see how much time it's spending there and if it seems reasonable. I also assume the cached resampling was pre-cached in a previous execution.
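
A non-invasive alternative to editing the satpy source is to wrap the suspect method and log each call, e.g. (a sketch, assuming the time is indeed spent in pyresample's AreaDefinition.get_area_slices):

import functools
import time
from pyresample.geometry import AreaDefinition

_orig_get_area_slices = AreaDefinition.get_area_slices

@functools.wraps(_orig_get_area_slices)
def _timed_get_area_slices(self, *args, **kwargs):
    # forward the call, printing the wall-clock time per invocation
    t0 = time.perf_counter()
    result = _orig_get_area_slices(self, *args, **kwargs)
    print(f"get_area_slices({self.area_id}): {time.perf_counter() - t0:.1f} s")
    return result

AreaDefinition.get_area_slices = _timed_get_area_slices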

gerritholl commented 10 months ago

I profiled it with cProfile and found that it spends 160.4 seconds in _resampled_scene (3 calls) and 133 seconds in _reduce_data (66 calls), including 132 seconds in intersection (141792 calls). By comparison, compute_writer_results takes only 37 seconds:

(screenshot: cProfile output)

Yes, this uses cached resampling.

gerritholl commented 10 months ago

If I run with reduce_data=False, the overall processing is much faster, despite compute_writer_results increasing from 37s to 54s. Bokeh plot:

(screenshot: Bokeh profiling plot)

cProfile result:

(screenshot: cProfile output with reduce_data=False)
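
For reference, the only change relative to the reproduction script at the top is in the resample calls:

# reduce_data=False skips the get_area_slices-based source data reduction
ls1 = sc.resample("nqeuro500m", reduce_data=False, cache_dir=cd)
ls2 = sc.resample("nqeuro1km", reduce_data=False, cache_dir=cd)
ls3 = sc.resample("nqeuro2km", reduce_data=False, cache_dir=cd)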

djhoese commented 10 months ago

Do your target areas cover a majority of the input data? Let's say more than 75%? If so, then I'd expect similar processing time. My case was an ABI case where I think I had cut off quite a bit of the full disk input data, and when I did reduce_data=False the processing took much longer. I wasn't seeing so much of a slowdown from resampling/reduce_data though. Let me look closer at this cProfile output...

djhoese commented 10 months ago

Oh wow, your per-call time for get_area_slices is almost twice as long as mine was (22 s versus my 10-12 s). That would explain the large dead zone in your original plot for only 2 source areas at 3 targets: 6 × ~22 s ≈ 132 s.

gerritholl commented 10 months ago

> Do your target areas cover a majority of the input data? Let's say more than 75%?

No, not in this case, but some of them do in the operational case (which takes much longer to run).

djhoese commented 9 months ago

@gerritholl with pyresample main you should be able to use:

https://pyresample.readthedocs.io/en/latest/howtos/configuration.html#cache-geometry-slices

This should almost completely remove the slowdown from reduce_data.
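
A minimal sketch of enabling it, assuming the donfig key from the linked page is cache_geometry_slices:

import pyresample
# opt in to on-disk caching of get_area_slices results (pyresample main)
pyresample.config.set(cache_geometry_slices=True)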