metoppv / improver

IMPROVER is a library of algorithms for meteorological post-processing.
http://improver.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

mask slowdown issue #1579

Closed cpelley closed 3 years ago

cpelley commented 3 years ago

Currently the nightly suite run demonstrates issues with some commands involving mask handling.

ppdev suite output:

MEM     INPUT       OUTPUT      CPU     TIME        USER        SYS     COMMAND
534816      84391       13683       99%     0:04.61     4.07        0.52        improver regrid --regrid-mode nearest --regridded-title...
632688      1717887     11174       99%     3:24.80     196.23      8.30        improver regrid --regrid-mode nearest-with-mask --land-sea-mask-vicinity 25000 --regridded-title...

Note the difference in run-time between the two.

cpelley commented 3 years ago

As per an offline message from @benfitzpatrick, I have run the following with profiling switched on.

With spatial-weights-from-mask:

improver --profile=- weighted-blending \
--attributes-config grid-blended-attributes.json \
--coordinate model_configuration --cycletime 20211004T1100Z \
--spatial-weights-from-mask ...
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
...
     4123   48.671    0.012   48.671    0.012 {method 'acquire' of '_thread.lock' objects}
...

Without spatial-weights-from-mask:

~/git/improver/bin/improver --profile=- weighted-blending \
--attributes-config grid-blended-attributes.json \
--coordinate model_configuration --cycletime 20211004T1100Z \
...
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
...
      366    2.755    0.008    2.755    0.008 {method 'acquire' of '_thread.lock' objects}
...

This looks troubling since it appears to be a threading issue :( (I hope not, as I guess that would make dask the prime candidate). I'll have a closer look tomorrow - perhaps calling the profiler in the way I have above (improver --profile) is not correct. The internally built user guide seems out of date here: it suggests we should be able to call improver weighted-blending --profile, but that doesn't seem to be the case.

Oh, and I'm not yet sure how the improver call that Ben has shared relates to the commands flagged as a concern in the ppdev suite nightly run.

benfitzpatrick commented 3 years ago

Worth saying we set a number of thread-related environment variables like MKL_NUM_THREADS=1 that I forgot to mention before!

cpelley commented 3 years ago

Worth saying we set a number of thread-related environment variables like MKL_NUM_THREADS=1 that I forgot to mention before!

Ah OK, yes, I'm somewhat familiar with those. Cheers

cpelley commented 3 years ago

So pretty sure this is a dask issue. The issue goes away if we realise the data early:

https://github.com/cpelley/improver/tree/516e9fc0f4097910f2f9347e1ec73be2293df23d/improver/blending/calculate_weights_and_blend.py#L291

 288             model_id_attr=model_id_attr,
 289         )
 290         cube = merger(cubelist, cycletime=cycletime)
*291         cube.data
 292 
 293         if "model" in self.blend_coord:
 294             self.blend_coord = copy(MODEL_BLEND_COORD)

Realising the data here avoids issues with dask later on


We then complete in 13s
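
For context, here is a minimal standalone illustration of what that one-line change does (a sketch only; the shape and chunking are copied from the cubes in this run):

import dask.array as da
import numpy as np
from iris.cube import Cube

# A cube backed by a lazy dask array, mimicking a freshly loaded cube.
lazy = da.zeros((3, 23, 970, 1042), chunks=(1, 23, 970, 1042), dtype=np.float32)
cube = Cube(lazy)
print(cube.has_lazy_data())   # True: nothing has been read or computed yet

cube.data                     # one bulk realisation, as on line 291 above
print(cube.has_lazy_data())   # False: later slicing works on plain NumPy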

cpelley commented 3 years ago

Apologies for the delay. It took me a little time to figure out how to run the improver command via a Python script, which enables the use of a custom profiler (see below) and debugging via PyCharm.

Using the sample-based profiler kapture I get the following profile: [sampled profile graph attachment]

In particular, we see two branches where most of the time is spent, corresponding to the following areas of code:

[call-graph attachments for the two problem areas]

cpelley commented 3 years ago

Looking at the first identified problem area (spatial weights):

I suspect the issue is that the dask array (from cube_to_collapse) has inappropriate chunking for the slicing we subsequently perform over it.

>>> cube_to_collapse
<iris 'Cube' of probability_of_lwe_thickness_of_snowfall_amount_above_threshold / (1) (model_id: 3; lwe_thickness_of_snowfall_amount: 23; projection_y_coordinate: 970; projection_x_coordinate: 1042)>
>>> cube_to_collapse.shape
(3, 23, 970, 1042)
>>> cube_to_collapse.lazy_data().chunksize
(1, 23, 970, 1042)
>>> coords_to_slice_over
['model_id', 'projection_y_coordinate', 'projection_x_coordinate']
>>> first_slice.lazy_data().chunksize
(1, 970, 1042)
>>> _to_collapse.lazy_data().chunksize
(1, 23, 970, 1042)
>>> cube_slice.lazy_data().chunksize
(1, 970, 1042)

We see that only the first dimension of the original cube_to_collapse actually has a chunk size smaller than the dimension length.

Each slice corresponds to the coords_to_slice_over coords, which map to dimensions 0, 2 and 3. I would guess that each slice will then unnecessarily realise (copy) 23x the amount of data it actually needs, due to the chunking of the original dask array (TBC). This explains why realising the cube_to_collapse array BEFORE doing this slicing makes the problem go away.

Saw something suspiciously similar to this in ants before.
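
To make the over-read concrete, here is a standalone dask sketch of the same chunking pattern (illustrative only; shape and chunks are copied from the REPL output above):

import dask.array as da
import numpy as np

# Same shape/chunking as cube_to_collapse: split only along dim 0, with
# dims 1-3 held in a single chunk each.
arr = da.from_array(
    np.zeros((3, 23, 970, 1042), dtype=np.float32),
    chunks=(1, 23, 970, 1042),
)

# Taking a single slice along dim 1 still depends on the full
# (1, 23, 970, 1042) chunks, so with file-backed data each of the 23
# slices would pull roughly 23x the data it actually uses from disk.
first = arr[:, 0]
print(first.shape)      # (3, 970, 1042), matching first_slice above
print(first.chunksize)  # (1, 970, 1042)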

cpelley commented 3 years ago

Regarding the second problem area (assuming we continue not to realise the data early, that is), we can look further up the branch and see that the problem is again likely caused by slicing the dask array:

[call-graph attachment for the second problem area]

cpelley commented 3 years ago

scitools/production-os43-2

scitools/production-os45-1

I ran it again, except this time with the data located in my /var/tmp, and got the same results.

I'll try running the problem with the improver version that we used with the earlier production environment to see whether the chunking behaviour is indeed different. Thanks to @PaulAbernethy for pointing me to the correct tagged release of improver for this env (v0.16.2). 0.19.0 is then the last version that used iris 2.


EDIT: Using 0.16.2 and 0.19.0 runs it in ~15s with scitools/production-os43-2

cpelley commented 3 years ago

So running improver 0.19.0 against scitools/production-os43-2 gives the speeds we expect because, as I thought, the chunk sizes are in line with how we are slicing the cube.

That is, debugging the same variables we described in https://github.com/metoppv/improver/issues/1579#issuecomment-938788589, gives:

cube_to_collapse.lazy_data().chunksize
(1, 1, 970, 1042)
first_slice.lazy_data().chunksize
(1, 970, 1042)

As we can see, this time the original dask array has a chunk size of 1 in BOTH the first and second dimensions, meaning that the cube slices aren't realising more than they need (i.e. we don't have the 23x extra data that we have in recent versions, where the chunk size is 23 in the second dim).

The question then becomes: what has caused this change in the chunk sizes assigned to the dask array of cube_to_collapse? A few possibilities come to mind.

cpelley commented 3 years ago

So looking further, I see we do a cubelist merge, but the issue with chunk sizes actually originates at the load step itself. It seems this is due to a change in how iris chooses 'suitable chunksizes': https://scitools-iris.readthedocs.io/en/latest/whatsnew/2.3.html (iris code changes).
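
If it is the load-time chunk sizing, one thing that might be worth testing (an assumption to verify, I haven't confirmed it against 3.0.3) is whether shrinking dask's target chunk size keeps the load-time chunks closer to the file's own chunking, since the newer iris logic appears to size chunks against that setting:

import dask.config
import iris

# Assumption: newer iris pads small file chunks up towards dask's
# "array.chunk-size" target (128MiB by default; 23 y-x planes here is
# ~89MiB, which would explain the merged second dimension).  A single
# y-x plane is ~4MiB, so a smaller target might preserve per-plane chunks.
dask.config.set({"array.chunk-size": "4MiB"})
cube = iris.load_cube("input.nc")  # path is illustrative
print(cube.lazy_data().chunksize)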

cpelley commented 3 years ago

I suspect this issue has been noticed and raised in https://github.com/SciTools/iris/issues/4107

cpelley commented 3 years ago

The issue is definitely due to iris' change of chunking as raised in SciTools/iris#4107. Replacing as_lazy_data from _lazy_data.py in iris 3.0.3 with the one from iris 2.2.0 gets us down to 11s.

cpelley commented 3 years ago

From our chat on Monday, it seemed that it might be acceptable, for now, to read the data into memory on load. I'm not very familiar with clize or with Python typing to any great extent, but it looks as though the options to load_cube, in particular no_lazy_load, are not exposed to the CLI.

https://github.com/metoppv/improver/blob/43054bd803f6789c1cee3e36d3b31c495489e3e2/improver/utilities/load.py#L114-L118

The load is achieved through typing via the clize value_converter decorator.

https://github.com/metoppv/improver/blob/43054bd803f6789c1cee3e36d3b31c495489e3e2/improver/cli/weighted_blending.py#L40-L41

I think ideally this should be exposed to any of the commands (weighted-blending, regrid or otherwise) performing a load. Then perhaps we can apply this argument within the improver suite for the affected cases.

We could expose this on the top-level improver CLI, but it would perhaps be nasty to then have it as a global var.

Any suggestions?

Cheers

PaulAbernethy commented 3 years ago

@BelligerG @LaurenceBeard to raise visibility. Green team might not check this board too often :)

BelligerG commented 3 years ago

If you didn't want to expose the option you could always set cli.inputcubes for those CLIs as a new type of object defined in improver/improver/cli/__init__.py and pass the no_lazy_load option?
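
Roughly along these lines, as a rough sketch (names are illustrative; it assumes the value_converter decorator and the load_cube no_lazy_load keyword already linked above):

from clize.parser import value_converter

@value_converter
def inputcube_nolazy(to_convert):
    """Illustrative converter: load a cube with lazy loading disabled."""
    from improver.utilities.load import load_cube
    # no_lazy_load is the keyword shown in improver/utilities/load.py above
    return load_cube(to_convert, no_lazy_load=True)

The affected CLIs could then annotate their positional inputs with this type instead of cli.inputcube.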

cpelley commented 3 years ago

Thanks @PaulAbernethy, should I assign the "Project" -> repositories -> "Green team"?

PaulAbernethy commented 3 years ago

@cpelley I wouldn't worry about it. Think most people know about the thread now

cpelley commented 3 years ago

If you didn't want to expose the option...

Well actually, I think having it as an option would be ideal :) but I couldn't immediately see a way to do this neatly across all commands while retaining the approach we have using positional arguments for inputs (cli.inputcubes).

I think it would be great to be able to choose lazy or not lazy on both a per-command and a per-input-file basis. Having it configurable would mean being able to specify the behaviour within the suite, which would be neat.

improver <commandA> <input-file1> <input-file2> --no-lazy ...

...you could always set cli.inputcubes for those CLIs as a new type of object defined in improver/improver/cli/__init__.py and pass the no_lazy_load option?

If you mean defining a new type (for example improver.cli.inputcube_nolazy) and then having the CLIs for weighted-blending and regrid use this new nolazy type, then that sounds reasonable, except that the nolazy behaviour would be fixed for those commands (irrespective of whether the workaround is needed or not). Also, there are at least two interfaces affected by this performance degradation right now, regrid and weighted-blending, and perhaps more remain to be uncovered. In which case, choosing lazy or not lazy via the suite (i.e. on at least a per-command basis) would be highly desirable.

Hmmm. Tell you what, I'll put together my "nasty" solution as a PR, then we can discuss the pros/cons and perhaps better alternatives via clize or otherwise:

improver <commandA> --no-lazy <input-file1> <input-file2> ...

Thanks @BelligerG

BelligerG commented 3 years ago

Hmmm. Tell you what, I'll put together my "nasty" solution as a PR, then we can discuss the pros/cons and perhaps better alternatives via clize or otherwise:

Sounds good, it was just a quick alternative :-)

cpelley commented 3 years ago

The issue with regrid is two-fold:

The first and most significant is that we are currently calling AdjustLandSeaPoints.correct_where_input_true, once for land and once for ocean, for EACH x-y slice of the input cube ("multi-realization data"). This unnecessarily re-evaluates an unstructured nearest-neighbour problem via calls to scipy.interpolate.griddata for each of these slices. The nearest-neighbour calculation only needs to be performed once for land and once for ocean...

https://github.com/metoppv/improver/blob/1088b59c1fe32a8bfab267b6be956b737633a966/improver/regrid/landsea.py#L260

The second 'issue' is where dask rears its head: we create these x-y slices as well as various copies, and due to the change in chunking rules for NetCDF in recent iris, we are then hit (as with the weighted blending) by an unnecessary read-from-disk bottleneck.

Based on some quick changes locally: caching the nearest neighbour regridder by utilising ckdtree and also realising the data on load, I can confirm we can go from ~2m40s to <15s