As per an offline msg from @benfitzpatrick, I have run the following with profiling switched on.
With `--spatial-weights-from-mask`:
```
improver --profile=- weighted-blending \
    --attributes-config grid-blended-attributes.json \
    --coordinate model_configuration --cycletime 20211004T1100Z \
    --spatial-weights-from-mask ...
```
```
ncalls tottime percall cumtime percall filename:lineno(function)
...
4123 48.671 0.012 48.671 0.012 {method 'acquire' of '_thread.lock' objects}
...
```
Without `--spatial-weights-from-mask`:
```
~/git/improver/bin/improver --profile=- weighted-blending \
    --attributes-config grid-blended-attributes.json \
    --coordinate model_configuration --cycletime 20211004T1100Z \
    ...
```
```
ncalls tottime percall cumtime percall filename:lineno(function)
...
366 2.755 0.008 2.755 0.008 {method 'acquire' of '_thread.lock' objects}
...
```
This looks troubling, since it appears to be a threading issue :( (I hope not, as I guess that would make dask the prime candidate).
I'll have a closer look tomorrow - perhaps calling the profiler in the way I have above (`improver --profile`) is not correct. The internally built user guide seems out of date on this, suggesting we should be able to call `improver weighted-blending --profile`, but that doesn't seem to be the case.
Oh and I'm not yet sure how the improver call that Ben has shared relates to those shown in the ppdev suite nightly run identified as causing concern.
Worth saying we set a number of thread-related environment variables like MKL_NUM_THREADS=1 that I forgot to mention before!
Ah OK, yes I'm somewhat familiar. Cheers
So pretty sure this is a dask issue. The issue goes away if we realise the data early:
```python
 288     model_id_attr=model_id_attr,
 289 )
 290 cube = merger(cubelist, cycletime=cycletime)
*291 cube.data
 292
 293 if "model" in self.blend_coord:
 294     self.blend_coord = copy(MODEL_BLEND_COORD)
```
Realising the data here (the line marked `*`) avoids issues with dask later on. We then complete in 13s.
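For reference, a minimal sketch of what that realisation step does in iris terms (standard `iris.cube.Cube` API; `cube` here stands for the merged cube above):

```python
# Accessing cube.data converts the lazy, dask-backed payload into a realised
# numpy array in place, so the subsequent slicing no longer triggers repeated
# reads of badly-chunked data.
if cube.has_lazy_data():
    cube.data
```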
Apologies for the delay. It took me a little time to figure out how to run the improver command via a Python script (enabling the use of a custom profiler, see below, and allowing debugging via PyCharm).
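For anyone else wanting to do the same, a generic sketch of how a console script can be driven from Python (standard library only, not an improver-specific API; the remaining CLI arguments stay elided as in the commands above):

```python
# Generic sketch: execute the installed "improver" console script in-process so a
# debugger or a sampling profiler (e.g. kapture) can wrap the run.
import runpy
import shutil
import sys

script = shutil.which("improver")  # path to the installed console-script wrapper
sys.argv = [
    "improver", "weighted-blending",
    "--attributes-config", "grid-blended-attributes.json",
    "--coordinate", "model_configuration",
    "--cycletime", "20211004T1100Z",
    # ... further options and the input/output paths elided, as in the commands above
]
runpy.run_path(script, run_name="__main__")
```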
Using the sample-based profiler kapture, I get the following profile:
In particular, we see two branches where most of the time is spent, corresponding to the following areas of code:
Looking at the first identified problem area (spatial weights):
I suspect the issue is due to the slicing over the dask array (from `cube_to_collapse`), where it has inappropriate chunking for the subsequent slicing operation.
```python
>>> cube_to_collapse
<iris 'Cube' of probability_of_lwe_thickness_of_snowfall_amount_above_threshold / (1) (model_id: 3; lwe_thickness_of_snowfall_amount: 23; projection_y_coordinate: 970; projection_x_coordinate: 1042)>
>>> cube_to_collapse.shape
(3, 23, 970, 1042)
>>> cube_to_collapse.lazy_data().chunksize
(1, 23, 970, 1042)
>>> coords_to_slice_over
['model_id', 'projection_y_coordinate', 'projection_x_coordinate']
>>> first_slice.lazy_data().chunksize
(1, 970, 1042)
>>> _to_collapse.lazy_data().chunksize
(1, 23, 970, 1042)
>>> cube_slice.lazy_data().chunksize
(1, 970, 1042)
```
We see that only the first dimension of the original `cube_to_collapse` actually has a chunksize smaller than the dimension length. Each slice corresponds to the `coords_to_slice_over` coords, which map to dimensions 0, 2 and 3. I would guess that each slice then unnecessarily realises 23x the amount of data (a copy) that it actually needs, due to the chunking of the original dask array (TBC).
This then explains why realising the `cube_to_collapse` cube array BEFORE doing this slicing makes the problem go away.
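To illustrate the chunking/slicing interaction in isolation, here is a pure-dask sketch (random data standing in for the NetCDF-backed array, and shapes scaled down from the cube above):

```python
import dask.array as da

# Mirror the cube_to_collapse layout (model_id, threshold, y, x), chunked so that
# each chunk spans all thresholds, as with the chunksize (1, 23, 970, 1042) above.
data = da.random.random((3, 23, 97, 104), chunks=(1, 23, 97, 104))
print(data.numblocks)                 # (3, 1, 1, 1): one chunk per model_id

# A single-threshold slice still depends on chunks holding all 23 thresholds, so
# realising it pulls in ~23x more data than the slice itself returns.
one_threshold = data[:, 0, :, :]
print(one_threshold.compute().shape)  # (3, 97, 104)

# With per-threshold chunks, the same slice only touches the data it needs.
rechunked = data.rechunk((1, 1, 97, 104))
one_threshold_cheap = rechunked[:, 0, :, :]
```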
Saw something suspiciously similar to this in ants before.
Regarding the second problem area (assuming we continue not to realise the data early, that is), we can look further up the branch and see that the problem is again likely caused by slicing the dask array:
`scitools/production-os43-2`
`scitools/production-os45-1`
I ran it again, except this time with the data located in my `/var/tmp`, and got the same results.
I'll try running the problem with the version of improver that we used with the earlier production environment, to see whether the chunking behaviour is indeed different. Thanks to @PaulAbernethy for pointing me to the correct tagged release of improver for this env (v0.16.2).
`0.19.0` is then the last version that used iris 2.
EDIT: Using `0.16.2` and `0.19.0` runs it in ~15s with `scitools/production-os43-2`.
So running improver `0.19.0` against `scitools/production-os43-2` gives the speeds we expect because, as I thought, the chunksizes are in line with how we are slicing the cube.
That is, debugging the same variables we described in https://github.com/metoppv/improver/issues/1579#issuecomment-938788589, gives:
```python
>>> cube_to_collapse.lazy_data().chunksize
(1, 1, 970, 1042)
>>> first_slice.lazy_data().chunksize
(1, 970, 1042)
```
As we can see, this time the original dask array has a chunksize of 1 in BOTH the first and second dimensions, meaning that the cube slices aren't realising more than they need (i.e. we don't have the 23x extra data that we have in recent versions, where the chunksize is 23 in the second dim).
The question then becomes: what has caused this change in the chunksizes assigned to the dask array of `cube_to_collapse`? A few possibilities:

- `improver`, where we do something to derive this cube differently (perhaps through an operation affecting chunksizes differently now - for example merge/slice/concatenation pathways).
- `iris`, where it instantiates our loaded cube (the dask array contained within it) with a different chunksize.
- `dask`, if iris or otherwise through dask usage results in a different heuristic defining the chunksizes.

So looking further, I see we do a cubelist merge, but the issue with chunksizes originates at the very step of loading the cubes. It seems this is due to a difference in behaviour of iris in choosing 'suitable chunksizes': https://scitools-iris.readthedocs.io/en/latest/whatsnew/2.3.html (iris code changes).
I suspect this issue has been noticed and raised in https://github.com/SciTools/iris/issues/4107
The issue is definitely due to iris' change of chunking, as raised in SciTools/iris#4107. Replacing `as_lazy_data` from `_lazy_data.py` in iris 3.0.3 with the one from iris 2.2.0 gets us down to 11s.
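As an aside, a possible workaround that avoids patching iris would be to rechunk the loaded cube's lazy data to match how we slice it; a hypothetical, untested sketch:

```python
# Hypothetical workaround sketch (not what was benchmarked above): swap the cube's
# lazy payload for a rechunked view so each (model, threshold) x-y slice maps to
# its own chunk, rather than editing iris' as_lazy_data.
lazy = cube.lazy_data()                        # e.g. chunks of (1, 23, 970, 1042)
rechunked = lazy.rechunk((1, 1) + lazy.shape[2:])
cube = cube.copy(data=rechunked)               # Cube.copy accepts a replacement data payload
```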
From our chat on Monday, it seemed that it might be acceptable, for now, to read the data into memory on load.
I'm not familiar with clize, or to a great extent with Python typing, but it looks as though options for `load_cube`, in particular `no_lazy_load`, are not exposed to the CLI. The load is achieved through typing, via the clize `value_converter` decorator.
I think ideally this should be exposed for any of the commands (`weighted-blending`, `regrid` or otherwise) performing a load. Then perhaps we can apply this argument within the improver-suite for the cases affected.
We could expose this to the top-level `improver` CLI, but perhaps it would be nasty to then have it as a global var.
Any suggestions?
Cheers
@BelligerG @LaurenceBeard to raise visibility. Green team might not check this board too often :)
If you didn't want to expose the option you could always set `cli.inputcubes` for those CLIs as a new type of object defined in `improver/improver/cli/__init__.py` and pass the `no_lazy_load` option?
Thanks @PaulAbernethy, should I assign the "Project" -> repositories -> "Green team"?
@cpelley I wouldn't worry about it. Think most people know about the thread now
> If you didn't want to expose the option...
Well actually, I think having it as an option would be ideal :) but I couldn't immediately see a way to do this neatly across all commands while retaining the approach we have of using positional arguments for inputs (`cli.inputcubes`).
I think it would be great to have lazy or not lazy on both a per-command and per-input-file basis. Having it configurable would mean being able to specify the behaviour within the suite, which would be neat.
```
improver <commandA> <input-file1> <input-file2> --no-lazy ...
```
> ...you could always set `cli.inputcubes` for those CLIs as a new type of object defined in `improver/improver/cli/__init__.py` and pass the `no_lazy_load` option?
If you mean defining a new type (for example `improver.cli.inputcube_nolazy`) and then having the CLIs for `weighted-blending` and `regrid` use this new nolazy type, that sounds reasonable, except that the nolazy behaviour would then be fixed for those commands (irrespective of whether they need the workaround or not). Also, there are at least two interfaces right now affected by this performance degradation, that is `regrid` and `weighted-blending`; perhaps there are more affected still to be uncovered. In which case, choosing lazy or not lazy via the suite (i.e. on a per-command basis at least) would be highly desirable.
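For concreteness, a rough sketch of what such an `inputcube_nolazy` converter might look like, assuming the existing pattern of clize `value_converter` functions in `improver/cli/__init__.py` and the `no_lazy_load` option of `load_cube` mentioned above (the import location is assumed, not verified):

```python
from clize.parser import value_converter


@value_converter
def inputcube_nolazy(to_convert):
    """Convert a filepath CLI argument to an iris cube, realising data on load."""
    from improver.utilities.load import load_cube  # assumed import location

    return load_cube(to_convert, no_lazy_load=True)
```

A CLI wanting eager loading could then annotate the relevant positional inputs with this converter in place of the existing lazy one.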
Hmmm. Tell you what, I'll put together my "nasty" solution as a PR, then we can discuss the pros/cons and perhaps better alternatives via clize or otherwise:
```
improver <commandA> --no-lazy <input-file1> <input-file2> ...
```
Thanks @BelligerG
> Hmmm. Tell you what, I'll put together my "nasty" solution as a PR, then we can discuss the pros/cons and perhaps better alternatives via clize or otherwise
Sounds good, it was just a quick alternative :-)
The issue with regrid is two-fold:

1. The first and most significant is that we are currently calling `AdjustLandSeaPoints.correct_where_input_true`, once for land and once for ocean, for EACH x-y slice of the input cube ("multi-realization data"). This unnecessarily re-evaluates an unstructured nearest-neighbour problem via calls to `scipy.interpolate.griddata` for each of these slices. The nearest-neighbour calculation need only be performed once for land and once for ocean...
2. The second 'issue' is then where dask rears its head: we create these x-y slices as well as various copies. Due to the change in chunking rules for NetCDF in recent iris, we are then hit (as with the weighted blending) with an unnecessary read-from-disk bottleneck.
Based on some quick changes locally (caching the nearest-neighbour regridder by utilising `cKDTree`, and also realising the data on load), I can confirm we can go from ~2m40s to <15s.
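For illustration, a rough sketch of the caching idea (hypothetical helper names, not improver's actual implementation): build the nearest-neighbour lookup once per land/ocean selection with `scipy.spatial.cKDTree`, then reuse the resulting indices for every x-y slice instead of calling `scipy.interpolate.griddata` per slice.

```python
import numpy as np
from scipy.spatial import cKDTree


def build_nn_index(src_points: np.ndarray, tgt_points: np.ndarray) -> np.ndarray:
    """Return, for each target point, the index of its nearest source point."""
    tree = cKDTree(src_points)           # built once per land/ocean selection
    _, nearest = tree.query(tgt_points)  # nearest-neighbour indices, reused per slice
    return nearest


# Applying the regrid to each x-y slice then reduces to a cheap fancy-index:
#   slice_out = slice_src_values[nearest].reshape(target_grid_shape)
```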
Currently the nightly suite run demonstrates issues with some commands involving mask handling.
ppdev suite output:
Note the difference in run-time between the two.