Closed by pulkkins 3 years ago
Merging #221 (13afd6a) into master (3485498) will increase coverage by 0.82%. The diff coverage is 93.09%.
```diff
@@            Coverage Diff             @@
##           master     #221      +/-   ##
==========================================
+ Coverage   79.94%   80.76%   +0.82%
==========================================
  Files         138      140       +2
  Lines        9971    10632     +661
==========================================
+ Hits         7971     8587     +616
- Misses       2000     2045      +45
```
Flag | Coverage Δ |
---|---|
unit_tests | 80.76% <93.09%> (+0.82% :arrow_up:) |
Flags with carried forward coverage won't be shown.
Impacted Files | Coverage Δ |
---|---|
pysteps/tests/test_interfaces.py | 99.28% <ø> (ø) |
pysteps/nowcasts/linda.py | 92.42% <92.42%> (ø) |
pysteps/nowcasts/interface.py | 95.65% <100.00%> (+0.19% :arrow_up:) |
pysteps/tests/helpers.py | 94.28% <100.00%> (+0.16% :arrow_up:) |
pysteps/tests/test_nowcasts_linda.py | 100.00% <100.00%> (ø) |
pysteps/utils/dimension.py | 82.11% <0.00%> (+0.40% :arrow_up:) |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3485498...13afd6a.
I noticed some duplication in the code: for example, you reimplemented the Tukey tapering window even though it is already available in the utils.tapering module. Probably not something urgent to do now, but I would strongly recommend trying to reduce code duplication by using methods already available in pysteps. If for some reason the existing methods do not cover your needs, we should first try to extend or modify them instead of creating new ones (especially now that V2 #216 will give us more freedom for breaking changes). Otherwise, the risk is ending up with a code base that is impossible to maintain...
What do you think, @pulkkins ? Is there something we can do now or is it better to close this PR and keep this in mind for a small refactoring in V2?
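For context, here is a minimal NumPy sketch of the 1-D Tukey (tapered-cosine) window in question. This is an illustration only, not the actual utils.tapering implementation, whose interface may differ:

```python
import numpy as np

def tukey_window(n, alpha=0.5):
    """Tapered-cosine (Tukey) window of length n.

    alpha is the fraction of the window inside the cosine taper:
    alpha=0 gives a rectangular window, alpha=1 a Hann window.
    """
    if alpha <= 0:
        return np.ones(n)
    x = np.linspace(0.0, 1.0, n)
    w = np.ones(n)
    # rising taper on the left edge
    left = x < alpha / 2
    w[left] = 0.5 * (1 + np.cos(np.pi * (2 * x[left] / alpha - 1)))
    # falling taper on the right edge (mirror image)
    right = x >= 1 - alpha / 2
    w[right] = 0.5 * (1 + np.cos(np.pi * (2 * x[right] / alpha - 2 / alpha + 1)))
    return w
```

A 2-D taper for a rectangular domain can then be built as an outer product, e.g. `np.outer(tukey_window(ny), tukey_window(nx))`.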
Concerning performance: I ran some basic profiling using Swiss radar data (a 640x710 domain) and the default arguments to produce a one-hour nowcast (12 time steps).
```
Computing LINDA nowcast
-----------------------
Inputs
------
dimensions: 640x710
number of time steps: 3
Methods
-------
nowcast type: ensemble
feature detector: blob
extrapolator: semilagrangian
kernel type: anisotropic
Parameters
----------
number of time steps: 12
ARI model order: 1
localization window radius: 128.0
error dist. window radius: 96.0
error ACF window radius: 160.0
ensemble size: 40
parallel workers: 1
seed: 24
Detecting features... found 138 blobs in 1.10 seconds.
Transforming to Lagrangian coordinates... 0.41 seconds.
Estimating the first convolution kernel... 406.42 seconds.
Estimating the ARI(p,1) parameters... 3.27 seconds.
Estimating the second convolution kernel... 304.47 seconds.
Estimating forecast errors... 15.12 seconds.
Estimating perturbation parameters... 561.82 seconds.
```
Note that computing the 40 ensemble members for 12 time steps took more than 1 hour! Peak memory was about 2.5 GB during the model fit.
Hence, increasing the detection threshold to 10 mm/h reduces the number of blobs to 11, considerably speeding up the forecast. Also, the default number of members can easily be reduced to 10. For larger ensemble sizes, we should advise using Dask by specifying a number of workers > 1. That said, the detection threshold is not necessarily the most robust way to control the number of features. Is there any better way of doing it?
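One possibility (a sketch only; the array layout and names below are hypothetical, not pysteps API): rank the detected blobs by detector response and keep the strongest N, which is what a cap on the number of features could do regardless of the threshold:

```python
import numpy as np

def cap_features(coords, responses, max_num_features):
    """Keep at most max_num_features blobs, ranked by detector response.

    coords: (n, 2) array of blob centers; responses: (n,) detector scores.
    Returns the retained coords and responses, strongest first.
    """
    # indices of blobs sorted by descending response, truncated to the cap
    order = np.argsort(responses)[::-1][:max_num_features]
    return coords[order], responses[order]
```

This keeps the result independent of the absolute detection threshold, so the cost is bounded even in widespread-precipitation cases.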
[...]
```
Parameters
----------
number of time steps: 12
ARI model order: 1
localization window radius: 128.0
error dist. window radius: 96.0
error ACF window radius: 160.0
ensemble size: 10
parallel workers: 1
seed: 24
Detecting features... found 11 blobs in 1.10 seconds.
Transforming to Lagrangian coordinates... 0.43 seconds.
Estimating the first convolution kernel... 39.61 seconds.
Estimating the ARI(p,1) parameters... 0.34 seconds.
Estimating the second convolution kernel... 33.70 seconds.
Estimating forecast errors... 1.74 seconds.
Estimating perturbation parameters... 54.22 seconds.
```
Total time is now down to about 6 minutes (ca. 1.6 seconds per time step per member), with a peak memory of 1.2 GB.
Using 10 workers (parallel threads) brings the total time further down to 80 seconds, although at the cost of a larger memory footprint (4 GB):
[...]
```
Parameters
----------
number of time steps: 12
ARI model order: 1
localization window radius: 128.0
error dist. window radius: 96.0
error ACF window radius: 160.0
ensemble size: 10
parallel workers: 10
seed: 24
Detecting features... found 11 blobs in 0.98 seconds.
Transforming to Lagrangian coordinates... 0.27 seconds.
Estimating the first convolution kernel... 7.49 seconds.
Estimating the ARI(p,1) parameters... 0.13 seconds.
Estimating the second convolution kernel... 6.70 seconds.
Estimating forecast errors... 1.57 seconds.
Estimating perturbation parameters... 36.29 seconds.
```
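The per-member parallelism above follows a simple map-over-seeds pattern: each ensemble member gets its own seed and the members are distributed across workers. A standard-library sketch of the pattern (pysteps itself uses Dask for this; `simulate_member` below is a stand-in for the real forecast loop):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def simulate_member(seed, shape=(64, 64)):
    """Stand-in for one ensemble member (the real work is the LINDA forecast loop)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape)

def run_ensemble(n_members, num_workers, base_seed=24):
    """Run members concurrently, one independent seed per member.

    Seeding per member keeps the output reproducible regardless of
    how many workers are used or in which order members finish.
    """
    seeds = [base_seed + i for i in range(n_members)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        members = list(pool.map(simulate_member, seeds))
    return np.stack(members)  # shape: (n_members, ny, nx)
```

The memory trade-off reported above follows from this pattern: each in-flight member holds its own working arrays, so peak memory grows roughly with the number of workers.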
Let's do that in V2. The purpose of this pull request is to implement the first usable version of LINDA. It can be refined later.
With the changes introduced in #225, were you thinking of adding a default max_num_features? Something around 20?
Done.
Very nice, well done @pulkkins! If you agree, I'd like to do one last profiling run with the default parameters and, if everything looks good, I think we can merge.
Edit: here it is for the same case as above (640x710 domain, 12 time steps).
 | Sequential | Parallel |
---|---|---|
Parameters | 25 blobs, 10 members, 1 worker | 25 blobs, 10 members, 10 workers |
Total time (forecast loop) | 11 min (ca. 4 seconds / member / time step) | 4 min (ca. 0.5 seconds / member / time step) |
Peak memory | 1.2 GB | 4.2 GB |
The RTD built for this branch can be found here: https://pysteps.readthedocs.io/en/ide_nowcasting/
This branch implements the Lagrangian INtegro-Difference equation model with Autoregression (LINDA) developed by Pulkkinen et al. (2021).
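As a rough illustration of the integro-difference idea (a spatial convolution combined with an autoregressive update), here is a toy single step in NumPy. The kernel handling and the coefficient are illustrative only, not LINDA's actual localized estimation:

```python
import numpy as np

def ari_step(field, kernel, phi=0.8):
    """One toy integro-difference AR(1) step.

    Convolve the current field with a spatial kernel (FFT-based circular
    convolution, kernel zero-padded to the field size), then scale by the
    autoregressive coefficient phi to produce the next field.
    """
    ny, nx = field.shape
    k = np.zeros((ny, nx))
    ky, kx = kernel.shape
    k[:ky, :kx] = kernel
    conv = np.real(np.fft.ifft2(np.fft.fft2(field) * np.fft.fft2(k)))
    return phi * conv
```

In the actual method the evolution happens in Lagrangian coordinates (advection removed first) and the kernels and AR parameters are estimated locally around the detected features; see Pulkkinen et al. (2021) for the details.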
Remaining issues that will be addressed in the next version: