nci / scores

Metrics for the verification, evaluation and optimisation of forecasts, predictions or models.
https://scores.readthedocs.io/
Apache License 2.0

New score request: Fractions Skill Score #188

Closed by tennlee 4 months ago

tennlee commented 6 months ago

Scale-Selective Verification of Rainfall Accumulations from High-Resolution Forecasts of Convective Events in: Monthly Weather Review Volume 136 Issue 1 (2008) (ametsoc.org)

https://www.researchgate.net/publication/269222763_Fast_calculation_of_the_Fractions_Skill_Score

Users of scores have requested that the Fractions Skill Score be added. The two references above give the definition of the score and information on how it could be implemented.

nicholasloveday commented 6 months ago

We can confirm that the implementation works by comparing it to Nathan Faggian's implementation (copied over at https://github.com/nathan-eize/fss/blob/master/fss_core.py), applied via xr.apply_ufunc.
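
For reference, a minimal sketch of how that comparison harness could be wired up; the wrapper name and the "lat"/"lon" dim names are illustrative, not an existing scores API:

```python
import xarray as xr

# Hedged sketch: wrap any 2-D FSS function (e.g. one adapted from fss_core.py,
# or the new scores implementation) so it can be applied over the spatial dims
# of DataArrays and broadcast over any remaining dims.
def apply_fss_2d(fss_2d_func, obs: xr.DataArray, fcst: xr.DataArray) -> xr.DataArray:
    return xr.apply_ufunc(
        fss_2d_func,      # callable: (obs_2d, fcst_2d) numpy arrays -> scalar FSS
        obs,
        fcst,
        input_core_dims=[["lat", "lon"], ["lat", "lon"]],
        vectorize=True,   # loop over any non-spatial dims
    )
```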

nikeethr commented 6 months ago

The obs & fcst fields are compared against a threshold field. The result is a binary field for each.

The algorithm mainly operates on a 2-D binary field (but can probably be generalised). The cumulative sum (or integral image) of the 2-D binary field is the major part of the computation; it is done once and cached for each of the obs and fcst binary fields. The cached tables are then used to compute the fractionals over various windows (typically fixed-size sliding windows).

Since addition is associative, the cumulative sum computation can be decomposed quite easily into parallel operations, e.g. rows first, then columns (this looks similar to what Nathan F. has done).
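
For example, in numpy the integral image falls out of two chained cumulative sums (a minimal illustration, not the scores code):

```python
import numpy as np

# Integral image (summed-area table) of a 2-D binary field: prefix-sum along
# one axis, then the other. The order doesn't matter since addition is associative.
binary = np.random.default_rng(0).random((8, 8)) > 0.7
integral = np.cumsum(np.cumsum(binary.astype(np.int64), axis=0), axis=1)
```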

Once that's done, the fractional for a window can be computed using 4 reference points, e.g.

     0,0                                                       
       ┌────┬────────────┐                                     
       │    │            │                                     
       │   a│           b│                                     
       ├────┼────────────┤ ▲                                   
       │    │\\\\\\\\\\\\│ │                                   
       │    │\\\\\\\\\\\\│ │                                   
       │    │\\\\\\\\\\\\│ │ m                                 
       │    │\\\\\\\\\\\\│ │                                   
       │    │\\\\\\\\\\\\│ │                                   
       │    │\\\\\\\\\\\\│ │                                   
       │   c│\\\\\\\\\\\d│ │                                   
       └────┴────────────┘ ▼                                   
            ◄────────────►                                     
                  n                                            
     shaded fractional = (d - b - c + a) / (m x n), for an arbitrary m x n window
     where a, b, c, d are reference values from the cumsum field at the corners of the window's bounding box.

The resulting fractionals for obs and fcst can then be used to calculate the fractions skill score (FSS) for a given set of windows (typically fixed-size, sliding).

The algorithm shouldn't be too complex:

- Cache
- FSS
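
To make the 4-reference-point lookup concrete, here is a minimal numpy sketch of the window-fraction computation (illustrative only, not the eventual scores implementation; zero-padding the integral image is just one way to handle windows touching the top/left edge):

```python
import numpy as np

def window_fractions(binary, window):
    """Fraction of ones in every m x n sliding window of a 2-D binary field,
    using the cached integral image and the (d - b - c + a) corner lookup."""
    m, n = window
    # Zero-pad so every window can use the same four-corner formula.
    integral = np.zeros((binary.shape[0] + 1, binary.shape[1] + 1), dtype=np.int64)
    integral[1:, 1:] = np.cumsum(np.cumsum(binary.astype(np.int64), axis=0), axis=1)

    a = integral[:-m, :-n]  # top-left reference points
    b = integral[:-m, n:]   # top-right
    c = integral[m:, :-n]   # bottom-left
    d = integral[m:, n:]    # bottom-right
    return (d - b - c + a) / (m * n)
```

The same cached integral image can be reused for every window, which is where most of the saving comes from.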

nikeethr commented 6 months ago

[optimization]

This is purely optional and requires some investigation.

A potential improvement to the above algorithm is to utilise the sparsity of the binary fields, assuming that the choice of threshold results in a sparse field, i.e. count of ones << n x m. Note that the sparsity assumption also holds if count of ones ~ n x m, since then count of zeros << n x m.

In such a scenario we can retrieve a semi-sorted set of k << n x m sparse points in the N-D space and construct a data structure (e.g. a kd-tree) for quick insert and search. Each data node will contain the coordinate as well as the cumulative sum, which can be computed by a method similar to the above, using:

  d = b + c - a + v,
  where,
  d is the cumulative sum
  v is the value at the coordinate (i,j)
  a, b and c are the closest points to (i,j) such that loc(b) < (i-1, j), loc(a) < (i-1, j-1), loc(c) < (i, j-1).
  Note that a,b,c could be from the same coordinate - this is fine.

When computing the fractional for the windows, we just take the closest data node to each point on the bounding box of each window, and use them instead. Given that the data is sparse, these computations can also be cached, as neighbouring windows are likely to re-use the same set of four reference points.

While I believe this will greatly speed up computations for sparse fields, I'm not sure how it will perform if p(1) ~= p(0), but this should be a trivial investigation using binomial sampling. Assuming that the above only works well for sparse matrices, we could attempt to infer sparsity. For inference, a random sampling approach could be used, which is computationally cheap (see the sketch below); alternatively we could pass in an optional argument for the user to decide.
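
A sketch of the cheap sampling-based inference mentioned above (purely illustrative; the function name and sample count are arbitrary):

```python
import numpy as np

def estimate_density(binary_field, n_samples=1024, seed=0):
    """Estimate the fraction of ones in a binary field from a random sample,
    e.g. to decide whether the sparse code path is likely to pay off."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, binary_field.size, size=n_samples)
    return float(binary_field.ravel()[idx].mean())
```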

nikeethr commented 5 months ago

Implementation considerations

After some research, I believe that the python implementation referenced above provides a good trade-off between various approaches. The only thing to note is that if we intend to use dask - e.g. via xarray, it may potentially hinder in-built numpy parallelisation, and would have to be used with care.

Some notes here:

nikeethr commented 5 months ago

numba has a stencil(neighbourhood=...) convenience decorator we can also try out, which makes the implementation look straightforward: https://numba.readthedocs.io/en/stable/user/stencil.html#neighborhood. It may be worth benchmarking against a systems language like rust. I think we need to investigate the following:

As a side note, numba has the additional benefit of being able to interface with CUDA quite easily. A lot of metrics would, in theory, benefit from fast accumulation operations (e.g. CDFs), so findings from this may apply to other parts of scores too.
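
For reference, a minimal sketch of the stencil approach for the window sum (a fixed 5x5 window purely for illustration; the input is assumed to be an integer 0/1 field):

```python
import numpy as np
from numba import njit, stencil

@stencil(neighborhood=((-2, 2), (-2, 2)))
def _window_sum_kernel(a):
    # Sum of the 5x5 neighbourhood centred on each cell.
    acc = 0
    for i in range(-2, 3):
        for j in range(-2, 3):
            acc += a[i, j]
    return acc

@njit
def window_sums(binary_int):
    # Calling the stencil kernel from a jitted function is what gives the speed-up.
    return _window_sum_kernel(binary_int)

# e.g. counts = window_sums((field > threshold).astype(np.int64))  # `field`, `threshold` are placeholders
```

Note that the stencil leaves border cells at the default constant value (0), so edge handling would still need a decision.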

nikeethr commented 5 months ago

Summed-area table

Some brief timings show that multi-threading only really provides benefits at around 1000x1000, where the implementations are tied at around 5-20ms.

For 4000x4000, however, we have:

I still need to profile the entire sliding-window algorithm; however, it should be in the same ballpark, as it's roughly the same order of operations.

Note that the vectorised implementations I tried with BLAS actually slowed things down. It is likely that the overhead of going through BLAS is not offset by its benefits here; most modern processors can issue vector instructions for simpler problems like this without needing BLAS.

nikeethr commented 5 months ago

Was out of action last week. Here are some benchmarks. Meanwhile:

Conclusions:

Here's the implementation plan:

Device
------
11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
Cores: 4 x 2
Mem: 4GB
OS: WSL-2 Ubuntu 22.04 Linux
GPU: Intel Integrated Graphics (Shared memory with host cpu)

Bench
-----
                +-----------+-----------+-----------+
                | 1000x1000 | 2000x2000 | 4000x4000 |
                +===========+===========+===========+
          numpy | 40ms      | 200ms     | 1.2s      | **
        numbagg | 200ms     | 400ms     | 1.2s      | **
          numba | *7ms      | *35ms     | 170ms     | <1>
  rust (native) | 15ms      | 46ms      | 160ms     | <2>
  rust (opencl) | 12ms      | 36ms      | *120ms    | <3>
                +-----------+-----------+-----------+
                **  - estimated using sum area table computations and factored by 3, due to
                      speed reasons at higher resolutions
                <1> - note: native numba - cached jit parallelization
                <2> - note: unoptimized - uses data copying liberally
                <3> - note: kernel building not optimized - can potentially be sped up with
                      caching
nikeethr commented 5 months ago

I just realised that I've done FSS scoring computations per window, as opposed to accumulating the mean square over the window population into a single value.

This adds an additional O(N), N = num windows, which is in general less than the image size, so the final timings should not be more than double what's shown in the table above; in fact we'll be avoiding extra operations per window. In any case, it might be a good time to compare against Nathan's code to verify - I'll add this to the benchmarking table.

Further note: since the final FSS score is a summation, as opposed to a prefix sum (which is more complex to vectorise), it can be trivially reduced further to roughly O(N log2(P + 1) / P), P = num threads.
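
For reference, a minimal numpy sketch of that single-value reduction from the two window-fraction fields, following the standard FSS definition in the Roberts & Lean (2008) paper (function name is illustrative):

```python
import numpy as np

def fss_from_fractions(obs_frac, fcst_frac):
    """Reduce two fields of window fractions to a single FSS value:
    FSS = 1 - MSE / MSE_ref, where MSE = mean((O - F)^2) and
    MSE_ref = mean(O^2) + mean(F^2)."""
    mse = np.mean((obs_frac - fcst_frac) ** 2)
    mse_ref = np.mean(obs_frac ** 2) + np.mean(fcst_frac ** 2)
    return float(1.0 - mse / mse_ref)
```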

nikeethr commented 5 months ago

Ok so I compared it against Nathan's code (fast-fss). The results are below.

As a side (but maybe important) note: I realised after re-doing the benches with Nathan's code that numba may actually be caching some of the results (I suspected this earlier). This isn't inherently a bad thing, but the benchmarks in the previous comment may under-estimate the numba timings, due to the way timeit was used. The following timings, however, are accurate and use uniquely generated inputs and local variables every iteration.

Conclusions:

Side note: odd image dimensions were used in the tests, since not all inputs will be nice N x N multiples of 1000. However, window sizes were restricted to squares, as is also the case in Nathan's code. The performance of fast-fss is roughly equivalent to what's reported in the papers, even though the problem size is 4x larger here; the improvement may be due to hardware and numpy libraries maturing over the past decade.


obs_dist=N(0,1)
window_size=(100,100)

tests verified with flipped images as well; image dimensions (2429,1513) and (1513,2429) have the same results.

-----------------------------------------------------------
TEST 1: standard fcst
-----------------------------------------------------------
Setup: fcst_dist=N(0,1) threshold=0.5 image_size=(2429,1513)

         +----------+-----------+
         | fast-fss | numba <1> |
         +==========+===========+
     fss | 0.9998   | 0.9998    |
         +----------+-----------+
walltime | 2.1s     | 135ms     |
         +----------+-----------+
<1> kernel initialization: 4.7s

The above result is expected since they have the same pdf.

-----------------------------------------------------------
TEST 2: biased fcst
-----------------------------------------------------------
Setup: fcst_dist=N(1,1) image_size=(2429,1513) threshold=0.5

         +----------+-----------+
         | fast-fss | numba <1> |
         +==========+===========+
     fss | 0.7444   | 0.7444    |
         +----------+-----------+
walltime | 2.05s    | 134ms     |
         +----------+-----------+
<1> kernel initialization: 4.8s

-----------------------------------------------------------
TEST 3: variant fcst
-----------------------------------------------------------
Setup: fcst_dist=N(0,4) image_size=(2429,1513) threshold=0.5

         +----------+-----------+
         | fast-fss | numba <1> |
         +==========+===========+
     fss | 0.9327   | 0.9327    |
         +----------+-----------+
walltime | 1.98s    | 111ms     |
         +----------+-----------+
<1> kernel initialization: 4.3s

tennlee commented 5 months ago

Am I reading correctly that the kernel initialisation overhead is more than the time taken by numpy to do the work? Doesn't that mean that ultimately numpy is just faster? I'm not sure how often people will be doing the calculations on very large data sizes either. I am wondering whether it would make sense to have the standard implementation be numpy, and then put the numba implementation into scores.optimised. Maybe I have misunderstood what is going on with the kernel initialisation.

nikeethr commented 5 months ago

I was planning on the base implementation being numpy, with numba as an optimised version.

Re: kernel initialisation. That's a one-time thing. The timings reported are an average over 20 iterations of unique input data, excluding the first run (I probably should have mentioned that). If the user is planning on processing more than 1 field, numba is faster (for example, if aggregating over multiple levels/thresholds/time dims, numba will be the better choice).

The other thing is that even that cost can be pre-emptively avoided, as mentioned earlier, since we can do a static compile rather than JIT on import. Other imports also take a few seconds to initialise anyway.
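
One lighter-weight option along those lines (not full ahead-of-time compilation) is numba's on-disk cache, which persists the compiled machine code so the multi-second JIT cost is only paid once per machine rather than once per process. A sketch:

```python
from numba import njit

# The body here is a stand-in, not the actual FSS kernel; the point is the
# cache=True flag, which writes the compiled code to __pycache__ and reuses
# it on subsequent runs.
@njit(cache=True)
def overlap_count(obs_binary, fcst_binary):
    return (obs_binary * fcst_binary).sum()
```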

nikeethr commented 5 months ago

I've added all my initial implementations in this PR: https://github.com/nci/scores/pull/266

I fully expect to be told to re-arrange most of it. I'm just keeping it there for reference atm. The numpy and numba implementations will be brought into the main scores package. The examples can be run using the instructions here: https://github.com/nci/scores/tree/188-fractions-skill-score/rust

On another note, I noticed that the reference fast_fss implementation accumulates the sum of squares before computing the scores. This can potentially overflow for larger fields - probably no one tried anything much larger than 1000x1000. In any case this doesn't affect the numpy/pure Python implementations, since they automagically handle big ints. We do have to be careful when using numba etc., though; I only ran into the issue in rust using i32, and switching to i64 fixed it.
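
A small, generic illustration of the overflow risk (not tied to any particular implementation):

```python
import numpy as np

# Accumulating sums of squares in 32-bit wraps around silently well before a
# 4000x4000 field is processed; 64-bit accumulation is fine at these sizes.
field = np.full((4000, 4000), 100, dtype=np.int32)
print((field ** 2).sum(dtype=np.int32))  # wraps around -- incorrect
print((field ** 2).sum(dtype=np.int64))  # 160_000_000_000 -- correct
```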

nikeethr commented 5 months ago

Following up from discussions, this issue will now only focus on the numpy implementation.

#266 contains the lumped-together examples as an initial version, so the work is in version control.

Going forward:

nikeethr commented 5 months ago

Aggregation across non-spatial dims

@nicholasloveday pointed me to this: https://journals.ametsoc.org/view/journals/mwre/149/10/MWR-D-18-0106.1.xml.

From what I understand, aggregation across other dimensions is not dissimilar to the 2-D method. However, N-D "volumetric" prefix sums or convolutions are highly non-trivial to compute in reasonable time. If, however, we make the assumption that only the spatial dimensions are aggregated using the sliding window, and every other dimension being reduced is just summed up to a scalar, we can use the following simplified algorithm.

Let's assume our data set is of the form lat x lon x time x lead_time => N x M x T x L
where,
    lat x lon = spatial dims

1. [Map] Map the input field into a list of fields to be aggregated over. i.e.

        lat x lon x time x lead_time -> [lat x lon] x lead_time,
        where,
            [lat x lon] is a list of T spatial fields.
            keep_dim = lead_time; spatial_dims = lat, lon; collapse_dim = time

2. [Compute] For each lead time compute the decomposed fss score for [lat x lon] (len = T).

        The decomposed FSS score is a tuple containing the components required
        to formulate the final score:

        (obs_sum / T, fcst_sum / T, (obs_sum - fcst_sum) / T)

        The division by T is optional, but good to have for larger datasets to
        avoid integer overflow issues when aggregating.

3. [Reduce] Sum the decomposed score across the accumulated dimension.

        fss <- [0; 1 x L]
        for each leadtime, lt {
            obs_sum <- sum_T(fss_decomposed[lt][0])
            fcst_sum <- sum_T(fss_decomposed[lt][1])
            diff_sum <- sum_T(fss_decomposed[lt][2])
            fss[lt] <- diff_sum / (obs_sum + fcst_sum)
        }
        return fss :: size := 1 x L

The reason why we need to keep track of the decomposed score instead of simply
aggregating the final score is because:

1/x + 1/y != 1 / (x+y)
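
A sketch of the reduce step in plain numpy/Python (illustrative only; the container layout is an assumption based on the pseudocode above):

```python
import numpy as np

def aggregate_decomposed_fss(fss_decomposed):
    """Reduce decomposed FSS components across the collapsed dimension.

    `fss_decomposed[lt]` is assumed to be a list of T tuples
    (obs_sum, fcst_sum, diff_sum), one per collapsed field at lead time lt.
    """
    fss = np.zeros(len(fss_decomposed))
    for lt, components in enumerate(fss_decomposed):
        obs_sum = sum(c[0] for c in components)
        fcst_sum = sum(c[1] for c in components)
        diff_sum = sum(c[2] for c in components)
        denom = obs_sum + fcst_sum
        fss[lt] = diff_sum / denom if denom > 0 else 0.0
    return fss  # size := 1 x L
```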

The computational time above is linear if we assume all threads are saturated or we end up being memory bound, i.e. if 1 field takes 1s, aggregation over 100 fields will take 100s (excluding the reduction operation). However, such scenarios are trivially parallelisable, and if numpy is restricted to a single thread, we can expect gains using dask. Though it's hard to tell the actual factor of speed-up of "multi-threaded numpy" vs. "dask + single-threaded numpy", or some hybrid, without trying it out.
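
A sketch of what the dask variant could look like, assuming a per-field function `fss_components` that returns the decomposed tuple for one 2-D field (all names are placeholders):

```python
import dask

def decomposed_fss_parallel(obs_fields, fcst_fields, fss_components):
    """Evaluate the per-field decomposed FSS components in parallel with dask.

    `obs_fields`/`fcst_fields` are sequences of 2-D arrays;
    `fss_components(obs_2d, fcst_2d)` returns (obs_sum, fcst_sum, diff_sum).
    """
    tasks = [
        dask.delayed(fss_components)(o, f)
        for o, f in zip(obs_fields, fcst_fields)
    ]
    # The cheap reduction (as above) happens on the collected results.
    return dask.compute(*tasks)
```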