nci / scores

Metrics for the verification, evaluation and optimisation of forecasts, predictions or models.
https://scores.readthedocs.io/
Apache License 2.0

New score request: Fractions Skill Score #188

Closed by tennlee 4 months ago

tennlee commented 6 months ago

Scale-Selective Verification of Rainfall Accumulations from High-Resolution Forecasts of Convective Events in: Monthly Weather Review Volume 136 Issue 1 (2008) (ametsoc.org)

https://www.researchgate.net/publication/269222763_Fast_calculation_of_the_Fractions_Skill_Score

Users of scores have requested that the Fractions Skill Score be added. The two references above give the definition of the score and information on how it could be implemented.

nicholasloveday commented 6 months ago

We can confirm that the implementation works by comparing it to Nathan Faggian's implementation (copied over at https://github.com/nathan-eize/fss/blob/master/fss_core.py), applied via xr.apply_ufunc.
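
For reference, a minimal sketch of how that comparison harness could be wired up; the wrapper name and the "lat"/"lon" dim names are illustrative, not an existing scores API:

```python
import xarray as xr

# Hedged sketch: wrap any 2-D FSS function (e.g. one adapted from fss_core.py,
# or the new scores implementation) so it can be applied over the spatial dims
# of DataArrays and broadcast over any remaining dims.
def apply_fss_2d(fss_2d_func, obs: xr.DataArray, fcst: xr.DataArray) -> xr.DataArray:
    return xr.apply_ufunc(
        fss_2d_func,      # callable: (obs_2d, fcst_2d) numpy arrays -> scalar FSS
        obs,
        fcst,
        input_core_dims=[["lat", "lon"], ["lat", "lon"]],
        vectorize=True,   # loop over any non-spatial dims
    )
```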

nikeethr commented 6 months ago

The obs & fcst fields are compared against a threshold field. The result is a binary field for each.

The algorithm mainly operates on a 2-D binary field (but can probably be generalised). The cumulative sum (or integral image) of the 2-D binary field is the major part of the computation; it is done once and cached for each of the obs and fcst binary fields. The cached tables are then used to compute the fractionals over various windows (typically fixed-size sliding windows).

Since addition is associative, the cumulative sum computation can be decomposed quite easily into parallel operations, e.g. rows first, then columns (this looks similar to what Nathan F. has done).
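
For example, in numpy the integral image falls out of two chained cumulative sums (a minimal illustration, not the scores code):

```python
import numpy as np

# Integral image (summed-area table) of a 2-D binary field: prefix-sum along
# one axis, then the other. The order doesn't matter since addition is associative.
binary = np.random.default_rng(0).random((8, 8)) > 0.7
integral = np.cumsum(np.cumsum(binary.astype(np.int64), axis=0), axis=1)
```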

Once that's done, the fractional for a window can be computed using 4 reference points, e.g.

     0,0                                                       
       ┌────┬────────────┐                                     
       │    │            │                                     
       │   a│           b│                                     
       ├────┼────────────┤ ▲                                   
       │    │\\\\\\\\\\\\│ │                                   
       │    │\\\\\\\\\\\\│ │                                   
       │    │\\\\\\\\\\\\│ │ m                                 
       │    │\\\\\\\\\\\\│ │                                   
       │    │\\\\\\\\\\\\│ │                                   
       │    │\\\\\\\\\\\\│ │                                   
       │   c│\\\\\\\\\\\d│ │                                   
       └────┴────────────┘ ▼                                   
            ◄────────────►                                     
                  n                                            
     shaded fractional = (d - b - c + a) / (m x n), for an arbitrary m x n window
     where a, b, c, d are reference values from the cumsum field at the corners of the window's bounding box.

The resulting fractionals for obs and fcst can then be used to calculate the fractions skill score (FSS) for a given set of windows (typically fixed-size, sliding).

The algorithm shouldn't be too complex:

- Cache
- FSS
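
To make the 4-reference-point lookup concrete, here is a minimal numpy sketch of the window-fraction computation (illustrative only, not the eventual scores implementation; zero-padding the integral image is just one way to handle windows touching the top/left edge):

```python
import numpy as np

def window_fractions(binary, window):
    """Fraction of ones in every m x n sliding window of a 2-D binary field,
    using the cached integral image and the (d - b - c + a) corner lookup."""
    m, n = window
    # Zero-pad so every window can use the same four-corner formula.
    integral = np.zeros((binary.shape[0] + 1, binary.shape[1] + 1), dtype=np.int64)
    integral[1:, 1:] = np.cumsum(np.cumsum(binary.astype(np.int64), axis=0), axis=1)

    a = integral[:-m, :-n]  # top-left reference points
    b = integral[:-m, n:]   # top-right
    c = integral[m:, :-n]   # bottom-left
    d = integral[m:, n:]    # bottom-right
    return (d - b - c + a) / (m * n)
```

The same cached integral image can be reused for every window, which is where most of the saving comes from.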

nikeethr commented 6 months ago

[optimization]

This is purely optional and requires some investigation.

A potential improvement to the above algorithm is to utilise the sparsity of the binary fields, assuming that the choice of threshold results in a sparse field, i.e. count of ones << n x m. Note that the sparsity assumption also holds if count of ones ~ n x m, since then count of zeros << n x m.

In such a scenario we can retrieve a semi-sorted set of k << n x m sparse points in the N-D space and construct a data structure (e.g. a kd-tree) for quick insert and search. Each data node will contain the coordinate as well as the cumulative sum, which can be computed by a method similar to the above, using:

  d = b + c - a + v,
  where,
  d is the cumulative sum
  v is the value at the coordinate (i,j)
  a, b and c are the closest points to (i,j) such that loc(b) < (i-1, j), loc(a) < (i-1, j-1), loc(c) < (i, j-1).
  Note that a,b,c could be from the same coordinate - this is fine.

When computing the fractional for the windows, we just take the closest data node to each point on the bounding box of each window, and use them instead. Given that the data is sparse, these computations can also be cached, as neighbouring windows are likely to re-use the same set of four reference points.

While I believe this will greatly speed up computations for sparse fields, I'm not sure how it will perform if p(1) ~= p(0), but this should be a trivial investigation using binomial sampling. Assuming that the above only works well for sparse matrices, we could attempt to infer sparsity. For inference, a random sampling approach could be used, which is computationally cheap (see the sketch below); alternatively we could pass in an optional argument for the user to decide.
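
A sketch of the cheap sampling-based inference mentioned above (purely illustrative; the function name and sample count are arbitrary):

```python
import numpy as np

def estimate_density(binary_field, n_samples=1024, seed=0):
    """Estimate the fraction of ones in a binary field from a random sample,
    e.g. to decide whether the sparse code path is likely to pay off."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, binary_field.size, size=n_samples)
    return float(binary_field.ravel()[idx].mean())
```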

nikeethr commented 5 months ago

Implementation considerations

After some research, I believe that the python implementation referenced above provides a good trade-off between various approaches. The only thing to note is that if we intend to use dask - e.g. via xarray, it may potentially hinder in-built numpy parallelisation, and would have to be used with care.

Some notes here:

nikeethr commented 5 months ago

numba has a stencil(neighbourhood=...) convenience decorator we can also try out, which makes the implementation look straightforward: https://numba.readthedocs.io/en/stable/user/stencil.html#neighborhood. It may be worth benchmarking against a systems language like rust. I think we need to investigate the following:

As a side note, numba has the additional benefit of being able to interface with CUDA quite easily. A lot of metrics would, in theory, benefit from fast accumulation operations (e.g. CDFs), so findings from this may apply to other parts of scores too.
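
For reference, a minimal sketch of the stencil approach for the window sum (a fixed 5x5 window purely for illustration; the input is assumed to be an integer 0/1 field):

```python
import numpy as np
from numba import njit, stencil

@stencil(neighborhood=((-2, 2), (-2, 2)))
def _window_sum_kernel(a):
    # Sum of the 5x5 neighbourhood centred on each cell.
    acc = 0
    for i in range(-2, 3):
        for j in range(-2, 3):
            acc += a[i, j]
    return acc

@njit
def window_sums(binary_int):
    # Calling the stencil kernel from a jitted function is what gives the speed-up.
    return _window_sum_kernel(binary_int)

# e.g. counts = window_sums((field > threshold).astype(np.int64))  # `field`, `threshold` are placeholders
```

Note that the stencil leaves border cells at the default constant value (0), so edge handling would still need a decision.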

nikeethr commented 5 months ago

Summed-area table

Some brief timings show that multi-threading only really provides benefits at around 1000x1000, where the implementations are tied at around 5-20ms.

For 4000x4000, however, we have:

I still need to profile the entire sliding-window algorithm; however, it should be in the same ballpark, as it's roughly the same order of operations.

Note that the vectorised implementations I tried with BLAS actually slowed things down. It is likely that the overhead of going through BLAS is not offset by its benefits here; most modern processors can issue vector instructions for simpler problems like this without needing BLAS.

nikeethr commented 5 months ago

Was out of action last week. Here are some benchmarks. Meanwhile:

Conclusions:

Here's the implementation plan:

Device
------
11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
Cores: 4 x 2
Mem: 4GB
OS: WSL-2 Ubuntu 22.04 Linux
GPU: Intel Integrated Graphics (Shared memory with host cpu)

Bench
-----
                +-----------+-----------+-----------+
                | 1000x1000 | 2000x2000 | 4000x4000 |
                +===========+===========+===========+
          numpy | 40ms      | 200ms     | 1.2s      | **
        numbagg | 200ms     | 400ms     | 1.2s      | **
          numba | *7ms      | *35ms     | 170ms     | <1>
  rust (native) | 15ms      | 46ms      | 160ms     | <2>
  rust (opencl) | 12ms      | 36ms      | *120ms    | <3>
                +-----------+-----------+-----------+
                **  - estimated using sum area table computations and factored by 3, due to
                      speed reasons at higher resolutions
                <1> - note: native numba - cached jit parallelization
                <2> - note: unoptimized - uses data copying liberally
                <3> - note: kernel building not optimized - can potentially be sped up with
                      caching
nikeethr commented 5 months ago

I just realised that I've done FSS scoring computations per window, as opposed to accumulating the mean square over the window population into a single value.

This adds an additional O(N), N = num windows, which is in general less than the image size, so the final timings should not be more than double what's shown in the table above; in fact we'll be avoiding extra operations per window. In any case, it might be a good time to compare against Nathan's code to verify - I'll add this to the benchmarking table.

Further note: since the final FSS score is a summation, as opposed to a prefix sum (which is more complex to vectorise), it can be trivially reduced further to roughly O(N log2(P + 1) / P), P = num threads.
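
For reference, a minimal numpy sketch of that single-value reduction from the two window-fraction fields, following the standard FSS definition in the Roberts & Lean (2008) paper (function name is illustrative):

```python
import numpy as np

def fss_from_fractions(obs_frac, fcst_frac):
    """Reduce two fields of window fractions to a single FSS value:
    FSS = 1 - MSE / MSE_ref, where MSE = mean((O - F)^2) and
    MSE_ref = mean(O^2) + mean(F^2)."""
    mse = np.mean((obs_frac - fcst_frac) ** 2)
    mse_ref = np.mean(obs_frac ** 2) + np.mean(fcst_frac ** 2)
    return float(1.0 - mse / mse_ref)
```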

nikeethr commented 5 months ago

Ok so I compared it against Nathan's code (fast-fss). The results are below.

As a side (but maybe important) note: I realised after re-doing the benches with Nathan's code that numba may actually be caching some of the results (I suspected this earlier). This isn't inherently a bad thing, but the benchmarks in the previous comment may under-estimate the numba timings, due to the way timeit was used. The following timings, however, are accurate and use uniquely generated inputs and local variables every iteration.

Conclusions:

Side note: odd image dimensions were used in the tests, since not all inputs will be nice N x N multiples of 1000. However, window sizes were restricted to squares, as is also the case in Nathan's code. The performance of fast-fss is roughly equivalent to what's reported in the papers, even though the problem size is 4x larger here; the improvement may be due to hardware and numpy libraries maturing over the past decade.


obs_dist=N(0,1)
window_size=(100,100)

tests verified with flipped images as well; image dimensions (2429,1513) and (1513,2429) have the same results.

-----------------------------------------------------------
TEST 1: standard fcst
-----------------------------------------------------------
Setup: fcst_dist=N(0,1) threshold=0.5 image_size=(2429,1513)

         +----------+-----------+
         | fast-fss | numba <1> |
         +==========+===========+
     fss | 0.9998   | 0.9998    |
         +----------+-----------+
walltime | 2.1s     | 135ms     |
         +----------+-----------+
<1> kernel initialization: 4.7s

The above result is expected since they have the same pdf.

-----------------------------------------------------------
TEST 2: biased fcst
-----------------------------------------------------------
Setup: fcst_dist=N(1,1) image_size=(2429,1513) threshold=0.5

         +----------+-----------+
         | fast-fss | numba <1> |
         +==========+===========+
     fss | 0.7444   | 0.7444    |
         +----------+-----------+
walltime | 2.05s    | 134ms     |
         +----------+-----------+
<1> kernel initialization: 4.8s

-----------------------------------------------------------
TEST 3: variant fcst
-----------------------------------------------------------
Setup: fcst_dist=N(0,4) image_size=(2429,1513) threshold=0.5

         +----------+-----------+
         | fast-fss | numba <1> |
         +==========+===========+
     fss | 0.9327   | 0.9327    |
         +----------+-----------+
walltime | 1.98s    | 111ms     |
         +----------+-----------+
<1> kernel initialization: 4.3s

tennlee commented 5 months ago

Am I reading correctly that the kernel initialisation overhead is more than the time taken by numpy to do the work? Doesn't that mean that ultimately numpy is just faster? I'm not sure how often people will be doing the calculations on very large data sizes either. I am wondering whether it would make sense to have the standard implementation be numpy, and then put the numba implementation into scores.optimised. Maybe I have misunderstood what is going on with the kernel initialisation.

nikeethr commented 5 months ago

I was planning on the base implementation being numpy, with numba as an optimised version.

Re: kernel initialisation. That's a one-time thing. The timings reported are an average over 20 iterations of unique input data, excluding the first run (I probably should have mentioned that). If the user is planning on processing more than 1 field, numba is faster (for example, if aggregating over multiple levels/thresholds/time dims, numba will be the better choice).

The other thing is that even that cost can be pre-emptively avoided, as mentioned earlier, since we can do a static compile rather than JIT on import. Other imports also take a few seconds to initialise anyway.
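
One lighter-weight option along those lines (not full ahead-of-time compilation) is numba's on-disk cache, which persists the compiled machine code so the multi-second JIT cost is only paid once per machine rather than once per process. A sketch:

```python
from numba import njit

# The body here is a stand-in, not the actual FSS kernel; the point is the
# cache=True flag, which writes the compiled code to __pycache__ and reuses
# it on subsequent runs.
@njit(cache=True)
def overlap_count(obs_binary, fcst_binary):
    return (obs_binary * fcst_binary).sum()
```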

nikeethr commented 5 months ago

I've added all my initial implementations in this PR: https://github.com/nci/scores/pull/266

I fully expect to be told to re-arrange most of it. I'm just keeping it there for reference atm. The numpy and numba implementations will be brought into the main scores package. The examples can be run using the instructions here: https://github.com/nci/scores/tree/188-fractions-skill-score/rust

On another note, I noticed that the reference fast_fss implementation accumulates the sum of squares before computing the scores. This can potentially overflow for larger fields - probably no one tried anything much larger than 1000x1000. In any case this doesn't affect the numpy/pure Python implementations, since they automagically handle big ints. We do have to be careful when using numba etc., though; I only ran into the issue in rust using i32, and switching to i64 fixed it.
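
A small, generic illustration of the overflow risk (not tied to any particular implementation):

```python
import numpy as np

# Accumulating sums of squares in 32-bit wraps around silently well before a
# 4000x4000 field is processed; 64-bit accumulation is fine at these sizes.
field = np.full((4000, 4000), 100, dtype=np.int32)
print((field ** 2).sum(dtype=np.int32))  # wraps around -- incorrect
print((field ** 2).sum(dtype=np.int64))  # 160_000_000_000 -- correct
```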

nikeethr commented 5 months ago

Following up from discussions, this issue will now only focus on the numpy implementation.

#266 contains the lumped-together examples as an initial version, so the work is in version control.

Going forward:

nikeethr commented 5 months ago

Aggregation across non-spatial dims

@nicholasloveday pointed me to this: https://journals.ametsoc.org/view/journals/mwre/149/10/MWR-D-18-0106.1.xml.

From what I understand, aggregation across other dimensions is not dissimilar to the 2-D method. However, N-D "volumetric" prefix sums or convolutions are highly non-trivial to compute in reasonable time. If, however, we make the assumption that only the spatial dimensions are aggregated using the sliding window, and every other dimension being reduced is just summed up to a scalar, we can use the following simplified algorithm.

Let's assume our data set is of the form lat x lon x time x lead_time => N x M x T x L
where,
    lat x lon = spatial dims

1. [Map] Map the input field into a list of fields to be aggregated over. i.e.

        lat x lon x time x lead_time -> [lat x lon] x lead_time,
        where,
            [lat x lon] is a list of T spatial fields.
            keep_dim = lead_time; spatial_dims = lat, lon; collapse_dim = time

2. [Compute] For each lead time compute the decomposed fss score for [lat x lon] (len = T).

        The decomposed FSS score is a tuple containing the components required
        to formulate the final score:

        (obs_sum / T, fcst_sum / T, (obs_sum - fcst_sum) / T)

        The division by T is optional, but good to have for larger datasets to
        avoid integer overflow issues when aggregating.

3. [Reduce] Sum the decomposed score across the accumulated dimension.

        fss <- [0; 1 x L]
        for each leadtime, lt {
            obs_sum <- sum_T(fss_decomposed[lt][0])
            fcst_sum <- sum_T(fss_decomposed[lt][1])
            diff_sum <- sum_T(fss_decomposed[lt][2])
            fss[lt] <- diff_sum / (obs_sum + fcst_sum)
        }
        return fss :: size := 1 x L

The reason why we need to keep track of the decomposed score instead of simply
aggregating the final score is because:

1/x + 1/y != 1 / (x+y)
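
A sketch of the reduce step in plain numpy/Python (illustrative only; the container layout is an assumption based on the pseudocode above):

```python
import numpy as np

def aggregate_decomposed_fss(fss_decomposed):
    """Reduce decomposed FSS components across the collapsed dimension.

    `fss_decomposed[lt]` is assumed to be a list of T tuples
    (obs_sum, fcst_sum, diff_sum), one per collapsed field at lead time lt.
    """
    fss = np.zeros(len(fss_decomposed))
    for lt, components in enumerate(fss_decomposed):
        obs_sum = sum(c[0] for c in components)
        fcst_sum = sum(c[1] for c in components)
        diff_sum = sum(c[2] for c in components)
        denom = obs_sum + fcst_sum
        fss[lt] = diff_sum / denom if denom > 0 else 0.0
    return fss  # size := 1 x L
```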

The computational time above is linear if we assume all threads are saturated or we end up being memory bound, i.e. if 1 field takes 1s, aggregation over 100 fields will take 100s (excluding the reduction operation). However, such scenarios are trivially parallelisable, and if numpy is restricted to a single thread, we can expect gains using dask. Though it's hard to tell the actual factor of speed-up of "multi-threaded numpy" vs. "dask + single-threaded numpy", or some hybrid, without trying it out.
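
A sketch of what the dask variant could look like, assuming a per-field function `fss_components` that returns the decomposed tuple for one 2-D field (all names are placeholders):

```python
import dask

def decomposed_fss_parallel(obs_fields, fcst_fields, fss_components):
    """Evaluate the per-field decomposed FSS components in parallel with dask.

    `obs_fields`/`fcst_fields` are sequences of 2-D arrays;
    `fss_components(obs_2d, fcst_2d)` returns (obs_sum, fcst_sum, diff_sum).
    """
    tasks = [
        dask.delayed(fss_components)(o, f)
        for o, f in zip(obs_fields, fcst_fields)
    ]
    # The cheap reduction (as above) happens on the collected results.
    return dask.compute(*tasks)
```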