
napari: a fast, interactive, multi-dimensional image viewer for python
https://napari.org
BSD 3-Clause "New" or "Revised" License

Making Points layer creation and update faster #6727

Closed jules-vanaret closed 4 weeks ago

jules-vanaret commented 3 months ago

References and relevant issues

I work with large points datasets (~10⁶ vertices), and creating or setting the data of a Points layer is rather slow (on the order of 2 seconds on my machine). I dug into the code and found that the culprit is this function, which handles the symbols associated with points in napari/layers/points/_points_utils.py and accounts for around 80% of the total computation time:

def coerce_symbols(array: Iterable) -> np.ndarray:
    """
    Parse an array of symbols and convert it to the correct strings.

    Ensures that all strings are valid symbols and converts aliases.

    Parameters
    ----------
    array : np.ndarray
        Array of strings matching Symbol values.
    """
    # dtype has to be object, otherwise np.vectorize will cut it down to `U(N)`,
    # where N is the biggest string currently in the array.
    array = [SYMBOL_ALIAS.get(k, k) for k in (str(x).lower() for x in array)]
    return np.vectorize(Symbol, otypes=[object])(array)

The purpose of coerce_symbols is to transform "raw" point symbols (e.g. a string like 'o') into proper Symbol instances.

Note that even if symbol is not specified when creating the Points layer (in which case it defaults to 'o'), or is specified as a single symbol for all points, the symbol setter in napari/layers/points/points.py:

@symbol.setter
def symbol(self, symbol: Union[str, np.ndarray, list]) -> None:
    symbol = np.broadcast_to(symbol, self.data.shape[0])

immediately transforms this single symbol into a potentially huge numpy array of size N_vertices.

coerce_symbols is slow for large N_vertices mainly because it loops twice through this huge numpy array of (potentially just duplicated) symbols: once with (str(x).lower() for x in array) and roughly once more with np.vectorize(Symbol, otypes=[object])(array). I believe this can be improved with minor changes.

Description

The PR stems from two ideas to reduce computation time.

  1. If the user provides no symbol when creating the Points layer, or provides a single string or Symbol instance, it is much faster to first convert it to a proper Symbol instance and THEN broadcast it to an array of size N_vertices.
  2. If the user provides a list of N_vertices symbols (composed of strings or Symbol instances), there are not that many valid symbols to choose from (currently 14). This means that instead of converting each string symbol into a Symbol instance individually, it is faster to first find all unique raw symbols in the user-provided symbol, convert those to Symbol instances, and then simply map each raw symbol onto its corresponding Symbol instance.
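
The unique-then-map idea can be sketched independently of napari's internals. In this minimal sketch, the `Symbol` enum and `SYMBOL_ALIAS` table are abbreviated stand-ins for napari's own (which live in `napari/layers/points/_points_constants.py`), and `coerce_symbols_sketch` is a hypothetical name, not the PR's code:

```python
from enum import Enum

import numpy as np


# Abbreviated stand-ins for napari's Symbol enum and SYMBOL_ALIAS table.
class Symbol(Enum):
    DISC = 'disc'
    RING = 'ring'
    SQUARE = 'square'


SYMBOL_ALIAS = {'o': 'disc', 's': 'square'}


def coerce_symbols_sketch(symbols: list) -> np.ndarray:
    """Convert raw symbols by building a dict from the unique values only."""
    # Idea 2: there are few valid symbols, so convert each *unique* raw
    # symbol once instead of converting all N_vertices entries one by one.
    mapping = {}
    for raw in set(symbols):
        key = str(raw).lower()
        mapping[raw] = Symbol(SYMBOL_ALIAS.get(key, key))
    # A single pass over the input then only performs dict lookups.
    return np.array([mapping[raw] for raw in symbols], dtype=object)


result = coerce_symbols_sketch(['o', 's', 'ring', 'o'])
```

For a list with many duplicates, the expensive `Symbol(...)` construction runs at most 14 times instead of N_vertices times.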

The new coerce_symbols function I propose looks like this:

def coerce_symbols(symbol: Union[str, Symbol, np.ndarray, list]) -> np.ndarray:
    """
    Parse an array of symbols and convert it to the correct strings.

    Ensures that all strings are valid symbols and converts aliases.

    Parameters
    ----------
    symbol : str, Symbol, np.ndarray, or list
        A single symbol, or an array/list of symbols, matching Symbol
        values or aliases.
    """
    # if symbol is a unique string or Symbol instance, convert it to a
    # proper Symbol instance
    if isinstance(symbol, (str, Symbol)):
        return np.array(symbol_conversion(symbol), dtype=object)

    # otherwise, create a dictionary to map raw symbols to their
    # Symbol instance counterpart
    symbol_dict = create_symbol_dict(symbol)
    # then use a vectorized "dict.get" to convert the raw symbols to
    # their Symbol instance counterpart quickly
    return fast_dict_get(symbol, symbol_dict)

Instead of operating on an already-broadcast array of symbols, it can take either a single symbol or a list of symbols as input.
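
For intuition, here is a plausible sketch of what the three helpers might look like (hypothetical implementations, not the PR's exact code; `symbol_conversion` is reduced to a simple lowercasing stand-in):

```python
import numpy as np


def symbol_conversion(symbol):
    """Convert a single raw symbol to its canonical form (stand-in)."""
    return str(symbol).lower()


def create_symbol_dict(symbols) -> dict:
    """Build a mapping from each unique raw symbol to its converted value."""
    # Conversion runs once per unique symbol, not once per vertex.
    return {raw: symbol_conversion(raw) for raw in set(symbols)}


def fast_dict_get(symbols, symbol_dict: dict) -> np.ndarray:
    """Vectorized dict lookup; otypes=[object] keeps full objects, not 'U(N)'."""
    return np.vectorize(symbol_dict.get, otypes=[object])(symbols)


lookup = create_symbol_dict(['o', 'O', 'ring'])
out = fast_dict_get(['o', 'O', 'ring', 'o'], lookup)
```

The vectorized `dict.get` is the only per-vertex operation left, which is why the approach scales well.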

The symbol setter in napari/layers/points/points.py is modified to look like this:

@symbol.setter
def symbol(self, symbol: Union[str, np.ndarray, list]) -> None:
    coerced_symbols = coerce_symbols(symbol)
    coerced_symbols = np.broadcast_to(
        coerced_symbols, self.data.shape[0]
    )

It now broadcasts symbols to the number of vertices AFTER converting them to proper Symbol instances.
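
The order matters because broadcasting a single already-converted value is essentially free: `np.broadcast_to` returns a read-only, zero-copy view rather than materializing N_vertices copies. A small illustration, independent of napari:

```python
import numpy as np

# A 0-d object array standing in for one converted Symbol instance.
single = np.array('disc', dtype=object)

# Broadcasting yields a view of shape (N,) with zero stride: no copy,
# no per-vertex conversion work.
broadcast = np.broadcast_to(single, 1_000_000)
```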

Finally, coerce_symbols is also used in the current_symbol setter, but this new implementation does not interfere with its current behavior.

Performance changes

Qualitative performance changes can be assessed using the following script, which simulates a random points dataset with N_vertices vertices.

import numpy as np
import napari
from time import time
from napari.layers.points._points_constants import SYMBOL_ALIAS, Symbol

np.random.seed(2024)

N_vertices = 1_000_000
symbols = list(SYMBOL_ALIAS.keys())
symbols = np.random.choice(symbols, N_vertices).tolist()

# simulate 3D points with 4th dimension (time)
points = np.random.rand(N_vertices, 4) * 100

t0 = time()
points_layer = napari.layers.Points(points, name='points')
print(f"Creating layer with {N_vertices} vertices: {time() - t0:.4f} seconds")

t0 = time()
points_layer = napari.layers.Points(points, name='points', symbol=symbols)
print(f"Creating layer with {N_vertices} vertices and as many symbols: {time() - t0:.4f} seconds")

t0 = time()
points_layer = napari.layers.Points(points, name='points', symbol=Symbol.RING)
print(f"Creating layer with {N_vertices} vertices and a Symbol instance: {time() - t0:.4f} seconds")

For different values of N_vertices, the times to create the Points layer in the three cases above are:

| N_vertices | Old implementation | New implementation |
|------------|--------------------|--------------------|
| 10_000 | 0.06 s, 0.04 s, 0.03 s | 0.04 s, 0.0075 s, 0.0065 s |
| 1_000_000 | 1.35 s, 1.45 s, 1.05 s | 0.24 s, 0.24 s, 0.20 s |
| 10_000_000 | 13.3 s, 14.0 s, 18.0 s | 1.94 s, 2.48 s, 1.92 s |

Note that the proposed implementation scales much better with the number of vertices.

Benchmarks

From what I have been able to see, the current benchmarking suite already measures the time it takes to set the data of a Points layer, so there is no need to test anything else.

Final Checklist

jules-vanaret commented 3 months ago

I realized that broadcast_to was also used to check that the input symbol parameter had the right shape when given as a list or an array. I have added that check back (not only for single symbols).
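
That shape check comes for free from NumPy's broadcasting rules: a length-1 input broadcasts to any size, but a list of the wrong length raises a ValueError. For example:

```python
import numpy as np

n_vertices = 5

# A matching-length array broadcasts to itself.
ok = np.broadcast_to(np.array(['o', 's', 'o', 'o', 's'], dtype=object), n_vertices)

# A wrong-length array cannot be broadcast and raises ValueError,
# which doubles as input validation for the symbol setter.
try:
    np.broadcast_to(np.array(['o', 's'], dtype=object), n_vertices)
    raised = False
except ValueError:
    raised = True
```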

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 92.44%. Comparing base (1dcbade) to head (762426e). Report is 2 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #6727      +/-   ##
==========================================
- Coverage   92.45%   92.44%   -0.01%
==========================================
  Files         617      617
  Lines       55166    55181      +15
==========================================
+ Hits        51001    51010       +9
- Misses       4165     4171       +6
```


psobolewskiPhD commented 3 months ago

This looks to be addressing: https://github.com/napari/napari/issues/6275 Have you checked with vispy main https://github.com/vispy/vispy/pull/2533 ?

DragaDoncila commented 3 months ago

Thanks for the fantastic PR description @jules-vanaret, and for your attention on the points layer 👀

This looks to be addressing: https://github.com/napari/napari/issues/6275

@psobolewskiPhD yep I'd say you're right. Based on the profiling and discussion in that thread it looks like both the vispy.set_data and our own coerce_symbols could be improved. Unless vispy/#2533 removes the need for us to coerce_symbols, it sounds like this PR would be an improvement regardless of the vispy PR?

jules-vanaret commented 3 months ago

Thank you both for taking a look! @psobolewskiPhD all the tests were run in a fresh environment, with Vispy 0.14.1, which I believe is the most recent released version. Do you think an upcoming version of Vispy will make this PR obsolete?

Czaki commented 3 months ago

@brisvag Do you know what is required to release vispy 0.14.2?

brisvag commented 3 months ago

Nice work, I was never happy about this code ^^' @Czaki releasing a patch version for vispy can be done whenever we like I think.

Czaki commented 3 months ago

So maybe you could make a release, and we could check how this PR works with the new vispy?

brisvag commented 3 months ago

Vispy 0.14.2 released :)

psobolewskiPhD commented 3 months ago

FYI I ran the script from the OP (n = 10 million) with the newest vispy and psygnal and I get:

Creating layer with 10000000 vertices: 11.0905 seconds
Creating layer with 10000000 vertices and as many symbols: 12.3066 seconds
Creating layer with 10000000 vertices and a Symbol instance: 7.7045 seconds

Virtually the same as prior to updating them. However, with this PR:

Creating layer with 10000000 vertices: 0.9560 seconds
Creating layer with 10000000 vertices and as many symbols: 1.3793 seconds
Creating layer with 10000000 vertices and a Symbol instance: 0.8674 seconds

Adding a layer using viewer.add_points(data) (with data = rng.random(size=(10_000_000, 2)) * 1000) takes <10 s, which is >10 s faster than the same operation on main.

jules-vanaret commented 3 months ago

Whoops, you beat me to it! :) I see the same orders of magnitude on my end with the newest Vispy.

jules-vanaret commented 2 months ago

@Czaki should I run additional benchmarks or add comments to the code?

Czaki commented 2 months ago

I will try to review tomorrow. Please resolve the conflicts.

DragaDoncila commented 1 month ago

@Czaki any further comments on this one?

jules-vanaret commented 1 month ago

Non-Qt benchmarks failed because one benchmark in the ImageSuite was slower (I don't know how this PR affects images)... For the Qt benchmarks, I don't understand the error. Is it due to the benchmarks taking too long?

psobolewskiPhD commented 1 month ago

I've re-run the non-Qt ones. It looked like a glitch, where a few suddenly jumped to >1 s.

Czaki commented 1 month ago

That looks strange. I have merged main into this PR. I have also run the benchmarks in another PR to check how it looks.

Czaki commented 1 month ago

Hm. The benchmark-run timeout problem seems to have disappeared. Running the benchmarks locally does not show a slowdown.

Let's wait for this benchmark run, but I hope I can add the ready-to-merge label in a few hours.

github-actions[bot] commented 1 month ago

The Qt benchmark run requested by PR #6727 (e37f69c7bdacc85cdf683bd6978990c2e0e6e465 vs 2a2205218ac1795c6a59b06c63df6afac016bdf2) has finished with status 'failure'. See the CI logs and artifacts for further details. The non-Qt benchmark run requested by PR #6727 (e37f69c7bdacc85cdf683bd6978990c2e0e6e465 vs 2a2205218ac1795c6a59b06c63df6afac016bdf2) has finished with status 'failure'. See the CI logs and artifacts for further details.

github-actions[bot] commented 1 month ago

The Qt benchmark run requested by PR #6727 (f41f4862080d7a63c5b62537225b914cf8169ec6 vs 2a2205218ac1795c6a59b06c63df6afac016bdf2) has finished with status 'failure'. See the CI logs and artifacts for further details. The non-Qt benchmark run requested by PR #6727 (f41f4862080d7a63c5b62537225b914cf8169ec6 vs 2a2205218ac1795c6a59b06c63df6afac016bdf2) has finished with status 'failure'. See the CI logs and artifacts for further details.

Czaki commented 1 month ago

It looks like a global translation dict solves the performance problems:

| Change   | Before [2a220521]    | After [f41f4862]    |   Ratio | Benchmark (Parameter)                                          |
|----------|----------------------|---------------------|---------|----------------------------------------------------------------|
| -        | 16.2±0.2ms           | 9.19±0.1ms          |    0.57 | benchmark_points_layer.Points3DSuite.time_create_layer(4096)   |
| -        | 15.5±0.3ms           | 8.24±0.04ms         |    0.53 | benchmark_points_layer.Points2DSuite.time_create_layer(4096)   |
| -        | 40.8±0.9ms           | 12.5±0.06ms         |    0.31 | benchmark_points_layer.Points2DSuite.time_create_layer(16384)  |
| -        | 40.8±0.3ms           | 12.5±0.1ms          |    0.31 | benchmark_points_layer.Points3DSuite.time_create_layer(16384)  |
| -        | 141±2ms              | 30.0±0.6ms          |    0.21 | benchmark_points_layer.Points2DSuite.time_create_layer(65536)  |
| -        | 138±0.9ms            | 27.4±0.6ms          |    0.2  | benchmark_points_layer.Points3DSuite.time_create_layer(65536)  |
| -        | 1.66±0.01s           | 62.4±1ms            |    0.04 | benchmark_python_layer.CoerceSymbolsSuite.time_coerce_symbols1 |
| -        | 1.66±0.01s           | 63.3±0.5ms          |    0.04 | benchmark_python_layer.CoerceSymbolsSuite.time_coerce_symbols2 |
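
Since all valid symbols and aliases are known up front, the translation dict can be built once at module import time instead of per call. A hypothetical sketch of the idea (abbreviated `Symbol`/`SYMBOL_ALIAS` stand-ins, and `SYMBOL_DICT`/`coerce_symbols_global` are illustrative names, not the merged code):

```python
from enum import Enum

import numpy as np


class Symbol(Enum):
    # Abbreviated stand-in for napari's Symbol enum.
    DISC = 'disc'
    RING = 'ring'
    SQUARE = 'square'


SYMBOL_ALIAS = {'o': 'disc', 's': 'square'}

# Built once at import time: every accepted spelling (enum value, alias,
# or Symbol instance itself) maps directly to its Symbol instance.
SYMBOL_DICT = {s.value: s for s in Symbol}
SYMBOL_DICT.update({alias: Symbol(v) for alias, v in SYMBOL_ALIAS.items()})
SYMBOL_DICT.update({s: s for s in Symbol})


def coerce_symbols_global(symbols) -> np.ndarray:
    """Per-call work is reduced to lookups into the precomputed dict."""
    return np.vectorize(SYMBOL_DICT.get, otypes=[object])(symbols)


coerced = coerce_symbols_global(['o', 'ring', 's'])
```

Moving the dict construction out of the hot path is what the `time_coerce_symbols` rows above reflect: only the vectorized lookup remains per call.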

@jules-vanaret are you ok with these changes?

For typing, I prefer to wait for #6866, as it will bump mypy and the current output is a false negative.

jules-vanaret commented 1 month ago

Sorry for the delay. Thank you for your work @Czaki, precomputing the dict makes much more sense! Some tests are still failing though.

Czaki commented 1 month ago

Yes, but I prefer to merge #6866 first, as it will affect the failing tests.

github-actions[bot] commented 1 month ago

The Qt benchmark run requested by PR #6727 (0b86e447346d74eaf37e1fbb2cb1fedab5a049ba vs 1dcbadea20149c43d49764afd3fe8b39b76998a9) has finished with status 'success'. See the CI logs and artifacts for further details. The non-Qt benchmark run requested by PR #6727 (0b86e447346d74eaf37e1fbb2cb1fedab5a049ba vs 1dcbadea20149c43d49764afd3fe8b39b76998a9) has finished with status 'success'. See the CI logs and artifacts for further details.

Czaki commented 1 month ago

@jules-vanaret Could you check the current state? I have cleaned up the code a little and tweaked the documentation. For me, it is close to merging.

@psobolewskiPhD could you also check?

psobolewskiPhD commented 1 month ago

Everything still looks good to me. It's even faster:

Creating layer with 1000000 vertices: 0.1736 seconds
Creating layer with 1000000 vertices and as many symbols: 0.3118 seconds
Creating layer with 1000000 vertices and a Symbol instance: 0.0992 seconds

jules-vanaret commented 1 month ago

I just rechecked the code, and some cases are indeed faster (those that are not are only marginally slower). From my point of view, everything looks OK!

jules-vanaret commented 4 weeks ago

Thanks @jni for merging the PR! :) And thank you @brisvag for the Vispy push, @psobolewskiPhD for the thorough benchmarking, and particularly @Czaki for all your work on optimization, clean up, and general coordination of the PR.