metno / pyaerocom

Python tools for climate and air quality model evaluation
https://pyaerocom.readthedocs.io/
GNU General Public License v3.0

Optimize performance of _process_statistics_timeseries #1277

Open lewisblake opened 4 months ago

lewisblake commented 4 months ago

Describe the bug

There is a performance bug in the computation of the regional timeseries. The culprit is _process_statistics_timeseries in coldatatojson_helpers.py.

To Reproduce Steps to reproduce the behavior:

  1. Create a test of the aforementioned function.
  2. Profile test
  3. Optimize performance
  4. Profile again

Expected behavior A significant reduction in the runtime of this function.

Additional context Pretty much all projects would benefit from this.

lewisblake commented 4 months ago

Results from profiling one instance of the test: [attached profiling figure "combined"]

calc_statistics takes only about 11% of the time, whereas load_region_mask_xr takes about 60%. A lot of this is spent in built-in numpy methods, but ideally we still wouldn't be calling a function that loads a dataset inside a for-loop.
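One possible direction is to cache the mask so that the expensive load happens at most once per region. Below is a minimal, self-contained sketch of the idea; the loader is a stand-in, and in pyaerocom the same pattern would wrap load_region_mask_xr, whose actual signature may differ.

from functools import lru_cache

import numpy as np


@lru_cache(maxsize=None)
def load_region_mask(region_id: str) -> np.ndarray:
    # stand-in for the expensive file read done by load_region_mask_xr
    print(f"loading mask for {region_id}")  # with the cache, this runs once per region
    return np.random.default_rng(0).random((180, 360)) > 0.5


def process_regions(region_ids, n_timesteps):
    for region_id in region_ids:
        mask = load_region_mask(region_id)  # cached after the first call
        for _ in range(n_timesteps):
            # use the cached mask instead of reloading the dataset inside the loop
            _ = mask.sum()


process_regions(["EUROPE", "NAMERICA"], n_timesteps=100)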

heikoklein commented 4 months ago

Has this test been run with a long time-series with more than one year of hourly data (8760 timesteps at least)? We don't see a bottleneck for short timeseries.

heikoklein commented 4 months ago

It might also be the writing of the ts-json files. Processing the regional statistics of the timeseries for a 3-year dataset of hourly data takes ~46 min and writes out 1686 json files (one for each station) with a total size of 5.3 GB.

lewisblake commented 4 months ago

> Has this test been run with a long time-series with more than one year of hourly data (8760 timesteps at least)? We don't see a bottleneck for short timeseries.

No, this was just a quick first look at profiling tests/aeroval/test_coldatatojson_helpers2.py::test__process_statistics_timeseries

> It might also be the writing of the ts-json files. Processing the regional statistics of the timeseries for a 3-year dataset of hourly data takes ~46 min and writes out 1686 json files (one for each station) with a total size of 5.3 GB.

Agreed, but I'd need to take a deeper look into the logs.
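For reference, one way (not necessarily how it was done here) to profile that single test is to run pytest under cProfile and inspect the most expensive calls:

import cProfile
import pstats

import pytest

# run only the test mentioned above under the profiler
cProfile.run(
    'pytest.main(["tests/aeroval/test_coldatatojson_helpers2.py::test__process_statistics_timeseries"])',
    "profile.out",
)

# print the 20 most expensive calls by cumulative time
stats = pstats.Stats("profile.out")
stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(20)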

heikoklein commented 4 months ago

A short IO test with one of the output-files:

import timeit

from pyaerocom.aeroval.json_utils import read_json, write_json

# file with 46156617 bytes size
data = read_json("ALL-EEA-UTD-concno2-Surface.json")

i = 0

def write():
    global i
    i += 1
    write_json(
        data_dict=data, file_path=f"out{i}.json", ignore_nan=True, round_floats=True
    )

number = 5
timeout = timeit.timeit(
    "write()",
    setup="from pyaerocom.aeroval.json_utils import read_json, write_json",
    globals=globals(),
    number=number,
)

print(
    f"write of {number} files took {timeout}s -> {number*46156617/1024/1024/timeout} MB/s"
)

Running on the same machine and queue, I get an output speed of:

We are writing 7.7 GB to /lustre/storeB/users/charlien/cams2-83-tests/data/cams2-83/test-forecast-long-hourlystats/hm/ts/; at 2.8 MB/s this takes 2816 s, at 49.0 MB/s it is reduced to 160 s.

The data format of this file does not allow for efficient numpy-based processing (dict of dicts). The data in .../test-forecast-long-hourlystats/ts/ is 5.3 GB and is written much faster, since all processing there is numpy-based (dict of numpy arrays), even though it still uses float conversion.

Questions to @AugustinMortier :

  1. Can we drop rounding of this file? (short term)
  2. Can we change the file format? Current:
    {
        "concno2": {
            "EEA-UTD": {
                "Surface": {
                    "ENSEMBLE": {
                        "concno2": {
                            "Germany": {
                                "1622507400000": {
                                    "totnum": 245.0,
                                    "weighted": 0.0,
                                    "num_valid": 0.0,
                                    "refdata_mean": null,
                                    "refdata_std": null,
                                    "data_mean": null,
    to

    {
        "concno2": {
            "EEA-UTD": {
                "Surface": {
                    "ENSEMBLE": {
                        "concno2": {
                            "Germany": {
                                "timesteps": [1622507400000, ...],
                                "totnum": [245.0, ...],
                                "weighted": [0.0, ...],
                                "num_valid": [0.0, ...],
                                "refdata_mean": [null, ...],
                                "refdata_std": [null, ...],
                                "data_mean": [null, ...],
    This would allow us to process the statistics for all timesteps in a single numpy operation (using axis=X) rather than processing the timesteps one after another (see the sketch below).
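A rough sketch of how the proposed layout enables that; the shapes and key names are illustrative, taken from the JSON example above, and this is not pyaerocom's actual processing code:

import numpy as np

rng = np.random.default_rng(0)
data = rng.random((8760, 245))               # one year of hourly values for 245 points
data[rng.random(data.shape) < 0.3] = np.nan  # simulate missing observations

# one numpy reduction per statistic for all timesteps at once, written out as arrays
region_stats = {
    "timesteps": list(range(8760)),          # placeholder epoch-ms timestamps
    "totnum": [float(data.shape[1])] * data.shape[0],
    "num_valid": np.sum(~np.isnan(data), axis=1).astype(float).tolist(),
    "data_mean": np.nanmean(data, axis=1).tolist(),
}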

heikoklein commented 4 months ago

Why are stats already rounded by default? https://github.com/metno/pyaerocom/blob/8913284b9bba4117f04f443036fedd090c25317d/pyaerocom/stats/stats.py#L83

(Introduced as a fix for a fairmode rounding issue? @thorbjoernl Also, it is Python rounding of numpy values; why not use np.around and avoid the numpy/float conversion?)

Currently, we are rounding twice.
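A small illustration of that point (not pyaerocom code): np.around works element-wise on numpy arrays and scalars, while the builtin round() has to be applied value by value:

import numpy as np

values = np.array([0.123456789, 1.000000004, -0.999999997])

rounded_np = np.around(values, 4)                  # one vectorized call, stays in numpy
rounded_py = [round(float(v), 4) for v in values]  # per-value rounding via Python floats

print(rounded_np, rounded_py)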

thorbjoernl commented 4 months ago

It fixed an issue where R values were out of bounds, which caused a test to fail: https://github.com/metno/pyaerocom/pull/1142

There is probably a better way to fix this.

heikoklein commented 4 months ago

Next step might be to vectorize statistics:

import numpy as np

data = np.random.random((10000, 100))

def flattenmean():
    for i in np.arange(data.shape[0]):
        df = data[i,:].flatten()
        np.nanmean(df)

def axismean():
    np.nanmean(data, axis=1)

Currently we are working purely on the flattened arrays. If we manage to vectorize that as in the axismean version, it will be 50x faster (just benchmarked).
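For reference, a self-contained way to reproduce that comparison with timeit (the definitions repeat the snippet above; numbers will of course vary by machine):

import timeit

import numpy as np

data = np.random.random((10000, 100))


def flattenmean():
    for i in np.arange(data.shape[0]):
        np.nanmean(data[i, :].flatten())


def axismean():
    np.nanmean(data, axis=1)


n = 10
t_flat = timeit.timeit(flattenmean, number=n)
t_axis = timeit.timeit(axismean, number=n)
print(f"flattened: {t_flat:.3f}s, axis-based: {t_axis:.3f}s, speedup ~{t_flat / t_axis:.0f}x")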

lewisblake commented 4 months ago

Part of the reason we compute the statistics on the flattened arrays is (or used to be) that the filtering is applied at the station time-series level. So we would like a way, with numpy, to apply the 75% temporal coverage requirement to the array directly, or really just some way to apply the filtering before computing the statistics. We should also check that we are not doing the temporal filtering twice.
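A hedged sketch of applying the 75% coverage requirement directly on a 2D array (stations x timesteps) before computing the statistics; the shapes and orientation are assumptions, not pyaerocom's actual data layout:

import numpy as np

rng = np.random.default_rng(1)
data = rng.random((500, 8760))                # 500 stations, one year of hourly values
data[rng.random(data.shape) < 0.25] = np.nan  # simulate gaps in the observations

coverage = np.mean(~np.isnan(data), axis=1)     # valid fraction per station
keep = coverage >= 0.75                         # 75% temporal coverage requirement
station_means = np.nanmean(data[keep], axis=1)  # statistics only for stations that pass

print(f"{keep.sum()} of {data.shape[0]} stations kept")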

lewisblake commented 4 months ago

Idea: pytest has a plugin called pytest-timeout. Once the code is optimized, we can write a test that checks it does not take more than a certain amount of time. If changes are introduced that cause the test to take longer, the test fails.
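A minimal sketch of what such a test could look like with the pytest-timeout plugin (the 30 s limit and the test body are placeholders, not a measured budget or the real test):

import pytest


@pytest.mark.timeout(30)  # fail if the test runs longer than 30 seconds
def test__process_statistics_timeseries_performance():
    # call the optimized function on a long hourly time series here;
    # this body is a placeholder, not an actual test from the pyaerocom suite
    ...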