lewisblake opened 4 months ago
Results from profiling one instance of the test:
calc_statistics takes only about 11% of the time, whereas load_region_mask_xr takes about 60%. A lot of this is in built-in numpy methods, but ideally we wouldn't be calling a function that loads a dataset inside a for-loop.
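One option would be to cache the mask loading so the dataset is only read once per region. A minimal sketch; the import path and call signature of load_region_mask_xr here are assumptions and may need adjusting:

import functools

from pyaerocom.helpers_landsea_masks import load_region_mask_xr  # assumed import path

@functools.lru_cache(maxsize=None)
def cached_region_mask(region_id):
    # read each region mask from disk only once; repeated calls inside the
    # region/timestep loop return the cached object instead of re-loading the file
    return load_region_mask_xr(region_id)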
Has this test been run with a long timeseries of more than one year of hourly data (at least 8760 timesteps)? We don't see a bottleneck for short timeseries.
It might also be the writing of the ts-json files. Processing the regional statistics timeseries for a 3-year dataset of hourly data takes ~46 min and writes out 1686 JSON files (one for each station), 5.3 GB in total.
Has this test been run with a long timeseries of more than one year of hourly data (at least 8760 timesteps)? We don't see a bottleneck for short timeseries.
No, this was just a quick first look at profiling tests/aeroval/test_coldatatojson_helpers2.py::test__process_statistics_timeseries
It might also be the writing of the ts-json files. Processing the regional statistics timeseries for a 3-year dataset of hourly data takes ~46 min and writes out 1686 JSON files (one for each station), 5.3 GB in total.
Agreed, but I'd need to take a deeper look into the logs.
A short IO test with one of the output files:
import timeit

from pyaerocom.aeroval.json_utils import read_json, write_json

# file with 46156617 bytes size
data = read_json("ALL-EEA-UTD-concno2-Surface.json")

i = 0


def write():
    global i
    i += 1
    write_json(
        data_dict=data, file_path=f"out{i}.json", ignore_nan=True, round_floats=True
    )


number = 5
timeout = timeit.timeit(
    "write()",
    setup="from pyaerocom.aeroval.json_utils import read_json, write_json",
    globals=globals(),
    number=number,
)
print(
    f"write of {number} files took {timeout}s -> {number*46156617/1024/1024/timeout} MB/s"
)
Running on the same machine and queue, I get an output speed of:
We are writing 7.7 GB to /lustre/storeB/users/charlien/cams2-83-tests/data/cams2-83/test-forecast-long-hourlystats/hm/ts/; at 2.8 MB/s this takes 2816 s, while at 49.0 MB/s it is reduced to 160 s.
The data format of this file does not allow for efficient numpy-based processing (dict of dicts). The data in .../test-forecast-long-hourlystats/ts/ is 5.3 GB and is written much faster, since all the processing there is numpy-based (dict of numpy arrays), even though it still uses float conversion.
Questions to @AugustinMortier: could we change the ts-json format from
{
  "concno2": {
    "EEA-UTD": {
      "Surface": {
        "ENSEMBLE": {
          "concno2": {
            "Germany": {
              "1622507400000": {
                "totnum": 245.0,
                "weighted": 0.0,
                "num_valid": 0.0,
                "refdata_mean": null,
                "refdata_std": null,
                "data_mean": null,
to
{
  "concno2": {
    "EEA-UTD": {
      "Surface": {
        "ENSEMBLE": {
          "concno2": {
            "Germany": {
              "timesteps": [1622507400000, ...],
              "totnum": [245.0, ...],
              "weighted": [0.0, ...],
              "num_valid": [0.0, ...],
              "refdata_mean": [null, ...],
              "refdata_std": [null, ...],
              "data_mean": [null, ...],
This would allow us to process the statistics for all timesteps in a single numpy operation (using axis=X) rather than processing the timesteps one after another.
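For illustration, a hypothetical helper that collapses the per-timestep dicts of the current layout into the proposed arrays-per-region layout (key names follow the example above; this is a sketch, not existing pyaerocom code):

import numpy as np

def timesteps_to_arrays(region_stats):
    # region_stats maps timestep (ms since epoch, as string) -> dict of statistics,
    # e.g. {"1622507400000": {"totnum": 245.0, "num_valid": 0.0, ...}, ...}
    timesteps = sorted(region_stats, key=int)
    stat_names = region_stats[timesteps[0]].keys()
    out = {"timesteps": np.array([int(t) for t in timesteps], dtype=np.int64)}
    for name in stat_names:
        values = [region_stats[t][name] for t in timesteps]
        # None (JSON null) becomes NaN so each column is a plain float array
        out[name] = np.array([np.nan if v is None else v for v in values], dtype=float)
    return out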
Why are stats already rounded by default? https://github.com/metno/pyaerocom/blob/8913284b9bba4117f04f443036fedd090c25317d/pyaerocom/stats/stats.py#L83
(Was this introduced as a fix for the fairmode rounding issue? @thorbjoernl Also, it is Python rounding of numpy values; why not np.around, which avoids the numpy/float conversion?)
Currently, we are rounding twice.
It fixed an issue where R values were out of bounds, which caused a test to fail: https://github.com/metno/pyaerocom/pull/1142
There is probably a better way to fix this.
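For reference, a small sketch of what vectorized rounding (and bounding R) could look like, assuming the statistics are kept in numpy arrays rather than converted element-wise to Python floats; the values below are purely illustrative:

import numpy as np

stats = np.array([0.1234567, np.nan, 0.9876543])        # illustrative statistics
r = np.array([1.0000000002, -1.0000000001, 0.73])        # R drifting outside [-1, 1]

rounded = np.around(stats, decimals=5)  # vectorized, no numpy -> float conversion
r_bounded = np.clip(r, -1.0, 1.0)       # possible alternative fix for out-of-bounds R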
Next step might be to vectorize the statistics:

import numpy as np

data = np.random.random((10000, 100))


def flattenmean():
    # current approach: one statistics call per flattened slice
    for i in np.arange(data.shape[0]):
        df = data[i, :].flatten()
        np.nanmean(df)


def axismean():
    # vectorized approach: one call over the whole array
    np.nanmean(data, axis=1)
Currently we are working purely on the flattened arrays. If we manage to vectorize that as in the axismean version, it will be 50x faster (just benchmarked).
Part of the reason we compute the statistics on the flattened arrays is (or used to be) that the filtering is applied at the station time series level. So we would like a numpy way to apply the 75% temporal coverage requirement to the array directly, or really just some way to apply the filtering before computing the statistics. We should also check that we are not doing the temporal filtering twice.
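A possible numpy approach for the coverage requirement, sketched under the assumption that the data are arranged as (stations, timesteps) and that 75% refers to the fraction of non-NaN values per station (shapes and threshold below are illustrative):

import numpy as np

data = np.random.random((1686, 26280))                 # stations x hourly timesteps
data[np.random.random(data.shape) < 0.4] = np.nan      # artificial gaps

# fraction of valid (non-NaN) timesteps per station
coverage = np.isfinite(data).mean(axis=1)

# mask out whole stations below the 75% coverage requirement, then compute the
# statistic for all stations in one vectorized call; stations that fail come out
# as NaN (numpy warns about all-NaN slices)
filtered = np.where(coverage[:, None] >= 0.75, data, np.nan)
station_means = np.nanmean(filtered, axis=1)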
Idea: pytest has a plugin called pytest-timeout. Once the code is optimized, we can write a test that checks that it does not take more than a certain amount of time. If changes are introduced which cause the test to take longer, the test fails.
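For example, a minimal sketch assuming pytest-timeout is installed; the test name and the 60 s limit are placeholders:

import pytest

@pytest.mark.timeout(60)  # placeholder limit in seconds
def test_process_statistics_timeseries_runtime():
    # call the optimized regional statistics processing here
    ...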
Describe the bug
There is a performance bug in the computation of the regional timeseries. The culprit is _process_statistics_timeseries in coldatatojson_helpers.py.
To Reproduce
Run the regional statistics timeseries processing on a long (multi-year) hourly dataset, or profile tests/aeroval/test_coldatatojson_helpers2.py::test__process_statistics_timeseries.
Expected behavior
We hope to significantly reduce the runtime of this function.
Additional context
Pretty much all projects would benefit from this.