owid / etl

A compute graph for loading and transforming OWID's data
https://docs.owid.io/projects/etl
MIT License
58 stars 18 forks source link

Surface temperature data fails nightly ETL due to memory usage #2826

Closed larsyencken closed 1 week ago

larsyencken commented 2 weeks ago

Problem

Our regular ETL runs can accumulate state on the server, which means they don't always exercise the ETL end-to-end. To make sure it can always build cleanly from scratch, we run a nightly build.

The nightly build currently fails the surface temperature step below with an out of memory error, where it exceeds the 32GB we allow for an ETL step. The memory constraint is there to ensure that the ETL can be run locally on a "standard" laptop of 16GB ram + swap.

COMMAND: prlimit --as=34359738368 poetry run etl d run-python-step data://meadow/climate/2023-12-20/surface_temperature /var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/data/meadow/climate/2023-12-20/surface_temperature

We can see from earlier runs that it sometimes does succeed within this memory envelope. It's not clear why it would fail now.

Traceback

Traceback below ``` Traceback (most recent call last): --   | File "", line 1, in   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/rich_click/rich_command.py", line 367, in __call__   | return super().__call__(*args, **kwargs)   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__   | return self.main(*args, **kwargs)   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/rich_click/rich_command.py", line 152, in main   | rv = self.invoke(ctx)   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke   | return _process_result(sub_ctx.command.invoke(sub_ctx))   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke   | return _process_result(sub_ctx.command.invoke(sub_ctx))   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke   | return ctx.invoke(self.callback, **ctx.params)   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke   | return __callback(*args, **kwargs)   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/etl/run_python_step.py", line 33, in main   | _import_and_run(path, dest_dir)   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/etl/run_python_step.py", line 53, in _import_and_run   | module.run(dest_dir)   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/etl/steps/data/meadow/climate/2023-12-20/surface_temperature.py", line 55, in run   | da = _load_data_array(snap) - 273.15   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/etl/steps/data/meadow/climate/2023-12-20/surface_temperature.py", line 34, in _load_data_array   | return ds.sel(expver=1).combine_first(ds.sel(expver=5))["t2m"]   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/xarray/core/dataset.py", line 6817, in combine_first   | out = ops.fillna(self, other, join="outer", dataset_join="outer")   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/xarray/core/ops.py", line 148, in fillna   | return apply_ufunc(   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/xarray/core/computation.py", line 1253, in apply_ufunc   | return apply_dataset_vfunc(   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/xarray/core/computation.py", line 528, in apply_dataset_vfunc   | result_vars = apply_dict_of_variables_vfunc(   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/xarray/core/computation.py", line 452, in apply_dict_of_variables_vfunc   | result_vars[name] = func(*variable_args)   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/xarray/core/computation.py", line 730, in apply_variable_ufunc   | input_data = [   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/xarray/core/computation.py", line 732, in   | broadcast_compat_data(arg, broadcast_dims, core_dims)   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/xarray/core/computation.py", line 653, in broadcast_compat_data   | data = variable.data   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/xarray/core/variable.py", line 448, in data   | return self._data.get_duck_array()   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/xarray/core/indexing.py", line 833, in get_duck_array   | self._ensure_cached()   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/xarray/core/indexing.py", line 827, in _ensure_cached   | self.array = as_indexable(self.array.get_duck_array())   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/xarray/core/indexing.py", line 784, in get_duck_array   | return self.array.get_duck_array()   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/xarray/core/indexing.py", line 654, in get_duck_array   | array = array.get_duck_array()   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/xarray/coding/variables.py", line 81, in get_duck_array   | return self.func(self.array.get_duck_array())   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/xarray/coding/variables.py", line 81, in get_duck_array   | return self.func(self.array.get_duck_array())   | File "/var/lib/buildkite-agent/builds/etl-build-2-12/our-world-in-data/etl-full-private-rebuild-nightly/.venv/lib/python3.10/site-packages/xarray/coding/variables.py", line 232, in _apply_mask   | return np.where(condition, decoded_fill_value, data)   | numpy.core._exceptions._ArrayMemoryError: Unable to allocate 7.84 GiB for an array with shape (1013, 721, 1440) and data type float64 ```
larsyencken commented 2 weeks ago

Heads up @veronikasamborska1994 and @Marigold. The nightly build isn't blocking anyone, so it's not super urgent to fix. Is this the step that had already been optimised a lot for memory usage?

Marigold commented 2 weeks ago

Is this the step that had already been optimised a lot for memory usage?

Yes, here they are. I don't remember how much memory it took, but it ran on my laptop with 16gb. There are even some notes what to do to improve performance.

Several hypotheses from the top of my head why it stopped working:

veronikasamborska1994 commented 2 weeks ago

weird - it is taking much longer than it used to, but I don't think we are processing way more data now than a month ago! so more likely something to do with the rioxarray update? we can always still move this step to snapshot if it continues to cause issues

Marigold commented 2 weeks ago

Downgrading rioxarray didn't help (see https://github.com/owid/etl/pull/2832). The step is fast, but consumes ~40gb memory. We can at least use that staging server to profile it.

veronikasamborska1994 commented 2 weeks ago

I've tried using an earlier version of the data that used to work ok, but that also hasn't helped.

Marigold commented 1 week ago

Some ad-hoc memory optimizations didn't help. It fails on

da = ds.sel(expver=1).combine_first(ds.sel(expver=5))["t2m"]

That line must be causing a temporary memory peak which passes our memory limits. I'm trying something else.

Marigold commented 1 week ago

I did it!!!!! It was totally not worth the time, but at least I can sleep again...

(It was a good use case for the new line profiling CLI.)

image