Closed larsyencken closed 1 week ago
Heads up @veronikasamborska1994 and @Marigold. The nightly build isn't blocking anyone, so it's not super urgent to fix. Is this the step that had already been optimised a lot for memory usage?
> Is this the step that had already been optimised a lot for memory usage?
Yes, here they are. I don't remember how much memory it took, but it ran on my laptop with 16 GB. There are even some notes on what to do to improve performance.
Several hypotheses off the top of my head for why it stopped working:
Weird. It is taking much longer than it used to, but I don't think we are processing way more data now than a month ago! So it's more likely something to do with the rioxarray update? We can always still move this step to snapshot if it continues to cause issues.
Downgrading rioxarray didn't help (see https://github.com/owid/etl/pull/2832). The step is fast, but consumes ~40 GB of memory. We can at least use that staging server to profile it.
I've tried using an earlier version of the data that used to work ok, but that also hasn't helped.
Some ad-hoc memory optimizations didn't help. It fails on this line:
`da = ds.sel(expver=1).combine_first(ds.sel(expver=5))["t2m"]`
That line must be causing a temporary memory peak that exceeds our memory limit. I'm trying something else.
I did it!!!!! It was totally not worth the time, but at least I can sleep again...
(It was a good use case for the new line profiling CLI.)
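For anyone wanting to confirm a temporary peak like the one suspected above, here is a minimal sketch using the standard library's `tracemalloc`. It is not the line profiling CLI mentioned in the thread and not the actual fix. `combine_first_bytes` and `measure_peak` are hypothetical names, and the byte arrays are only a toy stand-in for the two `expver` selections.

```python
import tracemalloc

MB = 1024 * 1024


def combine_first_bytes(n_bytes):
    """Toy stand-in for the failing line: two inputs stay alive while a third is built."""
    first = bytearray(n_bytes)   # plays the role of ds.sel(expver=1) (assumption)
    second = bytearray(n_bytes)  # plays the role of ds.sel(expver=5) (assumption)
    # Keep the value from `first` unless it is zero, otherwise fall back to `second`.
    return bytearray(a or b for a, b in zip(first, second))


def measure_peak(func, *args):
    """Run func(*args) and return (result, bytes still held, peak bytes traced)."""
    tracemalloc.start()
    try:
        result = func(*args)
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, current, peak


if __name__ == "__main__":
    _, current, peak = measure_peak(combine_first_bytes, MB)
    print(f"held after return: {current / MB:.1f} MB, peak: {peak / MB:.1f} MB")
```

The gap between the peak and what is still held after the call returns is the temporary allocation. Note that `tracemalloc` only sees allocations made through Python's memory allocators, so it may under-report memory used by native libraries.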
Problem
Our regular ETL runs can accumulate state on the server, which means they don't always exercise the ETL end-to-end. To make sure it can always build cleanly from scratch, we run a nightly build.
The nightly build currently fails on the surface temperature step below with an out-of-memory error, exceeding the 32 GB we allow for an ETL step. The memory constraint is there to ensure that the ETL can be run locally on a "standard" laptop with 16 GB of RAM plus swap.
We can see from earlier runs that it sometimes does succeed within this memory envelope. It's not clear why it would fail now.
Traceback
Traceback below
``` Traceback (most recent call last): -- | File "