leap-stc / wavewatch3_feedstock

Apache License 2.0

Super high Memory consumption #2

Open jbusecke opened 4 months ago

jbusecke commented 4 months ago

I got the pruned version of this to run with very large workers, but the full recipe blows even those workers out of memory:

[screenshot: worker memory usage]

We previously noted that these files are very heavily compressed, but a single file should only be ~20 GB in memory. What is pushing memory past 100 GB?
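For scale, a rough back-of-envelope: the in-memory footprint is just element count times dtype size, regardless of how hard the file is compressed on disk. The dimensions below are made up purely to illustrate how a ~20 GB in-memory size comes about:

```python
# Hypothetical dimensions (NOT taken from the actual files), chosen so the
# uncompressed footprint lands near the ~20 GB figure mentioned above.
n_time, n_lat, n_lon = 744, 1440, 2400  # e.g. hourly steps x a global grid (assumed)
bytes_per_value = 8                     # float64

in_memory_gb = n_time * n_lat * n_lon * bytes_per_value / 1e9
print(f"{in_memory_gb:.1f} GB uncompressed in memory")  # 20.6 GB
```

So ~20 GB per file is plausible, which makes the >100 GB readings all the more puzzling.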

jbusecke commented 4 months ago

Perhaps in this case turning off gc is not a great solution?
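For reference, "turning off gc" here means disabling Python's cyclic garbage collector around the heavy step (on a Dask cluster this would typically be applied on each worker, e.g. via a worker plugin). A minimal stdlib sketch of the pattern, so the trade-off being questioned is concrete:

```python
import gc
from contextlib import contextmanager

@contextmanager
def gc_disabled():
    """Disable cyclic GC for a block, then restore it and do one explicit pass."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()  # one explicit collection after the heavy work

with gc_disabled():
    assert not gc.isenabled()
    # ... run the memory-heavy combine step here ...

assert gc.isenabled()
```

The downside, as hinted above: with the collector off, reference cycles created during the step are not reclaimed until the explicit `gc.collect()` at the end, so peak memory can grow.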

jbusecke commented 4 months ago

Checking that in #1 (https://github.com/leap-stc/wavewatch3_feedstock/pull/1/commits/2659b1bcaf2eff9d34d6fd2618f59f9a3caee333)

jbusecke commented 4 months ago

No that was not it. Let me try to drop all but one variable.

jbusecke commented 4 months ago

Ok, this is ridiculous. I have given this extremely large workers (800 GB RAM) and still see the same pattern:

[screenshot: worker memory usage]

From the logs I can tell that it starts by combining every one of the output chunks (87,000 time steps / 100 time steps per chunk ≈ 870, and we see those chunks being combined). At the point where the RAM fills up, not a single chunk of the data has been written.

[screenshot]
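The chunk count works out as follows (87,000 / 100 is exactly 870; `ceil` just guards against a partial final chunk):

```python
import math

n_time_steps = 87_000
steps_per_chunk = 100

n_output_chunks = math.ceil(n_time_steps / steps_per_chunk)
print(n_output_chunks)  # 870
```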

Something is seriously wrong here: this is neither scaling out across workers nor streaming the work through the one very large worker.

jbusecke commented 4 months ago

Ok, now it has scaled up to 2 workers (I set a limit of 3 because these could get very expensive).

jbusecke commented 4 months ago

Yeah, this is horse 💩. Maybe if I give it enough workers to load the entire dataset into distributed memory it would work, but that is not really how this is supposed to work:

[screenshot: cluster memory usage]
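To spell out the contrast, here is a toy stdlib sketch (fake chunks, no real I/O, all names hypothetical): a streaming pipeline holds roughly one chunk in memory at a time, whereas accumulating every combined chunk before writing holds the whole dataset, which matches the behaviour observed above.

```python
def make_chunks(n_chunks, chunk_len):
    """Yield fake chunks standing in for 100-time-step pieces of the dataset."""
    for i in range(n_chunks):
        yield [float(i)] * chunk_len

def accumulate_then_write(chunks):
    """Anti-pattern: materialise every chunk first (peak memory ~ whole dataset)."""
    all_chunks = list(chunks)
    return sum(len(c) for c in all_chunks)

def stream_write(chunks):
    """Intended pattern: write each chunk as it arrives (peak memory ~ one chunk)."""
    written = 0
    for chunk in chunks:
        written += len(chunk)  # stand-in for writing this chunk to storage
    return written

# Both produce the same output; only the peak memory differs.
assert accumulate_then_write(make_chunks(870, 10)) == stream_write(make_chunks(870, 10))
```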