leap-stc / cmip6-leap-feedstock

Smarter way to identify zombies or failed jobs? #120

Open jbusecke opened 6 months ago

jbusecke commented 6 months ago

Using runtime as the indicator can become fairly expensive and is not really reliable. What we really want to identify is jobs that are no longer doing anything (stalled, e.g. low CPU, constant RAM), and maybe jobs that show too many errors.

This job seemed utterly broken (I honestly have no clue what is going on) and has over 50 errors raised. It's scaled up to a lot of workers and is thus costing quite a bit of money. I opted to kill it because running it for another few hours might have caused hundreds of dollars in cost. Another indicator that this particular job is broken: it did not write ANYTHING, not even the skeleton zarr. I feel that if this is still true after a while, we should kill these jobs too.
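A rough sketch of how the three signals above (stalled resources, too many errors, nothing written) could be combined into one check. Everything here is hypothetical: the `JobSnapshot` fields, the `kill_reason` function, and all thresholds are placeholder assumptions rather than anything the feedstock or the job runner provides, and collecting the metrics is left out entirely.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobSnapshot:
    """One monitoring sample for a running job (hypothetical structure)."""
    minutes_since_start: float
    cpu_fraction: float   # mean worker CPU utilization, 0.0 to 1.0
    ram_gb: float         # mean worker memory in use
    error_count: int      # cumulative errors raised so far
    bytes_written: int    # cumulative bytes written to the target store


def kill_reason(
    snapshots: List[JobSnapshot],
    *,
    max_errors: int = 50,                # assumed threshold
    stall_window: int = 3,               # consecutive samples to inspect
    cpu_floor: float = 0.05,             # "low CPU" cutoff (assumed)
    ram_tolerance_gb: float = 0.1,       # "constant RAM" band (assumed)
    write_grace_minutes: float = 60.0,   # time allowed before first write
) -> Optional[str]:
    """Return a reason to kill the job, or None if it looks healthy."""
    if not snapshots:
        return None
    latest = snapshots[-1]

    # Signal 1: the job has raised too many errors.
    if latest.error_count > max_errors:
        return "too_many_errors"

    # Signal 2: nothing written (not even a skeleton store) after the grace period.
    if latest.bytes_written == 0 and latest.minutes_since_start >= write_grace_minutes:
        return "no_output"

    # Signal 3: stalled, i.e. low CPU and flat RAM over the last few samples.
    recent = snapshots[-stall_window:]
    if len(recent) == stall_window:
        low_cpu = all(s.cpu_fraction < cpu_floor for s in recent)
        ram_values = [s.ram_gb for s in recent]
        flat_ram = max(ram_values) - min(ram_values) <= ram_tolerance_gb
        if low_cpu and flat_ram:
            return "stalled"

    return None
```

The checks are ordered so that the cheapest and least ambiguous signal (error count) wins first. The stall check requires a full window of samples so that a single quiet moment does not trigger a kill.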

jbusecke commented 6 months ago

Oh I think that job actually failed!