Redoing hero calc because of boundary masking issue

cspencerjones commented 1 month ago

This issue is really for @TomNicholas . I decided to do this here rather than over on Slack because it probably requires a bit of thinking.

It seems like there are issues with the JPDFs. My current hypothesis is that this is caused by fill_value, which we would expect would be set here

@xgcm.as_grid_ufunc(
    boundary_width={"X": (1, 0), "Y": (1, 0)},
    boundary="fill",
    fill_value=np.nan,
    dask="parallelized",
)

But this fill_value is ignored. The way to make sure it is used is to do: ζ = vort(grid, ds.U, ds.dxC, ds.V, ds.dyC, ds.rAz, axis= 5 * [("Y", "X")], fill_value=np.nan), which @TomNicholas didn't do when he did the most recent version of the calculation. I have not yet written an xgcm issue about this.

The current lack of masking creates some horrible artifacts in the JPDFs. So it seems like I need to redo the hero calc, or abandon the JPDFs altogether. If I were to redo the calculation, I presumably would need to get a non-standard environment on LEAP, to include https://github.com/xgcm/xhistogram/pull/59? And I would need to get permissions to write to a persistent bucket? What else would I need to know (e.g. how many procs should I use)?

Is this all just more trouble than it's worth? Should I give up?

TomNicholas commented 1 month ago

It seems like there are issues with the JPDFs.

Damn. How annoying, sorry.

But this fill_value is ignored.

That doesn't sound hard to fix though.

presumably would need to get a non-standard environment on LEAP, to include https://github.com/xgcm/xhistogram/pull/59?

Pretty sure you don't need that for this calculation? It's just 3 sets of 1D bins.

And I would need to get permissions to write to a persistent bucket?

@jbusecke or @dhruvbalwada would have to help you with that.

What else would I need to know (e.g. how many procs should I use)?

Not much else. The number of procs should only affect how fast it goes, not whether or not it completes. The important thing is that you give each processor enough memory. The calculation is parallel in time, so you would just get one timestep to work and then scale it up. I would use the latest version of dask, and if that behaves weirdly you could downgrade to the one I used...

Is this all just more trouble than it's worth? Should I give up?

I can't tell you that, I can only help give an idea of how easy / hard it would be to redo (+ help fix the bugs in xGCM).

jbusecke commented 1 month ago

Sorry to hear about the trouble @cspencerjones, happy to support you with infrastructure from LEAP side!

And I would need to get permissions to write to a persistent bucket?

Are you doing this on the LEAP-hub? If so then definitely not. We would want to store the output somewhere else for the long-term, but for now write away!

It seems like there are issues with the JPDFs. My current hypothesis is that this is caused by fill_value, which we would expect would be set here

This is bad! We definitely need to fix that. Is there a simple demo that shows that setting fill_value on the decorator has no effect?

On a large note, maybe we should pad with the values of the connected faces? Might be overkill here (and we need to fix the above issue regardless) but just wanted to throw out there that we can properly pad LLC grids! I am very busy at the moment but have been wanting to push on xgcm more lately. Is there a possibility for collaboration (which would justify time investment on my side)?

Happy to chat more.

TomNicholas commented 1 month ago

On a large note, maybe we should pad with the values of the connected faces? Might be overkill here (and we need to fix the above issue regardless) but just wanted to throw out there that we can properly pad LLC grids! I am very busy at the moment but have been wanting to push on xgcm more lately. Is there a possibility for collaboration (which would justify time investment on my side)?

We deliberately avoided doing this - I think the logic was that without connected faces the vort/strain/div functions were supposed to propagate NaNs in slightly from the boundaries, and we reasoned that that should not have much effect on the histograms... Sounds like we should check that assumption explicitly though.

I would be hesitant to add that functionality to this calculation at this point, as I think it's unlikely that it (a) is simple (b) works nicely with dask without a rabbit hole of work.

jbusecke commented 1 month ago

Fair

cspencerjones commented 1 month ago

Thanks @jbusecke and @TomNicholas . It seems like it will be a bit easier than I thought.

Is there a simple demo that shows that setting fill_value on the decorator has no effect?

I just wrote an xgcm bug report with a minimal working example here: https://github.com/xgcm/xgcm/issues/652.

Are you doing this on the LEAP-hub?

Yes I am! I will have a go at writing a small amount first and ask you if I have any issues

I'm still running into some problems getting the cell with the xhistogram code to work, but I haven't really debugged it at all: I'll post again here with a minimal example if I am still stuck after I've spent some time on it.

cspencerjones commented 3 weeks ago

Hi @jbusecke. I have the code working when I use small amounts of data. When I try to use the number of timeslices per day that I want to, I keep getting disk spillage and it eventually crashes. I think I need to give the workers more memory. When I try to give workers more memory, I get this error:

AttributeError                            Traceback (most recent call last)
Cell In[72], line 6
      3 gateway = Gateway()
      4 options = gateway.cluster_options()
----> 6 options.worker_memory = 10  # 10 GB of memory per worker.
      8 # Create a cluster with those options
      9 cluster = gateway.new_cluster(options)

File [/srv/conda/envs/notebook/lib/python3.12/site-packages/dask_gateway/options.py:113](https://leap.2i2c.cloud/srv/conda/envs/notebook/lib/python3.12/site-packages/dask_gateway/options.py#line=112), in Options.__setattr__(self, key, value)
    112 def __setattr__(self, key, value):
--> 113     return self._set(key, value, AttributeError)

File [/srv/conda/envs/notebook/lib/python3.12/site-packages/dask_gateway/options.py:107](https://leap.2i2c.cloud/srv/conda/envs/notebook/lib/python3.12/site-packages/dask_gateway/options.py#line=106), in Options._set(self, key, value, exc_cls)
    105     self._fields[key].set(value)
    106 except KeyError:
--> 107     raise exc_cls("No option %r available" % key) from None

AttributeError: No option 'worker_memory' available

When I look at the options for the cluster, I can see this:

but I am only presented with options with 7.2GB/CPU.

Is this something that is controlled on your end?

jbusecke commented 2 weeks ago

Hey @cspencerjones this is controlled by 2i2c. I think that 8GB/core is the highest memory ratio we can get from the workers. In the past I have specified a lower worker count (effectively 'wasting' some cores), but AFAIK this does not work anymore. Several options here:

Can we reduce the memory requirement by reducing precision and/or rechunking the data?
We could try to use coiled to get more memory for the workers (not sure if this will work though)

TomNicholas commented 2 weeks ago

When I try to use the number of timeslices per day that I want to, I keep getting disk spillage and it eventually crashes.

8GB of memory usage per worker is a lot, I would like to understand why that's even necessary. The calculation should be embarrassingly parallel in time, so each worker should only need to load the data to process one time step? Or do we then do an aggregation step that means each worker needs to load the data for 14 days? Either way we should be able to reason about how much memory is required rather than just assuming we need more.

jbusecke commented 2 weeks ago

Yeah I agree with Tom here in general, but I also understand that @cspencerjones might not have a lot of time to debug at hand. @cspencerjones how many days are you trying to aggregate?

ocean-transport / llc-hero-calc

Redoing hero calc because of boundary masking issue #1