pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

xr.doctor(): diagnostics on a Dataset / DataArray ? #6308

Open benbovy opened 2 years ago

benbovy commented 2 years ago

Is your feature request related to a problem?

Recently I've been reading through various issue reports here and there (GH issues and discussions, forums, etc.), and I'm wondering whether it would be useful to have a function in Xarray that inspects a Dataset or DataArray and reports a set of diagnostics, so that the community could better help troubleshoot performance or other issues faced by users.

It's not always obvious where to look (e.g., number of chunks of a dask array, number of tasks of a dask graph, etc.) to diagnose issues, sometimes even for experienced users.
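As a rough illustration of "where to look", the sketch below collects the two quantities mentioned above for each variable. The helper name `chunk_diagnostics` and the shape of the returned dict are assumptions made for this example, not an existing Xarray API. It relies on `Variable.chunks` being `None` for non-chunked data and on Dask collections exposing `__dask_graph__()`.

```python
import math

import numpy as np
import xarray as xr


def chunk_diagnostics(obj):
    """Hypothetical helper (not part of xarray): chunk and task counts per variable."""
    if isinstance(obj, xr.DataArray):
        name = obj.name if obj.name is not None else "<unnamed>"
        variables = {name: obj.variable}
    else:
        variables = {k: v.variable for k, v in obj.data_vars.items()}

    report = {}
    for name, var in variables.items():
        chunks = var.chunks  # None unless the wrapped array is chunked
        info = {
            "nbytes": var.nbytes,
            "chunked": chunks is not None,
            "n_chunks": None,
            "n_tasks": None,
        }
        if chunks is not None:
            # One chunk per combination of blocks along each dimension.
            info["n_chunks"] = math.prod(len(c) for c in chunks)
            # Only Dask collections expose a task graph; other chunked
            # array types are left with n_tasks=None.
            graph = getattr(var.data, "__dask_graph__", None)
            if graph is not None:
                info["n_tasks"] = len(graph())
        report[name] = info
    return report


arr = xr.DataArray(np.zeros((4, 3)), dims=("x", "y"), name="t")
print(chunk_diagnostics(arr))
```

With a Dask-backed object (for example after `arr.chunk({"x": 2})`, which requires Dask to be installed), the same call would also fill in `n_chunks` and `n_tasks`.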

Describe the solution you'd like

An xr.doctor(dataset_or_dataarray) top-level function (or Dataset.doctor() / DataArray.doctor() methods) that would perform a battery of checks and return helpful diagnostics, e.g.,

Describe alternatives you've considered

None

Additional context

No response

max-sixty commented 2 years ago

Very much agree with the goal!

I wonder whether there's a broader approach with something like xr.describe — i.e. give lots of useful info about the metadata of the array, including any warnings. It's not that performance sensitive, so it would be fine to throw lots of things in there.

Either way, I'm a +1

rabernat commented 1 year ago

Just found this issue! I agree that this would be helpful. But isn't it fundamentally a Dask issue? Vanilla Xarray + Numpy has none of these problems because everything is in memory.

echarles commented 1 year ago

> Vanilla Xarray + Numpy has none of these problems because everything is in memory.

That is my understanding of xarray too. Or is there a way for an xarray variable to point to a Dask structure?

> But isn't it fundamentally a Dask issue?

Dask already has some performance_report capabilities, documented at https://docs.dask.org/en/stable/diagnostics-distributed.html#capture-diagnostics. Is anything missing there?

benbovy commented 1 year ago

The kind of data wrapped in an Xarray Dataset (e.g., a Numpy array, a Dask array or any other array, see #5648) is already something useful that xr.doctor or xr.describe could report!

From my experience of introducing Xarray to new users, they are often completely unaware of what is under the hood until something or someone points it out, usually after they hit some weird behavior or a performance issue that is hard to figure out by themselves. Xarray objects are flexible container wrappers connected to a wide range of other Python libraries, so it is hard to give a short introduction that covers all the important aspects (lazy / non-lazy, chunked / non-chunked, etc.). For example, someone who has never heard of Dask or Zarr may follow an Xarray tutorial that starts by opening a chunked dataset from a Zarr store. In this case the rich repr of the Xarray Dataset doesn't even help.

Rather than a performance report or a profiling tool, the proposal here (still very vague) is to provide a helper function that returns some information and explanation in plain English (why not with some hyperlinks, pretty printing, etc.) that would help users make sense of an Xarray object and its wrapped data/metadata. Some kind of interactive documentation very specific to the actual Xarray object. Some kind of smart tool that would partially "replace" custom (though very basic) user support.
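To make the "plain English" idea a bit more concrete, here is a minimal sketch of what such a helper might return. The function name `explain` and the wording of the messages are purely hypothetical. It only distinguishes chunked from non-chunked variables, whereas a real tool would presumably check much more.

```python
import math

import numpy as np
import xarray as xr


def explain(ds):
    """Hypothetical sketch: one plain-English sentence per data variable."""
    messages = []
    for name, da in ds.data_vars.items():
        size_mb = da.nbytes / 1e6
        if da.chunks is None:
            # No chunk information: the data are not wrapped in a chunked array.
            messages.append(
                f"'{name}' is not chunked: computations on it need all "
                f"{size_mb:.2f} MB in memory and run immediately."
            )
        else:
            n_chunks = math.prod(len(c) for c in da.chunks)
            messages.append(
                f"'{name}' is split into {n_chunks} chunks (typically a Dask "
                f"array): computations are lazy and run only when results "
                f"are requested, e.g. via .compute()."
            )
    return messages


ds = xr.Dataset({"temp": (("x",), np.zeros(1000))})
for line in explain(ds):
    print(line)
```

Returning a list of strings rather than printing directly is a design choice made for this sketch, so that the same messages could later be rendered as rich HTML or with hyperlinks.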