Progress bar on open_mfdataset

pydata / xarray

N-D labeled arrays and datasets in Python

https://xarray.dev

Apache License 2.0


Progress bar on open_mfdataset #9521

Open nordam opened 1 week ago

nordam commented 1 week ago

Is your feature request related to a problem?

I'm using xarray.open_mfdataset() to open tens of thousands of (fairly small) netCDF files, and it's taking quite some time. Being of an impatient nature, I would like to at least be assured that something is happening, so a progress bar would be nice. I found an example of using a progress bar from dask here: https://github.com/pydata/xarray/issues/4000#issuecomment-619003228

However, my attempt to adapt this solution doesn't show a progress bar. Any other options?

Here is the code I tried:

import xarray as xr
from dask.diagnostics import ProgressBar

with ProgressBar():
    d = xr.open_mfdataset('proc/*.nc')

Describe the solution you'd like

I'd like to see a nice and fairly minimal progress bar, for example telling me how many files have been dealt with so far.

Describe alternatives you've considered

Something based on tqdm would be nice, but could also be something else.
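
To make the "how many files so far" idea concrete, here is a minimal sketch that uses only the standard library. It assumes it is acceptable to open the files one at a time rather than through open_mfdataset. The function name open_with_progress and its parameters are hypothetical, not part of xarray or tqdm.

```python
import sys


def open_with_progress(paths, opener, stream=None):
    """Call opener on each path in turn and report a running file count.

    open_with_progress is a hypothetical helper, not an xarray function.
    opener is any callable that takes a path and returns the opened object.
    """
    if stream is None:
        stream = sys.stderr
    paths = list(paths)
    total = len(paths)
    opened = []
    for count, path in enumerate(paths, start=1):
        opened.append(opener(path))
        # "\r" returns to the start of the line so the counter updates in place.
        stream.write(f"\ropened {count}/{total} files")
        stream.flush()
    stream.write("\n")
    return opened
```

With xarray, opener could be xr.open_dataset, and the returned list could then be passed to xr.combine_by_coords or xr.concat. Which of the two fits depends on how the files relate to each other, so treat that step as an assumption rather than a drop-in replacement for open_mfdataset.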

Additional context

No response

nordam commented 1 week ago

After discussion with a colleague, we ended up with this solution:

import xarray as xr
from dask.diagnostics import ProgressBar

with xr.open_mfdataset('proc/*.nc', chunks=dict(index=1)) as d, ProgressBar():
    d.load()

This works in the strict sense that it displays a progress bar, but unfortunately it does nothing (no progress bar visible) for a couple of minutes (for the set of files I tested), and then the progress bar shows up and runs through in a few seconds. In other words, not very useful for an impatient soul like me.

I should add that I'm testing this in a Jupyter notebook.

keewis commented 1 week ago

Indeed, the progress bar shows nothing during open_mfdataset unless you pass parallel=True. That option parallelizes access to the files by creating one dask task per open_dataset call, one for each file. Without it, open_dataset is called on each file in sequence without going through dask, so you don't get any feedback from dask.

The progress bar activity you see comes from loading each chunk into memory, which happens when you call d.load(), that is, after open_mfdataset has already returned.

What I think you should try is:

import xarray as xr
from dask.diagnostics import ProgressBar

# Enter ProgressBar first so it is already active while the files are opened.
with ProgressBar(), xr.open_mfdataset('proc/*.nc', chunks=dict(index=1), parallel=True) as d:
    d.load()

(You might not need the explicit chunks=dict(index=1); chunks={} could also work.)
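
A note on ordering, sketched with stand-ins rather than the real libraries: the items of a with statement are evaluated and entered left to right, so a progress bar only sees work that starts after it has been entered. In the sketch below, open_files and progress_bar are hypothetical placeholders for xr.open_mfdataset(..., parallel=True) and ProgressBar. The only thing it demonstrates is the order of events.

```python
from contextlib import contextmanager

events = []


class FakeDataset:
    """Stand-in for an opened dataset that can be used in a with statement."""

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False


def open_files():
    # Stand-in for the open call: the opening work happens when this is called.
    events.append("open files")
    return FakeDataset()


@contextmanager
def progress_bar():
    # Stand-in for a progress bar: it only observes work between on and off.
    events.append("bar on")
    try:
        yield
    finally:
        events.append("bar off")


# Bar listed second: the files are opened before the bar is active.
with open_files() as d, progress_bar():
    events.append("load")
bar_second = list(events)

events.clear()

# Bar listed first: the bar is active while the files are opened.
with progress_bar(), open_files() as d:
    events.append("load")
bar_first = list(events)

print(bar_second)
print(bar_first)
```

If the same holds for the real calls, listing ProgressBar() first is the arrangement in which the bar can report on the per-file open tasks as well as on the later d.load().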