pmlmodelling / nctoolkit

A Python package for netCDF analysis and post-processing
https://nctoolkit.readthedocs.io/en/latest/
GNU General Public License v3.0

[JOSS] Object representation does not reflect lazy operations #77

Closed: malmans2 closed this issue 1 year ago

malmans2 commented 1 year ago

Describe the bug

It looks like I'm not able to select a single year from an NCEP dataset (or at least the representation of the DataSet object does not show the subsetting).

To Reproduce

import nctoolkit as nc
nc_ds = nc.open_url("https://github.com/pydata/xarray-data/raw/master/air_temperature.nc")
print(nc_ds)
nc_ds.subset(year = 2013)
print(nc_ds)

nctoolkit is using Climate Data Operators version 2.2.0
Downloading https://github.com/pydata/xarray-data/raw/master/air_temperature.nc

The variable air has integer data type. Consider setting data type to float 'F64' or 'F32' using set_precision.

<nctoolkit.DataSet>:
Number of files: 1
File contents:
  variable  ntimes  npoints  nlevels                                   long_name  unit data_type
0      air    2920     1325        1  4xDaily Air temperature at sigma level 995  degK       I16

<nctoolkit.DataSet>:
Number of files: 1
File contents:
  variable  ntimes  npoints  nlevels                                   long_name  unit data_type
0      air    2920     1325        1  4xDaily Air temperature at sigma level 995  degK       I16

Expected behavior

ntimes should change from 2920 to 1460, as it does with xarray:

import xarray as xr
xr_ds = xr.tutorial.open_dataset("air_temperature").chunk()
print(xr_ds.dims)
xr_ds = xr_ds.sel(time="2013")
print(xr_ds.dims)

Frozen({'lat': 25, 'time': 2920, 'lon': 53})
Frozen({'lat': 25, 'time': 1460, 'lon': 53})

https://github.com/openjournals/joss-reviews/issues/5494

malmans2 commented 1 year ago

I see, I need to run nc_ds.run() to actually see the changes. I find this quite confusing, especially because dataset objects are modified in place and therefore it's very hard to keep track of the modifications that will be applied.

I think that the representation of the object should show the modified coordinates/dimensions/sizes (same as xarray+dask, which is also lazy), or it should show at least all the operations that will be applied when nc_ds.run() is called.
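
A minimal sketch of the second option (showing the queued operations in the representation). The `LazyDataset` class, its `_pending` list, and its `__repr__` layout are all hypothetical and only illustrate the idea; this is not how nctoolkit's `DataSet` is implemented.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not nctoolkit's DataSet)."""

    def __init__(self, ntimes):
        self.ntimes = ntimes   # size of the data as currently stored
        self._pending = []     # operations queued but not yet executed

    def subset(self, **kwargs):
        # Queue the operation instead of executing it immediately.
        self._pending.append(("subset", kwargs))

    def __repr__(self):
        lines = ["<LazyDataset>", f"ntimes: {self.ntimes}"]
        if self._pending:
            lines.append("Pending operations (applied on run()):")
            for name, kwargs in self._pending:
                args = ", ".join(f"{k}={v!r}" for k, v in kwargs.items())
                lines.append(f"  {name}({args})")
        return "\n".join(lines)


ds = LazyDataset(ntimes=2920)
ds.subset(year=2013)
print(ds)
```

With something like this, `ntimes` would still read 2920 before evaluation, but the pending `subset(year=2013)` would be visible, so it would be clear that the object has been modified.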

robertjwilson commented 1 year ago

Yeah, there is some ambiguity here. I think the solution is to automatically run ds.run() when you access attributes etc. This behaviour would be more what a new user would expect. And it's not going to have any computational impacts, as you'll only really be accessing attributes interactively, not when scripting.

In theory, changes could be tracked without running commands, but that would just become very awkward book-keeping.
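
Roughly, the idea is something like the sketch below. It uses a toy class with hypothetical names (`AutoRunDataset`, `_pending`, `_times`), not the actual nctoolkit internals: accessing an attribute first flushes any queued operations, so the reported value always reflects them.

```python
class AutoRunDataset:
    """Toy lazy dataset whose attributes trigger evaluation (illustrative only)."""

    def __init__(self, times):
        self._times = list(times)  # e.g. the year of each time step
        self._pending = []         # queued operations, not yet applied

    def subset(self, year):
        # Queue a filter that keeps only the time steps in the given year.
        self._pending.append(lambda times: [t for t in times if t == year])

    def run(self):
        # Apply every queued operation in order, then clear the queue.
        for operation in self._pending:
            self._times = operation(self._times)
        self._pending = []

    @property
    def ntimes(self):
        # Evaluate pending operations before reporting the size.
        self.run()
        return len(self._times)


ds = AutoRunDataset([2013] * 1460 + [2014] * 1460)
ds.subset(year=2013)
print(ds.ntimes)
```

Here the user never calls `run()` explicitly, yet `ds.ntimes` reports the subsetted size.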

malmans2 commented 1 year ago

OK, I think it's clear now.

Very minor thing I've noticed. When I run

nc_ds = nc.open_url("https://github.com/pydata/xarray-data/raw/master/air_temperature.nc")

there's a weird string printed while downloading. (The string is the second half of the url).

robertjwilson commented 1 year ago

That's strange. This prints OK for me on Linux.

What Python version/OS are you using?

The code is just print(f"Downloading {x}"), where x is the string of the url. So it's hard to see what could cause this.

malmans2 commented 1 year ago

I'm using macOS. Here is the env: nctoolkit_env.txt

I tried both python and ipython, same issue.

robertjwilson commented 1 year ago

OK. This seems to be a shell issue. The function also has the following line, which as far as I remember was only there to tidy up the printed output.

print("\033[A \033[A")

This must behave differently on Macs.

Printing the url you are downloading is overkill. So I've just removed that from the function in the dev version.
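
For context, "\033[A" is the ANSI escape sequence that moves the cursor up one line, so its visible effect depends on the terminal. If the line were kept, one possible guard is sketched below: only emit the sequence when the output stream is an interactive terminal. The helper name `tidy_previous_line` is hypothetical and is not part of nctoolkit.

```python
import io
import sys

CURSOR_UP = "\033[A"  # ANSI escape sequence: move the cursor up one line


def tidy_previous_line(stream=sys.stdout):
    """Hypothetical helper: write the cursor-up sequence only to a terminal."""
    if stream.isatty():
        stream.write(f"{CURSOR_UP} {CURSOR_UP}\n")
        return True
    # Not a terminal (file, pipe, in-memory buffer): write nothing.
    return False


# An in-memory buffer is not a terminal, so nothing is written to it.
buffer = io.StringIO()
emitted = tidy_previous_line(buffer)
print(emitted, repr(buffer.getvalue()))
```

This does not explain the macOS behaviour reported above; it only keeps escape sequences out of output that is not going to a terminal.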