openradar / xradar

A tool to work with weather radar data in xarray
https://docs.openradarscience.org/projects/xradar
MIT License

NEXRAD issue - IndexError index out of bounds #207

Open rabernat opened 1 month ago

rabernat commented 1 month ago

Description

I have found a puzzling bug that only comes up in certain situations when using Dask.

What I Did

import xradar
import xarray as xr
import pooch

# download and open a NEXRAD2 file from S3
url = "https://noaa-nexrad-level2.s3.amazonaws.com/2024/09/01/FOP1/FOP120240901_000347_V06"
local_file = pooch.retrieve(url, known_hash=None)
ds = xr.open_dataset(local_file, group="sweep_0", engine="nexradlevel2")

# create a chunked version 
dsc = ds.chunk()
# load one variable - IMPORTANT - skipping this step makes the next line work
dsc.DBZH.load()
# load the entire dataset
dsc.load()
# -> IndexError: index 140 is out of bounds for axis 0 with size 38
# try all the variables
for v in dsc:
    print(v)
    try:
        dsc[v].load()  # also fails with dsc!
        print("ok")
    except Exception as e:
        print(e)
# DBZH
# ok
# ZDR
# index 212 is out of bounds for axis 0 with size 212
# PHIDP
# index 140 is out of bounds for axis 0 with size 130
# RHOHV
# index 140 is out of bounds for axis 0 with size 22
# CCORH
# index 140 is out of bounds for axis 0 with size 73
# sweep_mode
# ok
# sweep_number
# ok
# prt_mode
# ok
# follow_mode
# ok
# sweep_fixed_angle
# ok

Possibly related to #180.

Experience tells me this has something to do with Dask task tokenization.
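
A minimal sketch of how one might probe that hypothesis (my sketch, not a confirmed diagnosis; it relies on xarray's private Variable._data attribute holding the lazy backend array): if two different variables hash to the same Dask token, Dask deduplicates their tasks and reads one array where the other was expected, which would fit the mismatched index/size pairs above.

import dask.base

# Compare the Dask tokens of the lazy backend arrays wrapped by each variable.
# (Illustrative check only, assuming ds from the reproducer above.)
tokens = {v: dask.base.tokenize(ds[v].variable._data) for v in ds.data_vars}
print(tokens)
# If e.g. DBZH and ZDR report identical tokens despite wrapping different
# arrays, the deduplicated graph would load the wrong variable and raise
# "index N is out of bounds" errors like the ones above.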

syedhamidali commented 1 month ago

@rabernat Thanks for sharing this issue.

The IndexError seems related to variables with inconsistent dimensions. Some variables (e.g., sweep_mode, sweep_number) are scalars, while others (e.g., DBZH, ZDR) are multi-dimensional, which could be causing the issue with Dask chunking.

To focus on the multi-dimensional variables, you can try:

import xradar
import xarray as xr
import pooch
from IPython.display import display  # display() below assumes IPython is available

# download and open a NEXRAD2 file from S3
url = "https://noaa-nexrad-level2.s3.amazonaws.com/2024/09/01/FOP1/FOP120240901_000347_V06"
local_file = pooch.retrieve(url, known_hash=None)
ds = xr.open_dataset(local_file, group="sweep_0", engine="nexradlevel2")

# create a chunked version 
dsc = ds.chunk()
for var in dsc.data_vars:
    if len(dsc[var].dims) > 1:
        print(var)
        display(dsc[var].load())

rabernat commented 1 month ago

@syedhamidali - I'm not sure I understand your response.

Loading this dataset works fine without Dask. When Dask comes into the picture, we get an error. This seems like a bug in xradar; the workaround you proposed does not address the root cause.
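
Restating the contrast as code, condensed from the reproducer above (local_file is the NEXRAD file downloaded there):

import xradar
import xarray as xr

ds = xr.open_dataset(local_file, group="sweep_0", engine="nexradlevel2")
ds.load()  # eager load without Dask: works fine

dsc = ds.chunk()
dsc.DBZH.load()  # load one variable first (the trigger noted in the report)
dsc.load()       # -> IndexError, as reported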

kmuehlbauer commented 1 month ago

Thanks for the detailed report @rabernat. I've reopened #180 as it wasn't fully resolved.

A deeper look will take some time. We will definitely look into this after ERAD 2024, where the majority of the xradar devs currently are.

Side note: @rabernat, you might be interested in the short course we gave last Sunday, where we acknowledged the great work of Pangeo and Project Pythia.

Thanks also to @syedhamidali for taking care here.

syedhamidali commented 1 month ago

@kmuehlbauer I wanted to mention that I ran the same code with other file types (CfRadial, Iris, ...), and they all hit the same issue with Dask chunking.
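
For reference, a sketch of how the same check can be repeated on a CfRadial1 file; the open-radar-data helper and the sample file name are assumptions on my part, not taken from the runs above:

import xradar  # registers the cfradial1 engine
import xarray as xr
from open_radar_data import DATASETS  # assumed helper for fetching sample files

# assumed sample file name from the open-radar-data registry
cf_file = DATASETS.fetch("cfrad.20080604_002217_000_SPOL_v36_SUR.nc")
ds = xr.open_dataset(cf_file, group="sweep_0", engine="cfradial1")

dsc = ds.chunk()
first_var = list(dsc.data_vars)[0]
dsc[first_var].load()  # load one variable first, mirroring the NEXRAD reproducer
dsc.load()             # reportedly hits the same kind of IndexError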