microsoft / PlanetaryComputer

Issues, discussions, and information about the Microsoft Planetary Computer
https://planetarycomputer.microsoft.com/
MIT License

Access to Sentinel 5P data - checksum and parallelised download #362

Open Lieselotte12 opened 1 month ago

Lieselotte12 commented 1 month ago

Hello everyone,

I'm using the Planetary Computer archive to access Sentinel-5P data for my master's thesis. I want to clip the data to a specific bounding box, filter it afterwards, and calculate some values. The functions I wrote for this work fine (I tried them on several datasets), but I have a long time period to cover and many areas of interest, so I implemented a for-loop that goes through each day of the period. At some point, the script locks up while opening the next dataset (see the function clip_dataset() below), so it no longer prints "Start". This is intermittent: sometimes the dataset for a given day opens, sometimes it locks up. One possibility is that the data is not downloaded completely, so the later steps cannot run. I would therefore like to verify the checksum of the dataset. Is it possible to get a checksum from the URL, and if so, how (e.g. with MD5)? Is it also possible to parallelise the download, given the large amount of data I need for my thesis? Or does the API not allow it?

Thanks in advance!

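import fsspec
import rioxarray  # noqa: F401 (registers the .rio accessor used below)
import xarray as xr


def clip_dataset(item, indicator, min_lon_b, min_lat_b, max_lon_b, max_lat_b):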
    try:
        with fsspec.open(item.assets[indicator].href, timeout=600) as f:  
            # Open the NetCDF dataset
            ds = xr.open_dataset(f, group="PRODUCT", engine="h5netcdf")
            print("Start")
            # Set the CRS
            ds_2 = ds.rio.write_crs("epsg:4326")

            print("beginning with masking for this item")
            # Create mask for background bounding box
            lat_back_mask = (ds_2['latitude'] >= min_lat_b) & (ds_2['latitude'] <= max_lat_b)
            lon_back_mask = (ds_2['longitude'] >= min_lon_b) & (ds_2['longitude'] <= max_lon_b)
            mask_back = lat_back_mask & lon_back_mask

            print("mask created")

            # Apply mask to filter the data for background bounding box
            background_ds = ds_2.where(mask_back, drop=True)
            print("masking finished for this item")

            return background_ds
    except Exception as e:
        print("Error in clip_dataset:", e)
        return None
TomAugspurger commented 1 month ago

We have seen some issues using h5py and fsspec (maybe in combination with h5netcdf). See https://github.com/h5py/h5py/issues/2019 and linked threads. That gets into the weeds, but the summary is that it's challenging to read NetCDF files over the network reliably.
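One way to sidestep reading NetCDF over the network, and to check that a file arrived completely, is to copy the asset to local disk first and hash the local copy. The following is a minimal sketch using only the standard library, not a confirmed Planetary Computer recipe. The helper names `download_to_file` and `md5_of_file` are hypothetical, the URL is assumed to be readable as-is (for example, an already signed asset href), and whether the server publishes a reference checksum to compare against is not established in this thread.

```python
import hashlib
import shutil
import urllib.request


def download_to_file(url, dest_path, timeout=600):
    """Copy the full response body at `url` to `dest_path` (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(dest_path, "wb") as out:
        # Stream in chunks so large files are never held fully in memory.
        shutil.copyfileobj(response, out)
    return dest_path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The local copy can then be opened with the same call as in the original function, for example `xr.open_dataset(dest_path, group="PRODUCT", engine="h5netcdf")`. Without a reference checksum, the digest is still useful for comparing two downloads of the same asset.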

> Is it also possible to parallelise the download

Yes, you can use something like concurrent.futures or dask or some other parallel programming library to do the data access in parallel if needed.
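As a rough illustration of the `concurrent.futures` route, the sketch below runs one function per item in a thread pool and records failures per item instead of stopping the whole run. The helper name `run_in_parallel` and the `max_workers` default are assumptions made for this example. Threads are used because the work is mostly waiting on the network.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(worker, items, max_workers=4):
    """Call worker(item) for each item in a thread pool (hypothetical helper).

    Returns a list of (item, result, error) tuples in input order;
    exactly one of result/error is None for each entry.
    """
    items = list(items)
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front so the tasks run concurrently.
        futures = [pool.submit(worker, item) for item in items]
        for item, future in zip(items, futures):
            try:
                results.append((item, future.result(), None))
            except Exception as exc:
                # Keep going: one failed item should not abort the rest.
                results.append((item, None, exc))
    return results
```

With the function from the original post, the worker could be something like `lambda item: clip_dataset(item, indicator, min_lon_b, min_lat_b, max_lon_b, max_lat_b)`. Note that a thread pool does not cancel a call that hangs, so a stuck read will still block the final result.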