pytroll / satpy

Python package for earth-observing satellite data processing
http://satpy.readthedocs.org/en/latest/
GNU General Public License v3.0

Failed to save tropomi nc file with specific variables loaded #1493

Closed · zxdawn closed this issue 3 years ago

zxdawn commented 3 years ago

Describe the bug

I got this error, which is similar to #1031, when I tried to save a loaded Scene of TROPOMI NO2 L2 data to a NetCDF file.

Failed to load coordinates for 'DataID(name='latitude', modifiers=())'
Failed to load coordinates for 'DataID(name='longitude', modifiers=())'
Can't load ancillary dataset /PRODUCT/SUPPORT_DATA/INPUT_DATA/surface_pressure
Can't load ancillary dataset /PRODUCT/SUPPORT_DATA/INPUT_DATA/surface_pressure
[DEBUG: 2020-12-25 13:32:46 : satpy.writers] Reading ['/mnt/raid/xin/miniconda3/envs/s5p_2021/lib/python3.8/site-packages/satpy/etc/writers/cf.yaml']
[INFO: 2020-12-25 13:32:46 : satpy.writers.cf_writer] Saving datasets to NetCDF4/CF.
/yin_raid/xin/miniconda3/envs/s5p_2021/lib/python3.8/site-packages/satpy/writers/cf_writer.py:255: UserWarning: Coordinate "longitude" referenced by dataset surface_albedo does not exist, dropping reference.
  warnings.warn('Coordinate "{}" referenced by dataset {} does not exist, dropping reference.'.format(
Traceback (most recent call last):
  File "test_save.py", line 37, in <module>
    s5p.save_datasets(filename='./output/test.nc')
  File "/yin_raid/xin/miniconda3/envs/s5p_2021/lib/python3.8/site-packages/satpy/scene.py", line 1368, in save_datasets
    return writer.save_datasets(datasets, compute=compute, **save_kwargs)
  File "/yin_raid/xin/miniconda3/envs/s5p_2021/lib/python3.8/site-packages/satpy/writers/cf_writer.py", line 704, in save_datasets
    datas, start_times, end_times = self._collect_datasets(
  File "/yin_raid/xin/miniconda3/envs/s5p_2021/lib/python3.8/site-packages/satpy/writers/cf_writer.py", line 609, in _collect_datasets
    link_coords(datas)
  File "/yin_raid/xin/miniconda3/envs/s5p_2021/lib/python3.8/site-packages/satpy/writers/cf_writer.py", line 253, in link_coords
    dataset[coord] = datas[coord]
  File "/yin_raid/xin/miniconda3/envs/s5p_2021/lib/python3.8/site-packages/xarray/core/dataarray.py", line 647, in __setitem__
    self.coords[key] = value
  File "/yin_raid/xin/miniconda3/envs/s5p_2021/lib/python3.8/site-packages/xarray/core/coordinates.py", line 40, in __setitem__
    self.update({key: value})
  File "/yin_raid/xin/miniconda3/envs/s5p_2021/lib/python3.8/site-packages/xarray/core/coordinates.py", line 118, in update
    self._update_coords(coords, indexes)
  File "/yin_raid/xin/miniconda3/envs/s5p_2021/lib/python3.8/site-packages/xarray/core/coordinates.py", line 293, in _update_coords
    raise ValueError(
ValueError: cannot add coordinates with new dimensions to a DataArray

To Reproduce

import os
import glob
from satpy.scene import Scene
#from satpy.utils import debug_on; debug_on()

s5p_nc_dir = 's5p_data'
f_s5p_pattern = os.path.join(s5p_nc_dir, '*20190725T*')
f_s5p = glob.glob(f_s5p_pattern)

s5p = Scene(f_s5p, reader='tropomi_l2')
vnames = ['surface_albedo', 'latitude']
s5p.load(vnames)
s5p.save_datasets(filename='./output/test.nc')

This issue happens when specific variables like nitrogendioxide_stratospheric_column, surface_albedo, surface_pressure, etc. are included in vnames together with longitude or latitude. Note that this issue didn't exist before #1357 was merged.

Environment Info:

djhoese commented 3 years ago

Just curious, why are you specifically loading longitude and latitude?

Could you print out the latitude DataArray? I'm curious what the coordinates, attributes, and dimensions are before trying to save it.

zxdawn commented 3 years ago

@djhoese Because longitude and latitude are original variables saved in the TROPOMI L2 data, we need to save them for later use.

Here's some information about s5p['latitude'] and s5p['surface_albedo']:

<xarray.DataArray 'latitude' (y: 389, x: 450)>
dask.array<where, shape=(389, 450), dtype=float32, chunksize=(389, 450), chunktype=numpy.ndarray>
Coordinates:
    time       datetime64[ns] 2019-07-25
    latitude   (y, x) float32 dask.array<chunksize=(389, 450), meta=np.ndarray>
    longitude  (y, x) float32 dask.array<chunksize=(389, 450), meta=np.ndarray>
Dimensions without coordinates: y, x
Attributes:
    start_time:           2019-07-25 04:27:43
    end_time:             2019-07-25 06:09:12
    long_name:            pixel center latitude
    units:                degrees_north
    standard_name:        latitude
    valid_min:            -90.0
    valid_max:            90.0
    bounds:               /PRODUCT/SUPPORT_DATA/GEOLOCATIONS/latitude_bounds
    name:                 latitude
    file_type:            tropomi_l2
    file_key:             PRODUCT/latitude
    coordinates:          ('longitude', 'latitude')
    modifiers:            ()
    platform_shortname:   S5P
    sensor:               tropomi
    _satpy_id:            DataID(name='latitude', modifiers=())
    ancillary_variables:  []

<xarray.DataArray 'surface_albedo' (y: 389, x: 450)>
dask.array<where, shape=(389, 450), dtype=float32, chunksize=(389, 450), chunktype=numpy.ndarray>
Coordinates:
    crs      object +proj=latlong +datum=WGS84 +ellps=WGS84 +type=crs
Dimensions without coordinates: y, x
Attributes:
    start_time:            2019-07-25 04:27:43
    end_time:              2019-07-25 06:09:12
    units:                 1
    standard_name:         surface_albedo
    long_name:             Surface albedo in the cloud product
    coordinates:           ('longitude', 'latitude')
    radiation_wavelength:  758.0
    name:                  surface_albedo
    file_key:              PRODUCT/SUPPORT_DATA/INPUT_DATA/surface_albedo
    file_type:             tropomi_l2
    modifiers:             ()
    platform_shortname:    S5P
    sensor:                tropomi
    area:                  Shape: (389, 450)\nLons: <xarray.DataArray 'longit...
    _satpy_id:             DataID(name='surface_albedo', modifiers=())
    ancillary_variables:   []
djhoese commented 3 years ago

If you remove the coordinates from the .attrs of the lon/lats before saving, does it give you a different error?

del scn['longitude'].attrs['coordinates']
del scn['latitude'].attrs['coordinates']
zxdawn commented 3 years ago

Same error if I add del s5p['latitude'].attrs['coordinates'].

djhoese commented 3 years ago

What if you remove the coordinates attribute from all loaded DataArrays? There may be a chance that you have to remove it from the SwathDefinition, but let's ignore that for now.

My guess, after looking at the code, is that the CF writer is looking at .attrs['coordinates'] to determine what coordinates to associate with each DataArray. This is causing 'longitude' and 'latitude' to be assigned as coordinates when the dimension names are actually 'y' and 'x', and that makes xarray unhappy. Since the error only happens when you explicitly ask for longitude and latitude to be loaded, there must be some check ("does this coordinate exist in the variables I know about?") that ignores the coordinate if it doesn't exist.

That said, you don't need to explicitly add the longitude and latitude variables to what you load because the CF writer will add them automatically from the SwathDefinition (I think).

zxdawn commented 3 years ago

Thanks for the explanation, David.

This method works well: skip the explicit loading of lon/lat and get them with lon, lat = s5p['surface_albedo'].attrs['area'].get_lonlats().
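
For reference, a minimal sketch of this workaround, reusing the names from the reproduction script above:

s5p = Scene(f_s5p, reader='tropomi_l2')
s5p.load(['surface_albedo'])  # no need to load longitude/latitude explicitly
# pyresample's SwathDefinition in .attrs['area'] provides the geolocation arrays
lon, lat = s5p['surface_albedo'].attrs['area'].get_lonlats()
s5p.save_datasets(filename='./output/test.nc')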

It looks reasonable to me. Shall we close this issue?

djhoese commented 3 years ago

I'd still like to know why the longitude and latitude datasets don't work. It seems like a bug in the CF writer.

mraspaud commented 3 years ago

@ninahakansson @sfinkens might have ideas

sfinkens commented 3 years ago

I guess this has something to do with the additional time dimension of the lat/lon coordinates. For example (I'm trying to reproduce the behaviour of the link_coords method here), this works well:

In [20]: a = xr.DataArray([[1,2], [3,4]], dims=('y', 'x'), coords={'x': [1,2], 'y': [3,4]})
In [21]: lon = a.copy()
In [22]: a['lon'] = lon
In [23]: a
Out[23]: 
<xarray.DataArray (y: 2, x: 2)>
array([[1, 2],
       [3, 4]])
Coordinates:
  * x        (x) int64 1 2
  * y        (y) int64 3 4
    lon      (y, x) int64 1 2 3 4

But with an additional time dimension it doesn't:

In [24]: lon = lon.expand_dims('time')
In [25]: a['lon'] = lon
...
ValueError: cannot add coordinates with new dimensions to a DataArray
djhoese commented 3 years ago

But why are the lon/lat coordinates getting a time dimension added to them? Or is that just a side effect of how xarray makes coordinates apply to all variables?

sfinkens commented 3 years ago

Oh, sorry. I thought the lats returned by the reader had an extra time dimension. But I misread the snippet above. It's just a scalar time coordinate, no extra dimension.

sfinkens commented 3 years ago

Aha, the CF writer is adding a time dimension if the dataset has a time coordinate: https://github.com/pytroll/satpy/blob/master/satpy/writers/cf_writer.py#L539. I don't know why, but I think dropping the time coordinate from the lats/lons

lons = lons.drop('time')

should work around the problem for now.

The link_coords method could be fixed by dropping any scalar coordinates [set(lon.coords.keys()).difference(lon.dims)] before using them.
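
As a minimal sketch of that suggestion (not the actual satpy code; reusing the toy example from the previous comment, and assuming a scalar time coordinate like the one attached by the reader):

import xarray as xr

a = xr.DataArray([[1, 2], [3, 4]], dims=('y', 'x'), coords={'x': [1, 2], 'y': [3, 4]})
lon = a.copy().assign_coords(time=0)  # scalar time coordinate, as in the reader output

# drop any coordinates that are not also dimensions before linking
non_dim_coords = set(lon.coords.keys()).difference(lon.dims)  # {'time'} here
a['lon'] = lon.drop_vars(non_dim_coords)  # now succeeds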

djhoese commented 3 years ago

The link_coords method could be fixed by dropping any scalar coordinates [set(lon.coords.keys()).difference(lon.dims)] before using them.

Maybe not just scalar but any coordinates that aren't also dimensions? Right?

zxdawn commented 3 years ago

@djhoese Yes, any coordinates that aren't also dimensions could lead to the error.

BTW, I'm curious whether the time dimension is useful for the MultiScene. If we drop the time dimension, will the MultiScene still work?

djhoese commented 3 years ago

The MultiScene does not depend on or use a time dimension at all; it only depends on the order of the Scenes you provide it. It does have the ability to create an xarray Dataset with a time dimension, using start_time if no time-based dimension is available (if I remember correctly).
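
As a hedged illustration of that last point (scn1 and scn2 are placeholder Scene objects loaded elsewhere; this assumes the timeseries blend function in satpy.multiscene, which stacks datasets along a new time dimension):

from satpy import MultiScene
from satpy.multiscene import timeseries

# only the order of the Scenes matters, not any time dimension
mscn = MultiScene([scn1, scn2])
mscn.load(['surface_albedo'])
# blending with timeseries builds a time dimension (from start_time if needed)
blended = mscn.blend(blend_function=timeseries)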

zxdawn commented 3 years ago

If I delete the time coordinates, then time_bnds isn't available.

However, as discussed in #1246, we don't need time_bnds at all.

Is it correct to create a PR to delete the time coordinates for TROPOMI data?

zxdawn commented 3 years ago

I did a quick test and found that we need to drop two things to make save_datasets() work well:

1) the time coordinates of all DataArrays

2) the coordinates attributes of longitude and latitude

If this is OK, I will create a PR.
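
A hypothetical sketch of those two drops at the user level, before calling save_datasets() (names match the reproduction script; Scene item access and assignment are assumed to work as shown):

for ds_id in list(s5p.keys()):
    # 1) drop the scalar time coordinate from every loaded DataArray
    s5p[ds_id] = s5p[ds_id].drop_vars('time', errors='ignore')
for name in ('longitude', 'latitude'):
    # 2) drop the 'coordinates' attribute of longitude and latitude
    s5p[name].attrs.pop('coordinates', None)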

djhoese commented 3 years ago

Is it possible to rename things so they work better? Or otherwise update the CF writer to handle these? I'd prefer not to remove information from the reader unless it's redundant; someone could be using it.

zxdawn commented 3 years ago

Here are some testing steps:

Try to fix by editing the YAML and reader

Step 1: Drop the time coordinates

I added this:

        # drop the time coordinate for the longitude/latitude variables
        if ds_id['name'] in ['longitude', 'latitude']:
            data = data.drop_vars(['time'])

above https://github.com/pytroll/satpy/blob/f2489726e0b64fbd4fe17a2ad36eaa2f1453f167/satpy/readers/tropomi_l2.py#L240

And I got this error:

  File "test_save.py", line 77, in <module>
    s5p.save_datasets(filename='./output/test.nc')
  File "/public/home/zhangxin/new/miniconda3/envs/s5p_tobac/lib/python3.8/site-packages/satpy/scene.py", line 1065, in save_datasets
    return writer.save_datasets(dataarrays, compute=compute, **save_kwargs)
  File "/public/home/zhangxin/new/miniconda3/envs/s5p_tobac/lib/python3.8/site-packages/satpy/writers/cf_writer.py", line 723, in save_datasets
    res = dataset.to_netcdf(filename, engine=engine, group=group_name, mode='a', encoding=encoding,
  File "/public/home/zhangxin/new/miniconda3/envs/s5p_tobac/lib/python3.8/site-packages/xarray/core/dataset.py", line 1689, in to_netcdf
    return to_netcdf(
  File "/public/home/zhangxin/new/miniconda3/envs/s5p_tobac/lib/python3.8/site-packages/xarray/backends/api.py", line 1107, in to_netcdf
    dump_to_store(
  File "/public/home/zhangxin/new/miniconda3/envs/s5p_tobac/lib/python3.8/site-packages/xarray/backends/api.py", line 1142, in dump_to_store
    variables, attrs = conventions.encode_dataset_coordinates(dataset)
  File "/public/home/zhangxin/new/miniconda3/envs/s5p_tobac/lib/python3.8/site-packages/xarray/conventions.py", line 806, in encode_dataset_coordinates
    return _encode_coordinates(
  File "/public/home/zhangxin/new/miniconda3/envs/s5p_tobac/lib/python3.8/site-packages/xarray/conventions.py", line 765, in _encode_coordinates
    written_coords.update(attrs["coordinates"].split())
AttributeError: 'list' object has no attribute 'split'
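
The failure at attrs["coordinates"].split() suggests xarray's CF encoding expects the coordinates attribute to be a space-separated string, while the reader leaves it as a tuple/list (see the .attrs dumps above). A hypothetical user-level workaround would be to stringify it before saving:

for ds_id in s5p.keys():
    coords = s5p[ds_id].attrs.get('coordinates')
    if isinstance(coords, (list, tuple)):
        # xarray wants e.g. "longitude latitude", not ('longitude', 'latitude')
        s5p[ds_id].attrs['coordinates'] = ' '.join(coords)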

Step 2: Edit the coordinates in the YAML file

I commented out the coordinates in the YAML:

datasets:
    latitude:
        name: 'latitude'
        file_type: tropomi_l2
        file_key: 'PRODUCT/latitude'
        #coordinates: [longitude, latitude]
        standard_name: latitude
    longitude:
        name: 'longitude'
        file_type: tropomi_l2
        file_key: 'PRODUCT/longitude'
        #coordinates: [longitude, latitude]
        standard_name: longitude

Then, save_datasets() works well and the NetCDF file looks like this:

netcdf test {
dimensions:
    y = 389 ;
    x = 450 ;
variables:
    float latitude(y, x) ;
.........
    float longitude(y, x) ;
................
    float cloud_fraction_crb_nitrogendioxide_window(y, x) ;
.............
        cloud_fraction_crb_nitrogendioxide_window:coordinates = "longitude latitude" ;
.................

Try to fix via link_coords

If I don't change the YAML and reader, and only add one print line above https://github.com/pytroll/satpy/blob/f2489726e0b64fbd4fe17a2ad36eaa2f1453f167/satpy/writers/cf_writer.py#L254

print(dataset[coord])

save_datasets() works again with some warnings:

/public/home/zhangxin/new/miniconda3/envs/s5p_tobac/lib/python3.8/site-packages/satpy/writers/cf_writer.py:260: UserWarning: Coordinate "longitude" referenced by dataset cloud_fraction_crb_nitrogendioxide_window does not exist, dropping reference.
  warnings.warn('Coordinate "{}" referenced by dataset {} does not exist, dropping reference.'.format(
/public/home/zhangxin/new/miniconda3/envs/s5p_tobac/lib/python3.8/site-packages/satpy/writers/cf_writer.py:260: UserWarning: Coordinate "latitude" referenced by dataset cloud_fraction_crb_nitrogendioxide_window does not exist, dropping reference.
  warnings.warn('Coordinate "{}" referenced by dataset {} does not exist, dropping reference.'.format(

The NetCDF file looks like this:

netcdf test {
dimensions:
    time = 1 ;
    y = 389 ;
    x = 450 ;
    bnds_1d = 2 ;
variables:
    int64 time(time) ;
        time:bounds = "time_bnds" ;
        time:standard_name = "time" ;
        time:units = "days since 2019-07-25" ;
        time:calendar = "proleptic_gregorian" ;
    float cloud_fraction_crb_nitrogendioxide_window(y, x) ;
.............
    float longitude(y, x, time) ;
............
    float latitude(time, y, x) ;
...........
    double time_bnds(time, bnds_1d) ;

Thoughts

I suppose the time dimension should be dropped, but I don't know the correct way to add that to the CF writer ...

djhoese commented 3 years ago

In that last test with the print, are you saying it suddenly works after printing the coordinates?

The summary of my opinion on how to fix this is: if there is something semi-standard in the TROPOMI data (like a time dimension) that fails when using the CF writer, then the CF writer should be changed to handle it.