pytroll / satpy

Python package for earth-observing satellite data processing
http://satpy.readthedocs.org/en/latest/
GNU General Public License v3.0
1.06k stars 292 forks source link

extract scene metadata from satellite image #1675

Open willows-1 opened 3 years ago

willows-1 commented 3 years ago

Hi, I would like to know how I can extract scene metadata (such as start and end time, longitude, latitude, etc) from satellite image? Appreciate all the help I can get!

pnuu commented 3 years ago

Hi,

This is how to get the metadata you mentioned. Depending on the data you have, modify the channel name(s) to load. Inspect the dictionaries scn.attrs and scn["some_channel"].attrs to see what metadata are available.

from satpy import Scene
import glob

fnames = glob.glob("/path/to/data/for/one/time/slot/*")
scn = Scene(filenames=fnames)
scn.load(["some_channel"])
# Get coordinates for the data
lons, lats = scn["some_channel"].attrs['area'].get_lonlats()
# Get the scene start/end times
start_time = scn.attrs["start_time"]
end_time = scn.attrs["end_time"]
willows-1 commented 3 years ago

Hi pnuu, thank you very much for your reply! If I want to crop out the image and then find the longitude and latitude, is it possible? I did a short code on cropping of the image. Below is the entire code, which also includes the one you suggested:

%matplotlib notebook
import matplotlib.pyplot as plt
from matplotlib import pyplot as plt
from glob import glob

from satpy.scene import Scene
from satpy import find_files_and_readers
from datetime import datetime

from glob import glob
from rasterio.plot import plotting_extent

import cartopy
import cartopy.crs as ccrs
from cartopy._crs import (CRS, Geodetic, Globe, PROJ4_VERSION,
                          WGS84_SEMIMAJOR_AXIS, WGS84_SEMIMINOR_AXIS)
from satpy.writers import get_enhanced_image
from satpy import Scene

filenames = glob('C:/Users/binis/OneDrive/Desktop/To_Binish/ftp_h8_hsd_2pm/16_bands/*20210417_0200*.DAT')

len(filenames)

scn = Scene(reader='ahi_hsd', filenames=filenames)
my_channel = 'B01'

cropped_scn = scn.crop(ll_bbox=(103., 1.,105., 3.))

# resample from satellite native geolocation 
new_scn = cropped_scn.resample(resampler='native')
new_scn.load([my_channel])

# Get coordinates for the data
lons, lats = new_scn[my_channel].attrs['area'].get_lonlats()
# Get the scene start/end times
start_time = scn.attrs['start_time']
end_time = scn.attrs['end_time']
willows-1 commented 3 years ago

and is it possible to save these data as xarray or netcdf?

djhoese commented 3 years ago

@willows-1 It looks like you have already figured out how to get the lon/lat arrays (with your lons, lats = new_scn[my_channel].attrs['area'].get_lonlats() line). There are other methods on that AreaDefinition object for get other similar information like row/col indexes, convert x/y coordinates to lon/lat coordinates, etc.

What do you mean by saving "as xarray"? For netcdf, you can see the 'cf' writer in Satpy. Please see the Satpy documentation for more information.

willows-1 commented 3 years ago

Hi djhoese, thanks for your reply! So just to confirm, basically the metadata of satellite image is the screenshot as shown below?

WhatsApp Image 2021-05-17 at 16 37 22

willows-1 commented 3 years ago

and basically the metadata of longitude and latitude coordinates are usually in array form?

willows-1 commented 3 years ago

as for the saving the metadata as netcdf, I can follow this code below?

from satpy import Scene
import glob
filenames = glob.glob('data/H*201903011200*')
scn = Scene(filenames=filenames, reader='seviri_l1b_hrit')
scn.load(['VIS006', 'IR_108'])
scn.save_datasets(writer='cf', datasets=['VIS006', 'IR_108'], filename='seviri_test.nc',
                      exclude_attrs=['raw_metadata'])
djhoese commented 3 years ago

Regarding metadata, yes although for start/end time I would suggestion doing new_scn[my_channel].attrs['start_time']. The .attrs dictionary contains all of your metadata for a particular dataset. This should be described in the Satpy documentation.

Regarding lon/lat arrays, yes they are almost always arrays as Satpy will typically be dealing with 2D image arrays and the lon/lat arrays provide the coordinates for each pixel. If you want the coordinates for only a specific pixel then I suggest looking at the other methods of the AreaDefinition object (.attrs['area'])). I'm sorry, I don't remember the names off the top of my head right now and it depends what you want to do.

Regarding the CF writer, yes that looks correct.

willows-1 commented 3 years ago

thank you very much for the feedback. So the code will be as below?


# Get coordinates for the data
lons, lats = new_scn[my_channel].attrs['area'].get_lonlats()
# Get the scene start/end times
start_time = new_scn[my_channel].attrs['start_time']
end_time = new_scn[my_channel].attrs['end_time']```
djhoese commented 3 years ago

Yes.

willows-1 commented 3 years ago

thank you very much for your guidance!

willows-1 commented 3 years ago

as for the saving the metadata as netcdf, I can follow this code below?

from satpy import Scene
import glob
filenames = glob.glob('data/H*201903011200*')
scn = Scene(filenames=filenames, reader='seviri_l1b_hrit')
scn.load(['VIS006', 'IR_108'])
scn.save_datasets(writer='cf', datasets=['VIS006', 'IR_108'], filename='seviri_test.nc',
                      exclude_attrs=['raw_metadata'])

Hi djhoese, just a quick follow up, I tried to save the meta data using the code above but I am getting this message: "No time dimension in datasets, skipping time bounds creation". How can I edit the code to include start and end time? As in how to include time dimension in datasets when saving scene metadata?

djhoese commented 3 years ago

That message is a warning, not necessarily an error. Have you checked the actual output? It should have start and end time information in it.

willows-1 commented 3 years ago

nope, there is no start and end time information in it. Below is the screenshot of the output when I opened the .nc file:

image

djhoese commented 3 years ago

What about in the B01 variable? It should have its own attributes and I would assume start_time and end_time would be there.

willows-1 commented 3 years ago

Is this how I call the 'B01' variable? I have attached the full code below:

image

image

djhoese commented 3 years ago

Yes. And just like in Satpy you can do file['B01'].attrs['start_time'] to get that start time information.

willows-1 commented 3 years ago

Understood, but is possible to achieve the output like the screenshot below? The picture below is the output generated when I opened another .nc file. But for this one, I didnt have to call out a variable. I just had to run the code and it displayed longitude, latitude as well as the start and end time:

image

djhoese commented 3 years ago

What are you actually trying to accomplish? Do you need to recreate this exact same NetCDF structure? I don't see a start time and end time in that output? I see "utc_hrs" which is similar but not the same. The NetCDF file you are writing with Satpy from a single Scene only has a single time step so it doesn't make much sense to have a full time variable (coordinate) as it would only have a single value in it.

willows-1 commented 3 years ago

Hi djhoese, my main aim is to obtain the longitude, latitude, the start_time and end_time for each band from B01-B16. So is it better to save the metadata of each band seperately in each file, like what I did previously by using the code below? And then open the .nc data and call the variable (eg. 'B01')?

from satpy import Scene
import glob
filenames = glob.glob('data/H*201903011200*')
scn = Scene(filenames=filenames, reader='seviri_l1b_hrit')
scn.load(['VIS006', 'IR_108'])
scn.save_datasets(writer='cf', datasets=['VIS006', 'IR_108'], filename='seviri_test.nc',
                      exclude_attrs=['raw_metadata'])
djhoese commented 3 years ago

Yes, I think so. You don't have to necessarily save each band separately, but you do need to access it individually. If there is a specific way you expect or want the NetCDF file to look like then we would accept a pull request to make the necessary changes (and preserve backwards compatibility). Right now I'm not sure I fully understand what is missing from the currently produced file that isn't what you want (except for the time dimension which as I said doesn't really make sense in this case).

willows-1 commented 3 years ago

hi djhoese, sure! There was also another way that I found and that was using xr.DataArray. Below is the list of code. However, this time, the longitude and latitude data is missing. The start_time and end_time is also not present. Is there a way to modify the below code to include the longitude and latitude data and the time?

%matplotlib notebook
import matplotlib.pyplot as plt
from matplotlib import pyplot as plt
from glob import glob

from satpy.scene import Scene
from satpy import find_files_and_readers
from datetime import datetime

import xarray as xr
import numpy as np
import scipy as sp
import numpy as np

filenames = glob('C:/Users/binis/OneDrive/Desktop/To_Binish/ftp_h8_hsd_2pm/16_bands/*20210417_0200*.DAT')

scn = Scene(reader='ahi_hsd', filenames=filenames)

# load datasets from input files
scn.load(['B01'])

#crop image
cropped_scn = scn.crop(ll_bbox=(103., 1.,105., 3.))

# resample from satellite native geolocation
new_scn = cropped_scn.resample(resampler='native')

new_scn.load(['B01'])

new_scn['B01'] = xr.DataArray(new_scn['B01'],
                             dims = ['x', 'y'],
                            coords = {'time': np.datetime64('2021-04 17T00:02:00')},
                            attrs = dict(start_time = new_scn.start_time, 
                                         end_time = new_scn.end_time))

new_scn['B01']

new_scn.save_datasets(writer='cf', datasets= ['B01'], filename='B01-0200.nc',
                     exclude_attrs=['raw_metadata'])

#opening .nc file
import xarray as xr
import numpy as np
import scipy as sp

file = xr.open_dataset('C:/Users/binis/OneDrive/Desktop/To_Binish/ftp_h8_hsd_2pm/B01-0200.nc', engine = 'netcdf4')

And this was the output generated when I opened the file:

image

So in this case the longitude and latitude data is missing

willows-1 commented 3 years ago

but I think using xr.DataArray is more challenging?

djhoese commented 3 years ago

Try replacing this:

new_scn['B01'] = xr.DataArray(new_scn['B01'],
                             dims = ['x', 'y'],
                            coords = {'time': np.datetime64('2021-04 17T00:02:00')},
                            attrs = dict(start_time = new_scn.start_time, 
                                         end_time = new_scn.end_time))

with:

new_scn['B01'].coords['time'] = np.datetime64('2021-04 17T00:02:00')
djhoese commented 3 years ago

If you need both start and end time (I assume B01 only has start_time) then you can do:

new_scn['B01'].attrs['end_time'] = new_scn['B01'].attrs['start_time']

Or whatever you would like to do. I also realize that your time coordinate is hardcoded. To use start_time I think you could do:

new_scn['B01'].coords['time'] = np.datetime64(new_scn['B01'].attrs['start_time'])
willows-1 commented 3 years ago

Hmm actually B01 has both start and end time. Also, since now xr. DataArray is removed, how should I then proceed with the saving of the meta data?

djhoese commented 3 years ago

The same as you have it. This code you had:

new_scn['B01'] = xr.DataArray(new_scn['B01'],
                             dims = ['x', 'y'],
                            coords = {'time': np.datetime64('2021-04 17T00:02:00')},
                            attrs = dict(start_time = new_scn.start_time, 
                                         end_time = new_scn.end_time))

new_scn['B01']

The first chunk of code is overwriting B01 in the new_scn. The second chunk isn't doing anything (outside of printing the DataArray in a notebook). The code I told you above is adding metadata to the DataArray. new_scn['B01'] is still a DataArray and should still work in the same way with save_datasets.

willows-1 commented 3 years ago

Alright sure. So if I want to use end_time, the code will be the one below? But then it will be the same code as the start time like the one you suggested

new_scn['B01'].coords['time'] = np.datetime64(new_scn['B01'].attrs['end_time'])

willows-1 commented 3 years ago

And what about the longitude and latitude coordinates? How can I write the code?

djhoese commented 3 years ago

So if I want to use end_time, the code will be the one below? But then it will be the same code as the start time like the one you suggested

As I said, this was only needed if the end_time didn't exist. You said it already exists for the B01 variable in the file you were originally producing so you don't need it.

Your original file also had the longitude and latitude arrays so you don't need to do anything extra. They are already there. By doing the .coords['time'] = stuff we are adding to the information, not taking anything away. Everything that was there should still be there as it was in your original file.

willows-1 commented 3 years ago

alright, thank you very much!

willows-1 commented 3 years ago

Sorry djhoese, I have one more question. If I want to combine all bands into one data, how to adjust the code for "new_scn['B01'].coords['time'] = np.datetime64(new_scn['B01'].attrs['end_time'])" in order to add all the time data for all data? I am getting the error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-503d65a00bb9> in <module>
----> 1 new_scn.scn.all_dataset_names().coords['time'] = np.datetime64(new_scn.scn.all_dataset_names().attrs['end_time'])
      2 
      3 new_scn.scn.all_dataset_names().coords['time'] = np.datetime64(new_scn.scn.all_dataset_names().attrs['start_time'])

AttributeError: 'Scene' object has no attribute 'scn'

Below is the updated code that I have done so far to include all bands at once:

from satpy import Scene, MultiScene
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyresample import geometry

%matplotlib notebook
import matplotlib.pyplot as plt
from matplotlib import pyplot as plt
from glob import glob

from satpy.scene import Scene
from satpy import find_files_and_readers
from datetime import datetime

import xarray as xr
import numpy as np
import scipy as sp
import numpy as np

filenames = glob('C:/Users/binis/OneDrive/Desktop/To_Binish/ftp_h8_hsd_2pm/16_bands/*20210417_0200*.DAT')

scn = Scene(reader='ahi_hsd', filenames=filenames)
scn.load(scn.all_dataset_names())

#crop image
cropped_scn = scn.crop(ll_bbox=(103., 1.,105., 3.))

# resample from satellite native geolocation
new_scn = cropped_scn.resample(resampler='native')

new_scn.load(scn.all_dataset_names())

new_scn.scn.all_dataset_names().coords['time'] = np.datetime64(new_scn.scn.all_dataset_names()..attrs['end_time'])

new_scn.scn.all_dataset_names().coords['time'] = np.datetime64(new_scn.scn.all_dataset_names().attrs['start_time'])

new_scn.save_datasets(writer='cf', datasets= scn.all_dataset_names(), filename='All band data at 0200 version2.nc',
                     exclude_attrs=['raw_metadata'], base_dir = "C:/Users/binis/OneDrive/Desktop/To_Binish/ftp_h8_hsd_2pm")
djhoese commented 3 years ago

I think this is a small misunderstanding about how dictionaries in Python work. First, you can't do new_scn[multiple_names] and update all the names at once. You have to do one at a time. Also, you have mistake in your code where you do new_scn.scn.all_dataset_names().coords['time'] when what you meant was new_scn[scn.all_dataset_names()].coords['time'], but as I said this won't work. The traceback/error was trying to tell you this.

I personally would suggest being explicit about what bands you want.

all_names = ['B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B09', 'B10', 'B11', 'B12', 'B13', 'B14', 'B15', 'B16']
scn.load(all_names)
new_scn = cropped_scn.resample(resampler='native')
for ds_name in all_names:
    new_scn[ds_name].coords['time'] = np.datetime64(new_scn[ds_name].attrs['start_time'])

Other gotchas:

  1. You have new_scn.load(scn.all_dataset_names()) in your code. You can't load from a resampled scene.
  2. You are assigning to .coords['time'] twice. Once for end_time and once for start_time. The way the code is written and being used, you can only have one value for the time coordinate so assigning twice, first with end_time then start_time, is overwriting the end_time assignment.
willows-1 commented 3 years ago

alright, understood now, so how will I proceed with saving of the metadata? Is it the code below?

new_scn.save_datasets(writer='cf', datasets= all_names, filename='All band data at 0200 version2.nc', exclude_attrs=['raw_metadata'], base_dir = "C:/Users/binis/OneDrive/Desktop/To_Binish/ftp_h8_hsd_2pm")

djhoese commented 3 years ago

Trying running it and you let me know if you get an error.

willows-1 commented 3 years ago

yeap I got an error saying:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-e1edaf71397e> in <module>
----> 1 new_scn.save_datasets(writer='cf', datasets= all_names, filename='All band data at 0200 version2.nc', exclude_attrs=['raw_metadata'], base_dir = "C:/Users/binis/OneDrive/Desktop/To_Binish/ftp_h8_hsd_2pm")

C:\ProgramData\anaconda3\envs\satpy\lib\site-packages\satpy\scene.py in save_datasets(self, writer, filename, datasets, compute, **kwargs)
   1045                                           filename=filename,
   1046                                           **kwargs)
-> 1047         return writer.save_datasets(dataarrays, compute=compute, **save_kwargs)
   1048 
   1049     @staticmethod

C:\ProgramData\anaconda3\envs\satpy\lib\site-packages\satpy\writers\cf_writer.py in save_datasets(self, datasets, filename, groups, header_attrs, engine, epoch, flatten_attrs, exclude_attrs, include_lonlats, pretty, compression, include_orig_name, numeric_name_prefix, **to_netcdf_kwargs)
    761             if 'time' in dataset:
    762                 dataset['time_bnds'] = make_time_bounds(start_times,
--> 763                                                         end_times)
    764                 dataset['time'].attrs['bounds'] = "time_bnds"
    765                 dataset['time'].attrs['standard_name'] = "time"

C:\ProgramData\anaconda3\envs\satpy\lib\site-packages\xarray\core\dataset.py in __setitem__(self, key, value)
   1522 
   1523         else:
-> 1524             self.update({key: value})
   1525 
   1526     def __delitem__(self, key: Hashable) -> None:

C:\ProgramData\anaconda3\envs\satpy\lib\site-packages\xarray\core\dataset.py in update(self, other)
   4036         Dataset.assign
   4037         """
-> 4038         merge_result = dataset_update_method(self, other)
   4039         return self._replace(inplace=True, **merge_result._asdict())
   4040 

C:\ProgramData\anaconda3\envs\satpy\lib\site-packages\xarray\core\merge.py in dataset_update_method(dataset, other)
    976         priority_arg=1,
    977         indexes=indexes,
--> 978         combine_attrs="override",
    979     )

C:\ProgramData\anaconda3\envs\satpy\lib\site-packages\xarray\core\merge.py in merge_core(objects, compat, join, combine_attrs, priority_arg, explicit_coords, indexes, fill_value)
    619     coerced = coerce_pandas_values(objects)
    620     aligned = deep_align(
--> 621         coerced, join=join, copy=False, indexes=indexes, fill_value=fill_value
    622     )
    623     collected = collect_variables_and_indexes(aligned)

C:\ProgramData\anaconda3\envs\satpy\lib\site-packages\xarray\core\alignment.py in deep_align(objects, join, copy, indexes, exclude, raise_on_invalid, fill_value)
    428         indexes=indexes,
    429         exclude=exclude,
--> 430         fill_value=fill_value,
    431     )
    432 

C:\ProgramData\anaconda3\envs\satpy\lib\site-packages\xarray\core\alignment.py in align(join, copy, indexes, exclude, fill_value, *objects)
    334                     "aligned because they have different dimension size(s) %r "
    335                     "than the size of the aligned dimension labels: %r"
--> 336                     % (dim, unlabeled_sizes, labeled_size)
    337                 )
    338 

ValueError: arguments without labels along dimension 'time' cannot be aligned because they have different dimension size(s) {1} than the size of the aligned dimension labels: 16
djhoese commented 3 years ago

@sfinkens Any ideas what's going wrong here?

willows-1 commented 3 years ago

I tried to execute a different method by saving the metadata of each band seperately in each file, then I used the following code to combine them under one file. Is this the correct thing to do? Below is the code:

import netCDF4
import numpy as np
import xarray
from glob import glob

ds = xarray.merge([xarray.open_dataset(f) for f in glob('C:/Users/binis/OneDrive/Desktop/To_Binish/ftp_h8_hsd_2pm/17-4-21-0200-Band 1-16/*.nc')])

ds.to_netcdf('Combined band data at 0200.nc')

Then I opened the combined data file and the output looked like this:

image

willows-1 commented 3 years ago

So basically I combined metadata of Band 1-16 at 0200 time under file and did the same for other times

willows-1 commented 3 years ago

is it okay to do this?

djhoese commented 3 years ago

is it okay to do this?

"ok" is relative. If you get the output you want then it is perfectly fine. I expected your original "all-in-one" code to work, but you got an error. I've asked above to see if @sfinkens has any ideas as he was the last person I know to look at some of this time dimension stuff. However, he is also probably not going to be online much today so we'll have to wait until he has some free time.

willows-1 commented 3 years ago

sure no worries

willows-1 commented 3 years ago

I tried to execute a different method by saving the metadata of each band seperately in each file, then I used the following code to combine them under one file. Is this the correct thing to do? Below is the code:

import netCDF4
import numpy as np
import xarray
from glob import glob

ds = xarray.merge([xarray.open_dataset(f) for f in glob('C:/Users/binis/OneDrive/Desktop/To_Binish/ftp_h8_hsd_2pm/17-4-21-0200-Band 1-16/*.nc')])

ds.to_netcdf('Combined band data at 0200.nc')

Then I opened the combined data file and the output looked like this:

image

so for the alternative code I mentioned above, each of the band file was basically concatenated. But will the data be affected if I use this method? As in will I be able to retain the data when I concatenate?

djhoese commented 3 years ago

so for the alternative code I mentioned above, each of the band file was basically concatenated. But will the data be affected if I use this method? As in will I be able to retain the data when I concatenate?

I'm not sure. I've never done this before. It looks OK though.

willows-1 commented 3 years ago

alright, no worries, thank you very much for assisting me so far! so I think now we will wait for @sfinkens

sfinkens commented 3 years ago

If I understand correctly, you want the netCDF variable to have dimensions (time, y, x), right? To achieve this you need to expand the dataset dimensions as follows:

for band in mybands:
    scn[band] = scn[band].expand_dims(time=[scn[band].attrs['start_time']])
scn.save_datasets(...)

Note that the time argument in expand_dims is a list. The CF writer should probably print a more informative error message if the dataset has a "free" time coordinate.

I also see that you try to save a timeseries to netCDF. You could do that using MultiScene by creating a timeseries and saving that to netcdf. However, the second setep currently fails due to a bug in the CF writer (https://github.com/pytroll/satpy/issues/1242; there is also a code example how it's supposed to work). I think @zxdawn wanted to look into this.

zxdawn commented 3 years ago

If I understand correctly, as suggested by @sfinkens, MultiScene is the best solution for time series data.

But, the time_bnds issue is a headache ... I will try to fix it soon.

@willows-1 Could you show your expected results here? Then, I can try to make a PR to meet it and #1242.

willows-1 commented 3 years ago

hi zxdawn and sfinkens, thanks for your reply! @zxdawn, you want my expected results using time_bnds?

willows-1 commented 3 years ago

If I understand correctly, you want the netCDF variable to have dimensions (time, y, x), right? To achieve this you need to expand the dataset dimensions as follows:

for band in mybands:
    scn[band] = scn[band].expand_dims(time=[scn[band].attrs['start_time']])
scn.save_datasets(...)

Note that the time argument in expand_dims is a list. The CF writer should probably print a more informative error message if the dataset has a "free" time coordinate.

I also see that you try to save a timeseries to netCDF. You could do that using MultiScene by creating a timeseries and saving that to netcdf. However, the second setep currently fails due to a bug in the CF writer (#1242; there is also a code example how it's supposed to work). I think @zxdawn wanted to look into this.

yes thats right because I want start_time and end_time

willows-1 commented 3 years ago

I tried to execute a different method by saving the metadata of each band seperately in each file, then I used the following code to combine them under one file. Is this the correct thing to do? Below is the code:

import netCDF4
import numpy as np
import xarray
from glob import glob

ds = xarray.merge([xarray.open_dataset(f) for f in glob('C:/Users/binis/OneDrive/Desktop/To_Binish/ftp_h8_hsd_2pm/17-4-21-0200-Band 1-16/*.nc')])

ds.to_netcdf('Combined band data at 0200.nc')

Then I opened the combined data file and the output looked like this: image

so for the alternative code I mentioned above, each of the band file was basically concatenated. But will the data be affected if I use this method? As in will I be able to retain the data when I concatenate?

@zxdawn @sfinkens what about using this method? Is it the correct method to combine the netcdf file of each band by concatenating? And will the data be affected if I use this method? As in will I be able to retain the data when I concatenate?