spestana / goes-ortho

Functions for downloading GOES-R ABI imagery, orthorectifying with a DEM, creating timeseries for a single point from a stack of ABI images
https://spestana.github.io/goes-ortho/
GNU General Public License v3.0

make_abi_timeseries: "ValueError: Index has duplicate keys" on set_index step #8

Closed spestana closed 3 years ago

spestana commented 3 years ago

Attempting to run:

directory = '/storage/GOES/goes17/2021/3/'
product = 'RadC-*C14*'
data_vars = ['Rad']
lat = 37.813439
lon = -119.48451
elev = 2630
outfilepath = 'g17_Mar2021_olmsted.csv'

df = goes_ortho.make_abi_timeseries(directory, product, data_vars, lon, lat, elev, outfilepath)

Getting this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-76210dcbec04> in <module>
      7 outfilepath = 'g17_Mar2021_olmsted.csv'
      8 
----> 9 df = goes_ortho.make_abi_timeseries(directory, product, data_vars, lon, lat, elev, outfilepath)

~/git/goes-ortho/goes_ortho.py in make_abi_timeseries(directory, product, data_vars, lon, lat, z, outfilepath)
    511 
    512     # set the dataframe intext to the timestamp column
--> 513     df.set_index('time', inplace = True, verify_integrity = True)
    514 
    515     # if an output filepath was provided, save the dataframe as a csv

~/opt/anaconda3/envs/goes-linux/lib/python3.6/site-packages/pandas/core/frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
   4463         if verify_integrity and not index.is_unique:
   4464             duplicates = index[index.duplicated()].unique()
-> 4465             raise ValueError("Index has duplicate keys: {dup}".format(dup=duplicates))
   4466 
   4467         # use set to handle duplicate column names gracefully in case of drop

ValueError: Index has duplicate keys: DatetimeIndex(['2021-03-29 21:21:17.600000', '2021-03-29 21:26:17.600000',
               '2021-03-29 21:31:17.600000', '2021-03-29 21:36:17.600000',
               '2021-03-29 21:41:17.600000', '2021-03-29 21:46:17.600000',
               '2021-03-29 21:51:17.600000', '2021-03-29 21:56:17.600000',
               '2021-03-29 22:01:17.600000', '2021-03-29 22:06:17.600000',
               '2021-03-29 22:11:17.600000', '2021-03-29 22:16:17.600000',
               '2021-03-29 22:21:17.600000', '2021-03-29 22:26:17.600000'],
              dtype='datetime64[ns]', name='time', freq=None)
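
The failure above can be reproduced without any GOES files. The following is a minimal sketch, assuming only that the dataframe contains two rows with the same `time` value; the `Rad` values are made up for illustration and the timestamps are taken from the error message.

```python
import pandas as pd

# Two rows share the same timestamp, as happens when two files cover the same scan.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2021-03-29 22:06:17.600000",
        "2021-03-29 22:11:17.600000",
        "2021-03-29 22:11:17.600000",  # duplicate of the previous row's time
    ]),
    "Rad": [1.0, 2.0, 2.0],  # hypothetical values, not real radiances
})

try:
    # verify_integrity=True makes pandas raise if the new index is not unique
    df.set_index("time", verify_integrity=True)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

With `verify_integrity=False` (the pandas default) the same call would succeed and silently produce a non-unique index.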

spestana commented 3 years ago

I looked at the GOES-17 ABI L1b-RadC NetCDF files I have for 3/29/2021 22:00 UTC, and it looks like some pairs of files share the same start and end timestamps but have different creation timestamps (judging by the filenames alone):

OR_ABI-L1b-RadC-M6C14_G17_s20210882201176_e20210882203549_c20210882204004.nc
OR_ABI-L1b-RadC-M6C14_G17_s20210882201176_e20210882203549_c20210882204006.nc
OR_ABI-L1b-RadC-M6C14_G17_s20210882206176_e20210882208549_c20210882208592.nc
OR_ABI-L1b-RadC-M6C14_G17_s20210882206176_e20210882208549_c20210882209004.nc
OR_ABI-L1b-RadC-M6C14_G17_s20210882211176_e20210882213549_c20210882213594.nc
OR_ABI-L1b-RadC-M6C14_G17_s20210882211176_e20210882213549_c20210882213596.nc
OR_ABI-L1b-RadC-M6C14_G17_s20210882216176_e20210882218549_c20210882218593.nc
OR_ABI-L1b-RadC-M6C14_G17_s20210882216176_e20210882218549_c20210882219008.nc
OR_ABI-L1b-RadC-M6C14_G17_s20210882221176_e20210882223549_c20210882223597.nc
OR_ABI-L1b-RadC-M6C14_G17_s20210882221176_e20210882223549_c20210882224001.nc
OR_ABI-L1b-RadC-M6C14_G17_s20210882226176_e20210882228549_c20210882228597.nc
OR_ABI-L1b-RadC-M6C14_G17_s20210882226176_e20210882228549_c20210882229007.nc
OR_ABI-L1b-RadC-M6C14_G17_s20210882231176_e20210882233549_c20210882234013.nc
OR_ABI-L1b-RadC-M6C14_G17_s20210882236176_e20210882238549_c20210882239010.nc
OR_ABI-L1b-RadC-M6C14_G17_s20210882241176_e20210882243549_c20210882244005.nc
OR_ABI-L1b-RadC-M6C14_G17_s20210882246176_e20210882248549_c20210882248598.nc
OR_ABI-L1b-RadC-M6C14_G17_s20210882251176_e20210882253549_c20210882254000.nc
OR_ABI-L1b-RadC-M6C14_G17_s20210882256176_e20210882258549_c20210882259005.nc

For example,

'gdalinfo NETCDF:"OR_ABI-L1b-RadC-M6C14_G17_s20210882211176_e20210882213549_c20210882213594.nc":Rad' shows:

NC_GLOBAL#date_created=2021-03-29T22:13:59.4Z
...
NC_GLOBAL#time_coverage_end=2021-03-29T22:13:54.9Z
NC_GLOBAL#time_coverage_start=2021-03-29T22:11:17.6Z

and then 'gdalinfo NETCDF:"OR_ABI-L1b-RadC-M6C14_G17_s20210882211176_e20210882213549_c20210882213596.nc":Rad' shows:

NC_GLOBAL#date_created=2021-03-29T22:13:59.6Z
...
NC_GLOBAL#time_coverage_end=2021-03-29T22:13:54.9Z
NC_GLOBAL#time_coverage_start=2021-03-29T22:11:17.6Z

So these files have the same start and end times, but different creation times. Why are duplicates being created?
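
A quick way to check a directory listing for this pattern is to group filenames by their start and end fields. The sketch below is not part of goes-ortho; `parse_abi_time` and `find_duplicate_scans` are hypothetical helper names, and it assumes the `s`, `e`, and `c` fields each hold 14 digits laid out as year, day of year, HHMMSS, and tenths of a second (which is consistent with the `time_coverage_start` value reported by gdalinfo above).

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed layout of each field: YYYY + day-of-year + HHMMSS + tenths of a second
PATTERN = re.compile(r"_s(\d{14})_e(\d{14})_c(\d{14})\.nc$")


def parse_abi_time(field):
    """Convert a 14-digit filename time field to a datetime (hypothetical helper)."""
    base = datetime.strptime(field[:13], "%Y%j%H%M%S")
    return base + timedelta(milliseconds=100 * int(field[13]))


def find_duplicate_scans(filenames):
    """Group filenames by (start, end); return only groups with more than one file."""
    groups = defaultdict(list)
    for name in filenames:
        match = PATTERN.search(name)
        if match is None:
            continue  # skip names that do not follow the expected pattern
        start, end, _created = match.groups()
        groups[(start, end)].append(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


# Three filenames copied from the listing above: one duplicated scan, one unique scan
files = [
    "OR_ABI-L1b-RadC-M6C14_G17_s20210882211176_e20210882213549_c20210882213594.nc",
    "OR_ABI-L1b-RadC-M6C14_G17_s20210882211176_e20210882213549_c20210882213596.nc",
    "OR_ABI-L1b-RadC-M6C14_G17_s20210882231176_e20210882233549_c20210882234013.nc",
]

duplicates = find_duplicate_scans(files)
for (start, end), names in duplicates.items():
    print(parse_abi_time(start), len(names))
```

Grouping on the filename alone avoids opening each NetCDF file, but it only flags candidates; whether the paired files actually contain identical data would still need to be checked separately.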

spestana commented 3 years ago

These might be related:

spestana commented 3 years ago

To handle this situation (which I hope doesn't happen often going forward), I've added a df.drop_duplicates() call with keep='first'. The choice of 'first' is somewhat arbitrary: the duplicate files I'm seeing from NOAA AWS seem identical, so it shouldn't make a difference in this case.
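
For reference, a minimal sketch of this kind of fix on a toy dataframe. It is not the actual goes-ortho code: the `Rad` values are made up, and passing `subset="time"` is an assumption (the call described above may compare whole rows instead).

```python
import pandas as pd

# Toy dataframe with one repeated timestamp, mimicking two files for the same scan
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2021-03-29 22:06:17.600000",
        "2021-03-29 22:11:17.600000",
        "2021-03-29 22:11:17.600000",
    ]),
    "Rad": [1.0, 2.0, 2.0],  # hypothetical values, not real radiances
})

# Keep the first row for each timestamp.
# subset="time" is an assumption: it deduplicates on the timestamp only,
# so rows with equal times are dropped even if their data values differ.
deduped = df.drop_duplicates(subset="time", keep="first")

# The integrity check that failed before now passes
deduped = deduped.set_index("time", verify_integrity=True)
print(deduped)
```

Without `subset`, `drop_duplicates` only removes rows that match in every column, so two files with the same timestamp but slightly different values would still trip the `verify_integrity` check.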