pangeo-forge / gpcp-feedstock

A Pangeo Forge Feedstock for gpcp.
Apache License 2.0
3 stars 2 forks source link

Corrupted values in precip fields #5

Open rabernat opened 1 year ago

rabernat commented 1 year ago

There are evidently some corrupted values in the gpcp precip fields:

import xarray as xr

store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'
ds = xr.open_dataset(store, engine='zarr', chunks={})
precip_ts = ds.precip.mean(("latitude", "longitude"))
precip_ts.plot()

image

It is unclear to me at this point whether these are in the original dataset (seems unlikely) or are instead an artifact that was generated in our pipeline.

rabernat commented 1 year ago

Here is the mask I am using to filter out the corrupted data

mask = (ds.precip >= 0) & (ds.precip < 1000)
cisaacstern commented 1 year ago

The corruption appears to occur independent of execution framework (i.e., it is not to_beam-specific). Using the current production dataset, we can zoom in on an area containing two corruptions:

import xarray as xr

store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'
ds = xr.open_dataset(store, engine='zarr', chunks={})
precip_ts = ds.isel(dict(time=slice(2150, 2250))).precip.mean(("latitude", "longitude"))
precip_ts.plot()
Screen Shot 2022-07-14 at 9 58 33 AM

Then, in a notebook, we can clip the recipe's file_list to just this window using the same slice:

file_list = file_list[2150:2250]

Constructing a recipe from this clipped file_list and then running it in the notebook with recipe.to_function()() produces the same corruptions we see in the production data:

recipe.to_function()()
ds = xr.open_dataset(recipe.target_mapper, engine="zarr", chunks={})
precip_ts = ds.precip.mean(("latitude", "longitude"))
precip_ts.plot()
Screen Shot 2022-07-14 at 10 03 30 AM
cisaacstern commented 1 year ago

This actually does appear to originate in the source data itself. Zooming in, we can see that the leftmost corruption in the plots shown the previous comment occurs at index 2167:

import xarray as xr

store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'
ds = xr.open_dataset(store, engine='zarr', chunks={})
precip_ts = ds.isel(dict(time=slice(2166, 2168))).precip.mean(("latitude", "longitude"))
precip_ts.plot()
Screen Shot 2022-07-14 at 10 13 33 AM

We can then pull that specific source file out of the file_list

file_list[2167]
# --> 'https://www.ncei.noaa.gov/data/global-precipitation-climatology-project-gpcp-daily/access/2002/gpcp_v01r03_daily_d20020907_c20170530.nc'

Download it

$ wget https://www.ncei.noaa.gov/data/global-precipitation-climatology-project-gpcp-daily/access/2002/gpcp_v01r03_daily_d20020907_c20170530.nc

And open it to see that the mean precipitation for that file is indeed -28007.459

ds = xr.open_dataset("gpcp_v01r03_daily_d20020907_c20170530.nc")
ds.precip.mean(("latitude", "longitude"))
Screen Shot 2022-07-14 at 10 16 25 AM
rabernat commented 1 year ago

THIS IS WHY WE NEED PANGEO FORGE! 🚀

rabernat commented 1 year ago

From Howard Diamond at NCEI via email:

GPCP data at NCEI are mirrored from what we receive from NASA via https://www.nccs.nasa.gov/services/data-collections/atmosphere-based-products/gpcp; so, if there are any corrupted data, that would come to us from NASA; if NASA corrects that, then we would post whatever corrections they may make. We do no processing of the GPCP data.

christine-e-smit commented 1 year ago

Hello! I would like to track down where this data is coming from NASA, but I'm afraid that I don't recognize this dataset. How do I find out more about what this dataset is? I don't suppose you have a DOI for it? Or a product name and version?

rabernat commented 1 year ago

@christine-e-smit - thanks a lot for responding here! We acquired the data from NOAA NCEI and have faithfully reproduced the license information and provenance details that were available on the NCEI website:

https://github.com/pangeo-forge/gpcp-feedstock/blob/78ebfa39faafbb3eb352b3cdc0856d4b1a722b95/feedstock/meta.yaml#L12-L17

As you can see, the NCEI website (https://www.ncei.noaa.gov/products/global-precipitation-climatology-project) does not provide a DOI.

This information is what appears on the corresponding Pangeo Forge page: https://pangeo-forge.org/dashboard/feedstock/42

Screen Shot 2022-10-27 at 8 27 56 AM

Digging deeper in that page, under the "dataset" tab, I can see the following metadata embedded in the dataset

Conventions :
CF-1.6, ACDD 1.3
Metadata_Conventions :
CF-1.6, Unidata Dataset Discovery v1.0, NOAA CDR v1.0, GDS v2.0
acknowledgment :
This project was supported in part by a grant from the NOAA Climate Data Record (CDR) Program for satellites.
cdm_data_type :
Grid
cdr_program :
NOAA Climate Data Record Program for satellites, FY 2011.
cdr_variable :
precipitation
comment :
Processing computer: eagle2.umd.edu
contributor_name :
Robert Adler, Jian-Jian Wang
contributor_role :
principalInvestigator, processor and custodian
creator_email :
jjwang@umd.edu
creator_name :
Dr. Jian-Jian Wang
date_created :
2017-05-30T16:52:42Z
geospatial_lat_max :
90.0
geospatial_lat_min :
-90.0
geospatial_lat_resolution :
1 degree
geospatial_lat_units :
degrees_north
geospatial_lon_max :
360.0
geospatial_lon_min :
0.0
geospatial_lon_resolution :
1 degree
geospatial_lon_units :
degrees_east
history :
1) 2017-05-30T16:52:42Z, Dr. Jian-Jian Wang, U of Maryland, Created beta (B1) file
id :
199610/gpcp_v01r03_daily_d19961001_c20170530.nc
institution :
ACADEMIC > UMD/ESSIC > Earth System Science Interdisciplinary Center, University of Maryland
keywords :
EARTH SCIENCE > ATMOSPHERE > PRECIPITATION > PRECIPITATION AMOUNT
keywords_vocabulary :
NASA Global Change Master Directory (GCMD) Earth Science Keywords, Version 7.0
license :
No constraints on data access or use.
metadata_link :
gov.noaa.ncdc:XXXXX
naming_authority :
gov.noaa.ncdc
platform :
GOES (Geostationary Operational Environmental Satellite), GMS (Japan Geostationary Meteorological Satellite), METEOSAT, TIROS > Television Infrared Observation Satellite, DMSP (Defense Meteorological Satellite Program)
processing_level :
NASA Level 3
product_version :
v01r03
project :
GPCP > Global Precipitation Climatology Project
publisher_email :
jjwang@umd.edu
publisher_name :
NOAA National Centers for Environmental Information (NCEI)
publisher_url :
https://www.ncei.noaa.gov
references :
Huffman et al. 1997, http://dx.doi.org/10.1175/1520-0477(1997)078<0005:TGPCPG>2.0.CO;2; Adler et al. 2003, http://dx.doi.org/10.1175/1525-7541(2003)004<1147:TVGPCP>2.0.CO;2; Huffman et al. 2009, http://dx.doi.org/10.1029/2009GL040000; Adler et al. 2017, Global Precipitation Climatology Project (GPCP) Daily Analysis: Climate Algorithm Theoretical Basis Document (C-ATBD)
sensor :
Imager, TOVS > TIROS Operational Vertical Sounder, SSMI > Special Sensor Microwave/Imager
source :
/data1/GPCP_CDR/GPCP_Output/1DD//bin/199610/stfsg3.19961001.s
spatial_resolution :
1 degree
standard_name_vocabulary :
CF Standard Name Table (v41, 22 February 2017)
summary :
Global Precipitation Climatology Project (GPCP) Daily Version 1.3 gridded, merged satellite/gauge precipitation Climate data Record (CDR) from 1996 to present.
time_coverage_duration :
P1D
time_coverage_end :
1996-10-01T23:59:59Z
time_coverage_start :
1996-10-01T00:00:00Z
title :
Global Precipitation Climatatology Project (GPCP) Climate Data Record (CDR), Daily V1.3

If you want to play with the data yourself, you can do it with the following python code

import xarray as xr
store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'
ds = xr.open_dataset(store, engine='zarr', chunks={})
rabernat commented 1 year ago

What is confusing to me here is that NCEI says "GPCP data at NCEI are mirrored from what we receive from NASA". However, nowhere in the NCEI metadata or website is there any mention of this provenance. The NCEI website suggests that the data come from University of Maryland.

christine-e-smit commented 1 year ago

Thanks @rabernat! Let me see what I can track down.

christine-e-smit commented 1 year ago

It looks as though NASA's GES DISC (where I work) archives several GPCP datasets: https://disc.gsfc.nasa.gov/datasets?keywords=gpcp&page=1. Unfortunately, none of these seem to match the temporal and spatial resolution of your dataset. So I've submitted a ticket to our help desk to see if I can find someone who is familiar with the GPCP project as a whole and might be able to help.

christine-e-smit commented 1 year ago

Who is responsible for this data?

Just as I'd hoped, one of our data support scientists, Zhong Liu, was able to solve the mystery of what this data's relationship is with NASA. It seems that you are using an old version of this data and the PI, Bob Adler, has retired from NASA. Since the GES DISC is not the archive for this data, we will contact Bob and the data curator.

Why are there negative values in precipitation?

Zhong, @briannapagan, and I took a quick look at a granule for one of the problematic dates, gpcp_v01r03_daily_d20020907_c20170530.nc, which we downloaded from https://www.ncei.noaa.gov/data/global-precipitation-climatology-project-gpcp-daily/access/2002/. Our best guess is that the fill value was set incorrectly. The metadata says the fill value is -9999, but it looks like this file is accidentally using -99999:

In [1]: import xarray as xr

In [2]: import numpy as np

In [3]: ds = xr.open_dataset("gpcp_v01r03_daily_d20020907_c20170530.nc")

In [4]: np.where(ds.precip<0)
Out[4]: 
(array([0, 0, 0, ..., 0, 0, 0]),
 array([  2,   2,   2, ..., 177, 177, 177]),
 array([180, 181, 182, ...,  57,  58,  59]))

In [5]: ds.precip[0][2][180].values
Out[5]: array(-99999., dtype=float32)

In [6]: ds.precip[0][177][58].values
Out[6]: array(-99999., dtype=float32)

How can I get the newest version of this dataset?

The GES DISC archives an updated version of this collection. We have both a daily and a monthly product: https://disc.gsfc.nasa.gov/datasets?keywords=%22GPCPDAY_3.2%20GPCPMON_3.2%22&page=1. I'm guessing that you will be interested in the daily dataset: https://doi.org/10.5067/MEASURES/GPCP/DATA305. This newer version has a higher spatial resolution, at 0.5 degrees.

I know that Brianna recently added the ability to query CMR for granules. You'll want to search for shortname GPCPDAY and version 3.2.

rabernat commented 1 year ago

Thanks so much for investigating! This is so useful and interesting.

We can definitely point Pangeo Forge at the newest version of the dataset. However, NOAA NCEI is still distributing the older, corrupted version. Will you be reaching out directly to NOAA?

briannapagan commented 1 year ago

This is becoming a favorite use case of mine. For data provenance and responsibilities of data producers. As data producers we continuously update and potentially republish potentially erroneous data. One awesome feature in the future of pangeo-forge would be to understand and push those same updates. Data archives are more alive than one might think, so a single static moving of the data isn't sufficient. Not only for datasets still operational, but for corrections like these.

@rabernat we will contact the current data owners of that version.

Also would have been nice to quickly map the lineage of this dataset, without having 'insider' knowledge.

Thanks for catching this and shout out to @christine-e-smit for following up!

christine-e-smit commented 1 year ago

One more metadata comment: the original data's metadata includes a valid_range attribute. This is valid CF-1 metadata (https://cfconventions.org/cf-conventions/cf-conventions.html#missing-data) and would have been sufficient to screen out the values set to -99999. I still think that the missing_value should be consistent with what was actually put in the data, but the valid_range attribute does help.

$ ncdump -h gpcp_v01r03_daily_d20020907_c20170530.nc 
netcdf gpcp_v01r03_daily_d20020907_c20170530 {
dimensions:
    latitude = 180 ;
    longitude = 360 ;
    time = 1 ;
    nv = 2 ;
variables:
    float latitude(latitude) ;
        latitude:long_name = "Latitude" ;
        latitude:standard_name = "latitude" ;
        latitude:units = "degrees_north" ;
        latitude:valid_range = -90.f, 90.f ;
        latitude:axis = "Y" ;
        latitude:bounds = "lat_bounds" ;
...
    float precip(time, latitude, longitude) ;
        precip:long_name = "NOAA Climate Data Record (CDR) of Daily GPCP Satellite-Gauge Combined Precipitation" ;
        precip:standard_name = "lwe_precipitation_rate" ;
        precip:units = "mm/day" ;
        precip:coordinates = "time latitude longitude" ;
        precip:valid_range = 0.f, 100.f ;
        precip:cell_methods = "area: mean time: mean" ;
        precip:missing_value = -9999.f ;

// global attributes:
        :Conventions = "CF-1.6, ACDD 1.3" ;
        :title = "Global Precipitation Climatatology Project (GPCP) Climate Data Record (CDR), Daily V1.3" ;
        :source = "/data1/GPCP_CDR/GPCP_Output/1DD//bin/200209/stfsg3.20020907.s" ;
...
christine-e-smit commented 1 year ago

Thanks so much for investigating! This is so useful and interesting.

We can definitely point Pangeo Forge at the newest version of the dataset. However, NOAA NCEI is still distributing the older, corrupted version. Will you be reaching out directly to NOAA?

We are going to contact Bob since he is the PI. We'll leave it up to him to contact NOAA NCEI.

christine-e-smit commented 1 year ago

Good news! We've gotten a response from the data curator, Guojun Gu, at the University of Maryland:

Hi Zhong,

I just did a quick check. Attached is the data file for 20020907 that is archived in our machine. The missing value should be -99999, not -9999. I will look into other files. Thanks!

Guojun

christine-e-smit commented 1 year ago

This ticket has been more fun for me than it really should be. It's been a while since I've embarked on trying to solve a data mystery and the solution to this one is extremely satisfying.

I agree with @briannapagan that no one should need insider knowledge to get to the bottom of an issue like this. When I've run into references to NASA earth science products I don't recognize (which is most of them!) before, I have used the feedback link at the top of the Earthdata Search website (https://search.earthdata.nasa.gov/search). Whoever is responsible for these tickets knows what products are archived where and knows who to contact. I went straight to the GES DISC help desk this time only because it looked as though the GES DISC had related datasets.

Better lineage would also help, including things like DOIs, which work really well to help find documentation on data products, even if they can't help you to figure out what all happened along the way.