pochedls / xagg

Software to create xml links to underlying CMIP netCDF data

omitted GISS-E2-1-G f2 data? #39

Closed durack1 closed 2 years ago

durack1 commented 2 years ago

I was surprised to see that several source directories have no matching XML files, namely all of the f2 directories listed below. I note that these were listed the last time I ran my code.

source directories:

(base) bash-4.2$ ls -1 ~/CMIP6/ScenarioMIP/NASA-GISS/GISS-E2-1-G/ssp245/
r101i1p1f1
r10i1p1f2
r10i1p5f1
r10i1p5f2
r1i1p1f2
r1i1p3f1
r1i1p5f1
r1i1p5f2
r2i1p1f2
r2i1p3f1
r2i1p5f1
r2i1p5f2
r3i1p1f2
r3i1p3f1
r3i1p5f1
r3i1p5f2
r4i1p1f2
r4i1p3f1
r4i1p5f1
r4i1p5f2
r5i1p1f2
r5i1p3f1
r5i1p5f1
r5i1p5f2
r6i1p1f2
r6i1p5f1
r6i1p5f2
r7i1p1f2
r7i1p5f1
r7i1p5f2
r8i1p1f2
r8i1p5f1
r8i1p5f2
r9i1p1f2
r9i1p5f1
r9i1p5f2

And XML matches:

(base) bash-4.2$ ls -1 *GISS-E2-1-G*
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G-CC.r102i1p1f1.mon.mrro.land.glb-2d-gn.v20220115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r101i1p1f1.mon.mrro.land.glb-2d-gn.v20220115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r10i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r1i1p3f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r1i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r2i1p3f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r2i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r3i1p3f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r3i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r4i1p3f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r4i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r5i1p3f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r5i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r6i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r7i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r8i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r9i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
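
For anyone wanting to repeat the comparison above without eyeballing two listings, here is a minimal sketch. It assumes the XML filenames are dot-delimited with the model in the fifth field and the variant label in the sixth (as in the listing above), and that the model name itself contains no dots. `missing_variants` is a hypothetical helper, not part of xagg. Matching on the exact model field also avoids picking up GISS-E2-1-G-CC, which the `*GISS-E2-1-G*` glob includes.

```python
def missing_variants(source_dirs, xml_names, model):
    """Return variant labels that have a source directory but no XML file."""
    covered = set()
    for name in xml_names:
        fields = name.split(".")
        # Assumed layout: mip.activity.experiment.institution.model.variant...
        if len(fields) > 5 and fields[4] == model:
            covered.add(fields[5])
    return sorted(set(source_dirs) - covered)


# A small subset of the listings above, for illustration only.
source_dirs = ["r101i1p1f1", "r1i1p1f2", "r1i1p3f1", "r1i1p5f1", "r1i1p5f2"]
xml_names = [
    "CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G-CC.r102i1p1f1.mon.mrro.land.glb-2d-gn.v20220115.0000000.0.xml",
    "CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r101i1p1f1.mon.mrro.land.glb-2d-gn.v20220115.0000000.0.xml",
    "CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r1i1p3f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml",
    "CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r1i1p5f1.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml",
]
missing = missing_variants(source_dirs, xml_names, "GISS-E2-1-G")
print(missing)
```
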
durack1 commented 2 years ago

It seems that for GISS-E2-1-H no such issue is apparent:

source directories:

(base) bash-4.2$ ls -1 /p/css03/esgf_publish/CMIP6/ScenarioMIP/NASA-GISS/GISS-E2-1-H/ssp245/
r1i1p1f2
r1i1p3f1
r2i1p1f2
r2i1p3f1
r3i1p1f2
r3i1p3f1
r4i1p1f2
r4i1p3f1
r5i1p1f2
r5i1p3f1

And XML matches:

(base) bash-4.2$ ls -1 *GISS-E2-1-H*
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-H.r1i1p1f2.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-H.r1i1p3f1.mon.mrro.land.glb-2d-gn.v20201215.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-H.r2i1p1f2.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-H.r2i1p3f1.mon.mrro.land.glb-2d-gn.v20201215.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-H.r3i1p1f2.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-H.r3i1p3f1.mon.mrro.land.glb-2d-gn.v20201215.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-H.r4i1p1f2.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-H.r4i1p3f1.mon.mrro.land.glb-2d-gn.v20201215.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-H.r5i1p1f2.mon.mrro.land.glb-2d-gn.v20200115.0000000.0.xml
CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-H.r5i1p3f1.mon.mrro.land.glb-2d-gn.v20201215.0000000.0.xml

As a sanity check, for GISS-E2-1-G...r1i1p1f2 https://github.com/pochedls/xagg/issues/39#issue-1326474467, valid data does exist:

(base) bash-4.2$ ls -1 ~/CMIP6/ScenarioMIP/NASA-GISS/GISS-E2-1-G/ssp245/r1i1p1f2/Lmon/mrro/gn/v20200115/
mrro_Lmon_GISS-E2-1-G_ssp245_r1i1p1f2_gn_201501-205012.nc
mrro_Lmon_GISS-E2-1-G_ssp245_r1i1p1f2_gn_205101-210012.nc
pochedls commented 2 years ago

The first directory I looked at seems to have been retracted:

/p/css03/esgf_publish/CMIP6/ScenarioMIP/NASA-GISS/GISS-E2-1-G/ssp245/r1i1p1f2/Amon/tas/gn/v20200115/

/p/user_pub/xclim/retracted/CMIP6.ScenarioMIP.ssp245.NASA-GISS.GISS-E2-1-G.r1i1p1f2.mon.tas.atmos.glb-z1-gn.v20200115.0000000.0.xml

Perhaps f2 is retracted? If so, can this be closed?

durack1 commented 2 years ago

Interesting that you note it as retracted; it seems to be live on ESGF: https://esgf-node.llnl.gov/search/cmip6/?source_id=GISS-E2-1-G&variant_label=r1i1p1f2&variable_id=mrro&experiment_id=ssp245&table_id=Lmon

The version number (v20200115) appears to line up as well: https://dpesgf03.nccs.nasa.gov/thredds/fileServer/CMIP6/ScenarioMIP/NASA-GISS/GISS-E2-1-G/ssp245/r1i1p1f2/Lmon/mrro/gn/v20200115/mrro_Lmon_GISS-E2-1-G_ssp245_r1i1p1f2_gn_201501-205012.nc

durack1 commented 2 years ago

But this does note it is retracted https://errata.es-doc.org/static/view.html?uid=522ff4de-b4f3-18e6-a892-d6acd1b21847 - so which is right?

pochedls commented 2 years ago

Doesn't this mean that NASA is supposed to remove the dataset?

durack1 commented 2 years ago

@pochedls I raised a query with NASA to ascertain which info is correct. My guess is that the retraction on the nccs.nasa.gov node didn't happen, but I will update this thread when I have heard back.

Out of curiosity, where are you polling for the latest retraction info?

durack1 commented 2 years ago

Got a reply back from NASA/Gavin. I'll pull down some files today and compare the checksums/tracking_ids to ascertain whether this data is good or bad.

From: Schmidt, Gavin A.
Date: Wednesday, August 3, 2022 at 6:32 AM
To: Durack, Paul J.
Subject: Re: [EXTERNAL] GISS-E2-1-G retracted data?
The data was replaced, but it somehow kept the same date? The data that is there is good.
Sorry for the confusion.

Gavin

--

Gavin A. Schmidt
Director, Goddard Institute for Space Studies
2880 Broadway, New York NY 10025

From: "Durack, Paul J."
Date: Wednesday, August 3, 2022 at 2:33 AM
To: "Schmidt, Gavin A."
Subject: [EXTERNAL] GISS-E2-1-G retracted data?

Hi Gavin, I have a CMIP6 data query to point at you.

On the ES-DOC errata page, it seems that numerous SSPx datasets were retracted:
https://errata.es-doc.org/static/view.html?uid=522ff4de-b4f3-18e6-a892-d6acd1b21847

It seems that these data, with the exact version/date (20200115) are still live on the NASA ESGF node
(the subset that I was watching can be linked at):
https://esgf-node.llnl.gov/search/cmip6/?source_id=GISS-E2-1-G&variant_label=r1i1p1f2&variable_id=mrro&experiment_id=ssp245&table_id=Lmon

Which links to data published on the NASA node, e.g.:
https://dpesgf03.nccs.nasa.gov/thredds/fileServer/CMIP6/ScenarioMIP/NASA-GISS/GISS-E2-1-G/ssp245/r1i1p1f2/Lmon/mrro/gn/v20200115/mrro_Lmon_GISS-E2-1-G_ssp245_r1i1p1f2_gn_201501-205012.nc

Just wondering, is this an issue on your side, with the retracted data not having been unpublished, or is the data valid?

Cheers,

P
durack1 commented 2 years ago

This is relevant:

From: Zelinka, Mark
Date: Tuesday, August 2, 2022 at 9:02 AM
To: Po-Chedley, Stephen, Ames, Sasha
Subject: Re: CMIP search capabilities

Hi Sasha,

I’ve been turning up some interesting behavior when searching for the “correct” paths to our CMIP data, and Steve and I thought you might have some insights.  One example is below.  

Searching for data for GISS-E2-1-G, ssp245, r2i1p3f1, Amon, tas can yield the following path:

/p/css03/scratch/cmip6/ScenarioMIP/NASA-GISS/GISS-E2-1-G/ssp245/r2i1p3f1/Amon/tas/gn/v20200115/

Which contains:
tas_Amon_GISS-E2-1-G_ssp245_r2i1p3f1_gn_210101-215012.nc
tas_Amon_GISS-E2-1-G_ssp245_r2i1p3f1_gn_215101-220012.nc
tas_Amon_GISS-E2-1-G_ssp245_r2i1p3f1_gn_220101-225012.nc
tas_Amon_GISS-E2-1-G_ssp245_r2i1p3f1_gn_225101-230012.nc
tas_Amon_GISS-E2-1-G_ssp245_r2i1p3f1_gn_230101-235012.nc
tas_Amon_GISS-E2-1-G_ssp245_r2i1p3f1_gn_235101-240012.nc
tas_Amon_GISS-E2-1-G_ssp245_r2i1p3f1_gn_240101-245012.nc
tas_Amon_GISS-E2-1-G_ssp245_r2i1p3f1_gn_245101-250012.nc

But if your search prioritizes other criteria, it may turn up this path:
/p/css03/esgf_publish/CMIP6/ScenarioMIP/NASA-GISS/GISS-E2-1-G/ssp245/r2i1p3f1/Amon/tas/gn/v20200115/

Which contains:
tas_Amon_GISS-E2-1-G_ssp245_r2i1p3f1_gn_201501-205012.nc
tas_Amon_GISS-E2-1-G_ssp245_r2i1p3f1_gn_205101-210012.nc

Both of these are “right” in the sense that search 2 turns up the first part of the data (2015 – 2100, which resides on esgf_publish) and search 1 turns up the latter part (2100 – 2500, which is of longer duration but is on scratch).  I don’t know how common this is, where (presumably valid) non-overlapping data lives in two separate places.  But it would seem that perhaps these should live together in the same directory.  Is only the data stored in /esgf_publish/ actually published to ESGF, whereas the data on /scratch/ is not?

Mark
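
A quick way to check Mark's point that the two directories hold non-overlapping pieces of one time series is to parse the date ranges out of the filenames. This is only a sketch: it assumes monthly files whose names end in `_YYYYMM-YYYYMM.nc`, as in the listings above, and the helper names are hypothetical.

```python
import re

RANGE = re.compile(r"_(\d{6})-(\d{6})\.nc$")


def time_span(filename):
    """Return (start, end) as YYYYMM integers parsed from the filename."""
    match = RANGE.search(filename)
    if match is None:
        raise ValueError(f"no time range in {filename!r}")
    return int(match.group(1)), int(match.group(2))


def next_month(yyyymm):
    """Return the YYYYMM integer for the month after yyyymm."""
    year, month = divmod(yyyymm, 100)
    return (year + 1) * 100 + 1 if month == 12 else yyyymm + 1


def is_contiguous(filenames):
    """True if the files, sorted by start, cover time with no gap or overlap."""
    spans = sorted(time_span(f) for f in filenames)
    return all(next_month(a_end) == b_start
               for (_, a_end), (b_start, _) in zip(spans, spans[1:]))


published = [
    "tas_Amon_GISS-E2-1-G_ssp245_r2i1p3f1_gn_201501-205012.nc",
    "tas_Amon_GISS-E2-1-G_ssp245_r2i1p3f1_gn_205101-210012.nc",
]
scratch = [
    "tas_Amon_GISS-E2-1-G_ssp245_r2i1p3f1_gn_210101-215012.nc",
    "tas_Amon_GISS-E2-1-G_ssp245_r2i1p3f1_gn_215101-220012.nc",
]
print(is_contiguous(published + scratch))
```

Contiguous date ranges only show that the filenames line up; they say nothing about whether both sets come from the same corrected run.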
durack1 commented 2 years ago

This is what I found, noting that the errata was first logged on 2020-03-03 15:39:28:

local ESGF holdings, appear to be deprecated (pre-errata log) files:

$ ncdump -h ~/esgf_publish/CMIP6/ScenarioMIP/NASA-GISS/GISS-E2-1-G/ssp245/r1i1p1f2/Lmon/mrro/gn/v20200115/mrro_Lmon_GISS-E2-1-G_ssp245_r1i1p1f2_gn_201501-205012.nc | grep -E "tracking_id|creation_date"
        :tracking_id = "hdl:21.14100/b57575b2-e560-4052-bc59-1160b4270e85" ;
        :creation_date = "2020-02-06T15:05:35Z" ;
$ ncdump -h ~/esgf_publish/CMIP6/ScenarioMIP/NASA-GISS/GISS-E2-1-G/ssp245/r1i1p1f2/Lmon/mrro/gn/v20200115/mrro_Lmon_GISS-E2-1-G_ssp245_r1i1p1f2_gn_205101-210012.nc | grep -E "tracking_id|creation_date"
        :tracking_id = "hdl:21.14100/813966a6-7dcb-4f26-b59c-22b8126a0811" ;
        :creation_date = "2020-02-06T00:18:59Z" ;

Whereas the latest downloads, direct from NASA appear newer (from https://dpesgf03.nccs.nasa.gov v20200115)

$ ncdump -h ~/Downloads/mrro_Lmon_GISS-E2-1-G_ssp245_r1i1p1f2_gn_201501-205012.nc | grep -E "tracking_id|creation_date"
        :tracking_id = "hdl:21.14100/98f69897-cb22-4974-a2b8-c916c5348609" ;
        :creation_date = "2020-03-15T15:16:41Z" ;
$ ncdump -h ~/Downloads/mrro_Lmon_GISS-E2-1-G_ssp245_r1i1p1f2_gn_205101-210012.nc | grep -E "tracking_id|creation_date"
        :tracking_id = "hdl:21.14100/2a8ac05d-2b38-418e-a6db-58157cc4db77" ;
        :creation_date = "2020-03-15T15:41:29Z" ;

The newest temporal-extension data (on scratch) presumably continues the corrected and updated (2020-03-15) data available from the NASA ESGF node. The creation dates are December 2020 for the 2101 onward data, March 2020 for the corrected 2015-2100 data, and February 2020 for the deprecated 2015-2100 data that we still have in esgf_publish:

$ ncdump -h ~/scratch/cmip6/ScenarioMIP/NASA-GISS/GISS-E2-1-G/ssp245/r1i1p1f2/Lmon/mrro/gn/v20200115/mrro_Lmon_GISS-E2-1-G_ssp245_r1i1p1f2_gn_210101-215012.nc | grep creation_date
        :creation_date = "2020-12-08T02:31:56Z" ;
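
The manual comparison above could be scripted roughly as follows. This sketch parses the text printed by `ncdump -h` rather than opening the files, so it needs no netCDF library; `provenance` and `same_file` are hypothetical names, and the two header snippets are copied from the output above.

```python
import re

ATTR = re.compile(r':(tracking_id|creation_date)\s*=\s*"([^"]*)"')


def provenance(header_text):
    """Extract tracking_id and creation_date from `ncdump -h` output text."""
    return dict(ATTR.findall(header_text))


def same_file(header_a, header_b):
    """Treat two copies as the same file only if their tracking_ids match."""
    id_a = provenance(header_a).get("tracking_id")
    id_b = provenance(header_b).get("tracking_id")
    return id_a is not None and id_a == id_b


local = '''
        :tracking_id = "hdl:21.14100/b57575b2-e560-4052-bc59-1160b4270e85" ;
        :creation_date = "2020-02-06T15:05:35Z" ;
'''
remote = '''
        :tracking_id = "hdl:21.14100/98f69897-cb22-4974-a2b8-c916c5348609" ;
        :creation_date = "2020-03-15T15:16:41Z" ;
'''
print(same_file(local, remote))
```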

@sashakames @mzelinka ping

@pochedls I will close this issue as it's not a software problem but a data management issue

durack1 commented 2 years ago

@pochedls actually thinking about this, it IS a software issue if the retraction step is encoded

sashakames commented 2 years ago

Wow, didn't realize it got that bad, so NASA reused the same version twice, first to correct a problem when they should have retracted on the index, then again to extend the output....

durack1 commented 2 years ago

> Wow, didn't realize it got that bad, so NASA reused the same version twice, first to correct a problem when they should have retracted on the index, then again to extend the output....

Yep, this is one of the reasons that CMOR dynamically allocates version stamps by default - obviously NASA-GISS must have hardcoded a bunch of stuff

@mauzey1 @taylor13 @matthew-mizielinski ping as an FYI - and linking to relevant threads https://github.com/PCMDI/cmor/issues/210#issuecomment-317893707 and https://github.com/PCMDI/cmor/issues/267
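
To illustrate the idea of a dynamically allocated stamp (this is not CMOR's code, just a sketch assuming the stamp is the `vYYYYMMDD` format seen in the paths above and is derived from the date the data is written):

```python
from datetime import date


def version_stamp(day=None):
    """Return a vYYYYMMDD version string for the given date (default: today)."""
    day = day or date.today()
    return day.strftime("v%Y%m%d")


# The deprecated and replacement files were created on different days,
# so a date-derived stamp would have given them different versions.
print(version_stamp(date(2020, 2, 6)), version_stamp(date(2020, 3, 15)))
```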

pochedls commented 2 years ago

This LLNL software library did retire/retract the data based on the following metadata: mip, activity, institution, model, experiment, realization, table, realm, frequency, variableId, grid, gridLabel, version.

I'm not sure I have the bandwidth to increase the complexity of the xagg retraction process right now (to recognize that a data provider published good and retracted data with the same metadata, possibly outside of the recommended publishing methods). So I am going to keep this closed.
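
For context, a minimal sketch of why matching on that metadata alone cannot tell the two datasets apart. This is not the actual retraction code: `DatasetKey` and the helper functions are hypothetical, and the field values (in particular how `glb-2d-gn` splits into grid and gridLabel) are assumptions read off the XML filenames earlier in the thread. The tracking_ids are the ones reported above.

```python
from typing import NamedTuple


class DatasetKey(NamedTuple):
    mip: str
    activity: str
    institution: str
    model: str
    experiment: str
    realization: str
    table: str
    realm: str
    frequency: str
    variableId: str
    grid: str
    gridLabel: str
    version: str


key = DatasetKey("CMIP6", "ScenarioMIP", "NASA-GISS", "GISS-E2-1-G", "ssp245",
                 "r1i1p1f2", "Lmon", "land", "mon", "mrro",
                 "glb-2d", "gn", "v20200115")  # assumed values

OLD_ID = "hdl:21.14100/b57575b2-e560-4052-bc59-1160b4270e85"  # deprecated copy
NEW_ID = "hdl:21.14100/98f69897-cb22-4974-a2b8-c916c5348609"  # replacement

retracted_keys = {key}
retracted_ids = {OLD_ID}


def retracted_by_key(dataset_key):
    """Metadata-only check: both copies share the key, so both are flagged."""
    return dataset_key in retracted_keys


def retracted_by_id(tracking_id):
    """Per-file check: only the deprecated copy is flagged."""
    return tracking_id in retracted_ids


print(retracted_by_key(key), retracted_by_id(OLD_ID), retracted_by_id(NEW_ID))
```

A per-file check like this would need tracking_ids for the retracted files from somewhere, which is the added complexity mentioned above.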

I suggest moving this thread to email.