Closed durack1 closed 4 years ago
@durack1 - This is extracted from the path. It appears this only applies to FGOALS-f3-L and FGOALS-g3. I'm not sure if the problem is on our end or in how FGOALS saved their data. I think the CMIP xmls point to scratch and the ScenarioMIP xmls point to publish (in the subset I looked at).
@taylor13 and @sashakames this might involve WIP/ESGF as we may have discovered an issue, which may need an errata raised.
Taking a peek at the errata, we have no entries for FGOALS-f3-L or for FGOALS-g3.
Spot checking on ESGF, it appears that all the metadata has ScenarioMIP, not CMIP, so it might be a problem with the directory structure on scratch?
Ok so for a randomly selected incorrect-MIP identified file we have:
-bash-4.2$ more ../xclim/CMIP6/CMIP/ssp585/atmos/mon/tas/CMIP6.CMIP.ssp585.CAS.FGOALS-g3.r1i1p1f1.mon.tas.atmos.glb-z1-gn.v20190818.0000000.0.xml | grep directory
directory ="../scratch/cmip6/CMIP/CAS/FGOALS-g3/ssp585/r1i1p1f1/Amon/tas/gn/v20190818/"
(cdat821rc1py3) bash-4.2$ ncdump -h ../scratch/cmip6/CMIP/CAS/FGOALS-g3/ssp585/r1i1p1f1/Amon/tas/gn/v20190818/tas_Amon_FGOALS-g3_ssp585_r1i1p1f1_gn_201501-201912.nc
netcdf tas_Amon_FGOALS-g3_ssp585_r1i1p1f1_gn_201501-201912 {
dimensions:
time = UNLIMITED ; // (60 currently)
lat = 80 ;
lon = 180 ;
bnds = 2 ;
variables:
...
// global attributes:
:Conventions = "CF-1.7 CMIP-6.2" ;
:activity_id = "ScenarioMIP" ;
:branch_method = "standard" ;
:branch_time_in_child = 0. ;
:branch_time_in_parent = 60225. ;
:contact = "Lijuan Li (ljli@mail.iap.ac.cn)" ;
:creation_date = "2019-08-18T13:08:09Z" ;
:data_specs_version = "01.00.31" ;
:experiment = "update of RCP8.5 based on SSP5" ;
:experiment_id = "ssp585" ;
:external_variables = "areacella" ;
:forcing_index = 1 ;
:frequency = "mon" ;
:further_info_url = "https://furtherinfo.es-doc.org/CMIP6.CAS.FGOALS-g3.ssp585.none.r1i1p1f1" ;
:grid = "native atmosphere area-weighted latxlon grid (80x180 latxlon)" ;
:grid_label = "gn" ;
:history = "2019-08-18T13:04:31Z ;rewrote data to be consistent with ScenarioMIP for variable cl found in table Amon." ;
:initialization_index = 1 ;
:institution = "Chinese Academy of Sciences, Beijing 100029, China" ;
:institution_id = "CAS" ;
:mip_era = "CMIP6" ;
:nominal_resolution = "250 km" ;
:parent_activity_id = "CMIP" ;
:parent_experiment_id = "historical" ;
:parent_mip_era = "CMIP6" ;
:parent_source_id = "FGOALS-g3" ;
:parent_time_units = "days since 1850-01-01" ;
:parent_variant_label = "r1i1p1f1" ;
:physics_index = 1 ;
:product = "model-output" ;
:realization_index = 1 ;
:realm = "atmos" ;
:run_variant = "3rd realization" ;
:source = "FGOALS-g3 (2017): \n",
"aerosol: none\n",
"atmos: GAMIL2 (180 x 90 longitude/latitude; 26 levels; top level 2.19hPa)\n",
"atmosChem: none\n",
"land: CLM4.0\n",
"landIce: none\n",
"ocean: LICOM3.0 (LICOM3.0, tripolar primarily 1deg; 360 x 218 longitude/latitude; 30 levels; top grid cell 0-10 m)\n",
"ocnBgchem: none\n",
"seaIce: CICE4.0" ;
:source_id = "FGOALS-g3" ;
:source_type = "AOGCM" ;
:sub_experiment = "none" ;
:sub_experiment_id = "none" ;
:table_id = "Amon" ;
:table_info = "Creation Date:(24 July 2019) MD5:3039b0071259358b3c55557c5f3d21bf" ;
:title = "FGOALS-g3 output prepared for CMIP6" ;
:tracking_id = "hdl:21.14100/03609f1e-62da-4fee-996f-c41f8a2488d3" ;
:variable_id = "tas" ;
:variant_label = "r1i1p1f1" ;
:license = "CMIP6 model data produced by Lawrence Livermore PCMDI is licensed under a Creative Commons Attribution ShareAlike 4.0 International License (https://creativecommons.org/licenses). Consult https://pcmdi.llnl.gov/CMIP6/TermsOfUse for terms of use governing CMIP6 output, including citation requirements and proper acknowledgment. Further information about this data, including some limitations, can be found via the further_info_url (recorded as a global attribute in this file) and at https:///pcmdi.llnl.gov/. The data producers and data providers make no warranty, either express or implied, including, but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law." ;
:cmor_version = "3.5.0" ;
}
So it looks to me like the metadata/global attributes are correct but the path is not, which means we may need to redirect the scans to the global attributes to get around these inconsistencies with problem paths.
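A minimal sketch of that path-versus-attribute check, assuming the activity is the path component immediately after the `cmip6`/`CMIP6` root, as in the paths above. The function names are hypothetical and not part of xagg, and the global attributes are passed in as a plain dict so the sketch does not depend on any particular netCDF reader.

```python
from pathlib import PurePosixPath


def activity_from_path(directory):
    """Return the path component that follows the CMIP6 root directory."""
    parts = PurePosixPath(directory).parts
    for i, part in enumerate(parts[:-1]):
        # Match both the official 'CMIP6' name and the lowercase 'cmip6' form.
        if part.lower() == "cmip6":
            return parts[i + 1]
    raise ValueError(f"no CMIP6 root found in {directory!r}")


def path_matches_metadata(directory, global_attrs):
    """True when the activity in the path equals the file's activity_id."""
    return activity_from_path(directory) == global_attrs["activity_id"]


# Values copied from the FGOALS-g3 example above.
directory = "../scratch/cmip6/CMIP/CAS/FGOALS-g3/ssp585/r1i1p1f1/Amon/tas/gn/v20190818/"
attrs = {"activity_id": "ScenarioMIP", "experiment_id": "ssp585"}
print(activity_from_path(directory), path_matches_metadata(directory, attrs))
```

For this file the path says CMIP while the attribute says ScenarioMIP, so the check flags it as inconsistent.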
The above was also confirmed with
../xclim/CMIP6/CMIP/ssp245/atmos/mon/pr/CMIP6.CMIP.ssp245.CAS.FGOALS-g3.r1i1p1f1.mon.pr.atmos.glb-2d-gn.v20190818.0000000.0.xml
I just took a look at the alternative model FGOALS-f3-L and we get
(cdat821rc1py3) bash-4.2$ more ../xclim/CMIP6/CMIP/ssp126/land/mon/mrfso/CMIP6.CMIP.ssp126.CAS.FGOALS-f3-L.r1i1p1f1.mon.mrfso.land.glb-2d-gn.v20190821.0000000.0.xml | grep directory
directory ="../esgf_publish/CMIP6/CMIP/CAS/FGOALS-f3-L/ssp126/r1i1p1f1/Lmon/mrfso/gn/v20190821/"
(cdat821rc1py3) bash-4.2$ ncdump -h ../esgf_publish/CMIP6/CMIP/CAS/FGOALS-f3-L/ssp126/r1i1p1f1/Lmon/mrfso/gn/v20190821/mrfso_Lmon_FGOALS-f3-L_ssp126_r1i1p1f1_gn_201501-210012.nc
netcdf mrfso_Lmon_FGOALS-f3-L_ssp126_r1i1p1f1_gn_201501-210012 {
dimensions:
time = UNLIMITED ; // (1032 currently)
lat = 192 ;
lon = 288 ;
bnds = 2 ;
variables:
...
// global attributes:
:Conventions = "CF-1.7 CMIP-6.2" ;
:activity_id = "ScenarioMIP" ;
:branch_method = "standard" ;
:branch_time_in_child = 59400. ;
:branch_time_in_parent = 59400. ;
:creation_date = "2019-08-21T02:01:46Z" ;
:data_specs_version = "01.00.30" ;
:experiment = "update of RCP2.6 based on SSP1" ;
:experiment_id = "ssp126" ;
:external_variables = "areacella" ;
:forcing_index = 1 ;
:frequency = "mon" ;
:grid = "native atmosphere regular grid (3x4 latxlon)" ;
:grid_label = "gn" ;
:initialization_index = 1 ;
:institution = "Chinese Academy of Sciences, Beijing 100029, China" ;
:institution_id = "CAS" ;
:mip_era = "CMIP6" ;
:nominal_resolution = "10000 km" ;
:parent_activity_id = "CMIP" ;
:parent_experiment_id = "historical" ;
:parent_mip_era = "CMIP6" ;
:parent_source_id = "FGOALS-f3-L" ;
:parent_time_units = "days since 2015-01-01" ;
:parent_variant_label = "r1i1p1f1" ;
:physics_index = 1 ;
:product = "model-output" ;
:realm = "land" ;
:run_variant = "3rd realization" ;
:source = "FGOALS-f3-L (2017): \n",
"aerosol: none\n",
"atmos: FAMIL2.2 (Cubed-sphere, c96; 360 x 180 longitude/latitude; 32 levels; top level 2.16 hPa)\n",
"atmosChem: none\n",
"land: CLM4.0\n",
"landIce: none\n",
"ocean: LICOM3.0 (LICOM3.0, tripolar primarily 1deg; 360 x 218 longitude/latitude; 30 levels; top grid cell 0-10 m)\n",
"ocnBgchem: none\n",
"seaIce: CICE4.0" ;
:source_id = "FGOALS-f3-L" ;
:source_type = "AOGCM ISM AER" ;
:sub_experiment = "none" ;
:sub_experiment_id = "none" ;
:table_id = "Lmon" ;
:table_info = "Creation Date:(09 May 2019) MD5:cde930676e68ac6780d5e4c62d3898f6" ;
:title = "FGOALS-f3-L output prepared for CMIP6" ;
:tracking_id = "hdl:21.14100/5c5a98cc-aab9-420f-9ded-cc5ac35931c2" ;
:variable_id = "mrfso" ;
:license = "CMIP6 model data produced by Lawrence Livermore PCMDI is licensed under a Creative Commons Attribution ShareAlike 4.0 International License (https://creativecommons.org/licenses). Consult https://pcmdi.llnl.gov/CMIP6/TermsOfUse for terms of use governing CMIP6 output, including citation requirements and proper acknowledgment. Further information about this data, including some limitations, can be found via the further_info_url (recorded as a global attribute in this file) and at https:///pcmdi.llnl.gov/. The data producers and data providers make no warranty, either express or implied, including, but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law." ;
:cmor_version = "3.4.0" ;
:variant_label = "r1i1p1f1" ;
:realization_index = "1" ;
:further_info_url = "https://furtherinfo.es-doc.org/CMIP6.CAS.FGOALS-f3-L.ssp126.none.r1i1p1f1" ;
:history = "Thu Sep 26 09:19:19 2019: ncatted -O -a further_info_url,global,m,c,https://furtherinfo.es-doc.org/CMIP6.CAS.FGOALS-f3-L.ssp126.none.r1i1p1f1 mrfso_Lmon_FGOALS-f3-L_ssp126_r1i1p1f1_gn_201501-210012.nc\n",
"Thu Sep 26 09:02:47 2019: ncatted -O -a further_info_url,global,m,c,https://furtherinfo.es-doc.org/CMIP6.CAS.FGOALS-f3-L.1pctCO2.none.r1i1p1f1 mrfso_Lmon_FGOALS-f3-L_ssp126_r1i1p1f1_gn_201501-210012.nc\n",
"Thu Sep 26 09:02:34 2019: ncatted -O -a realization_index,global,m,c,1 mrfso_Lmon_FGOALS-f3-L_ssp126_r1i1p1f1_gn_201501-210012.nc\n",
"Thu Sep 26 09:02:22 2019: ncatted -O -a variant_label,global,m,c,r1i1p1f1 mrfso_Lmon_FGOALS-f3-L_ssp126_r1i1p1f1_gn_201501-210012.nc\n",
"2019-08-21T02:01:46Z ;rewrote data to be consistent with ScenarioMIP for variable mrfso found in table Lmon." ;
}
So same story, metadata correct, paths not
Hawkeye @taylor13 may have eagle-eye-spotted a problem with the file directly above, 3 guesses, starts, .... now.
took about 3 minutes .... The branch time in child is inconsistent with the units and file name, which indicate it should be near 0, not 59400.
also parent_time_units are wrong.
Good catches, but not the issue I was eyeing off:
:nominal_resolution = "10000 km" ;
doesn't fit too well with the grid "atmos: FAMIL2.2 (Cubed-sphere, c96; 360 x 180 longitude/latitude;", it's probably more like 100 km
also the nominal_resolution is too large
too late, I guess.
It was a third guess, I suppose you slipped just under the cutoff
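As a rough sanity check on the nominal_resolution point, the sketch below estimates the grid spacing from the lat/lon dimensions in the ncdump header (lat = 192, lon = 288). It assumes a regular global grid and a mean Earth radius of 6371 km; it is not the official CMIP6 nominal_resolution calculation, only an order-of-magnitude check.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def grid_spacing_km(nlat, nlon):
    """Return (meridional, equatorial zonal) spacing in km for a regular global grid."""
    dlat = math.pi * EARTH_RADIUS_KM / nlat        # pole-to-pole distance / rows
    dlon = 2.0 * math.pi * EARTH_RADIUS_KM / nlon  # equatorial circumference / columns
    return dlat, dlon


# Dimensions from the FGOALS-f3-L ncdump header above.
dlat, dlon = grid_spacing_km(192, 288)
print(f"{dlat:.0f} km x {dlon:.0f} km")
```

Both spacings come out on the order of 100 km, which is two orders of magnitude below the "10000 km" recorded in the file.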
@pochedls if WE were to implement this dir-scour to metadata-scour change, we'd likely need to archive and then rerun the whole tree. How's the appetite for such an undertaking, and how many days are we talking here?
This problem is upstream of xagg (as far as I can tell) and in the end doesn't cause any problems (the published datasets have xml files in the correct place).
I don't think it is worth investing time in changing xagg for two reasons: 1) I don't think this is a problem for anyone using the xmls (again, the xml files corresponding to the published data are in the correct location) and 2) I think if we infer the activity from the netcdf files xagg will be substantially slower and I think re-factoring the code may lead to other (potentially more substantive) problems that will take time to resolve.
I will mark these files as ignored and we can revisit this issue if it does cause legitimate problems in accessing the correct data. Let me know if you object.
FYI - this issue appears to affect 50 directories. Also note that "ignoring" a dataset removes its xml and stops it from being scanned in the future.
Relevant query:
select path from paths where xmlFile like '/p/user_pub/xclim/CMIP6/CMIP/%' and experiment = 'ssp585';
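The same query can be widened to count misplaced datasets for every ssp experiment at once. This is a sketch only: it uses an in-memory SQLite table as a stand-in, since the thread does not say which database engine xagg uses, and it models just the three columns named in the query above. The rows and file names are made up for illustration.

```python
import sqlite3

# Hypothetical stand-in for the xagg database (engine and full schema unknown).
conn = sqlite3.connect(":memory:")
conn.execute("create table paths (path text, xmlFile text, experiment text)")
xml_root = "/p/user_pub/xclim/CMIP6"
conn.executemany(
    "insert into paths values (?, ?, ?)",
    [
        ("dir-a/", f"{xml_root}/CMIP/ssp585/atmos/mon/mc/a.xml", "ssp585"),
        ("dir-b/", f"{xml_root}/CMIP/ssp126/atmos/mon/hur/b.xml", "ssp126"),
        ("dir-c/", f"{xml_root}/ScenarioMIP/ssp126/atmos/mon/tas/c.xml", "ssp126"),
        ("dir-d/", f"{xml_root}/CMIP/historical/atmos/mon/tas/d.xml", "historical"),
    ],
)

# Count xml files filed under CMIP whose experiment is an ssp scenario.
query = (
    "select experiment, count(*) from paths "
    "where xmlFile like ? and experiment like 'ssp%' "
    "group by experiment order by experiment"
)
rows = conn.execute(query, (f"{xml_root}/CMIP/%",)).fetchall()
print(rows)
```

Only the two ssp rows filed under CMIP are counted; the correctly placed ScenarioMIP row and the historical row are excluded.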
Looking at all of the misplaced scenarioMIP data, there are 207 datasets (50 are in ssp585). All but five of them have a corresponding dataset in the correct location (i.e., under scenarioMIP rather than CMIP). The five exceptions are listed below (xml file followed by the underlying directory):
/p/user_pub/xclim/CMIP6/CMIP/ssp585/atmos/mon/mc/CMIP6.CMIP.ssp585.CAS.FGOALS-g3.r1i1p1f1.mon.mc.atmos.glb-l-gn.v20190818.0100000.0.xml /p/css03/scratch/cmip6/CMIP/CAS/FGOALS-g3/ssp585/r1i1p1f1/Amon/mc/gn/v20190818/
/p/user_pub/xclim/CMIP6/CMIP/ssp126/atmos/mon/hur/CMIP6.CMIP.ssp126.CAS.FGOALS-g3.r1i1p1f1.mon.hur.atmos.glb-p19-gn.v20190818.0000000.0.xml /p/css03/scratch/cmip6/CMIP/CAS/FGOALS-g3/ssp126/r1i1p1f1/Amon/hur/gn/v20190818/
/p/user_pub/xclim/CMIP6/CMIP/ssp126/atmos/mon/hus/CMIP6.CMIP.ssp126.CAS.FGOALS-g3.r1i1p1f1.mon.hus.atmos.glb-p19-gn.v20190818.0000000.0.xml /p/css03/scratch/cmip6/CMIP/CAS/FGOALS-g3/ssp126/r1i1p1f1/Amon/hus/gn/v20190818/
/p/user_pub/xclim/CMIP6/CMIP/ssp126/atmos/mon/clt/CMIP6.CMIP.ssp126.CAS.FGOALS-g3.r1i1p1f1.mon.clt.atmos.glb-2d-gn.v20190818.0000000.0.xml /p/css03/esgf_publish/CMIP6/CMIP/CAS/FGOALS-g3/ssp126/r1i1p1f1/Amon/clt/gn/v20190818/
/p/user_pub/xclim/CMIP6/CMIP/ssp126/atmos/mon/huss/CMIP6.CMIP.ssp126.CAS.FGOALS-g3.r1i1p1f1.mon.huss.atmos.glb-z1-gn.v20190818.0000000.0.xml /p/css03/scratch/cmip6/CMIP/CAS/FGOALS-g3/ssp126/r1i1p1f1/Amon/huss/gn/v20190818/
@painter1 @sashakames I wonder if the scratch issues above can easily be fixed by copying (while fixing the incorrect activity_id: CMIP -> ScenarioMIP) to esgf_publish?
The single published (esgf_publish) issue FGOALS-g3/ssp126 should really be fixed, though on a low priority, as I am sure these data are already being downloaded by others
I would rather see this kind of mistake fixed at the source, i.e. retract and publish correctly. If we were to put the FGOALS data somewhere different then anybody looking at ESGF might see similar files in different places and not know what to make of it.
Well the problem is that they published this incorrectly by manually putting the ssp's under CMIP and esgmapfile/esgpublish doesn't have the sophisticated hierarchical check. If we correct this on our end, the datasets won't be replicas on ESGF, they will have distinct dataset IDs and that could be confusing or problematic to end-users.
We have the new CMIP Inconsistency Checker (the CIC) and could add these checks for the originally published ESGF data.
@painter1 @sashakames ok so this requires an errata raised. I'll put that in the to-do list
Did the path format stay the same during publication? In the directories below, it looks like the MIP part of the path is different between scratch and publish.
/p/css03/scratch/cmip6/CMIP/CAS/FGOALS-g3/ssp126/r1i1p1f1/Amon/tas/gn/v20190818/
versus
/p/css03/esgf_publish/CMIP6/ScenarioMIP/CAS/FGOALS-g3/ssp126/r1i1p1f1/Amon/tas/gn/v20190818/
@pochedls good catch, either way I have sent an email (below) to raise an errata, if it is indeed an issue this should be logged. I will close this issue now as it is not a problem with software
From: "Durack, Paul J."
Date: Monday, August 3, 2020 at 12:40 PM
To: ljli@mail.iap*, yyq@lasg.iap*, zhanghe@mail.iap*, zhengwp@mail.iap*, bixq@mail.iap*, mhzhang@mail.iap*,
zhoutj@LASG.IAP*
Subject: FGOALS-g3 and FGOALS-f3-L CMIP6/ESGF publication problems
Hello from California.
I have reached out to you all as contacts listed for the CAS contributions to the CMIP6 simulation archive.
We discovered some problems with the publication paths for ScenarioMIP data contributed for the
FGOALS-f3-L and FGOALS-g3 models.
For the experiments ssp126, ssp245, ssp370 and ssp585, some simulation data has been erroneously
published under the “CMIP” activity_id, whereas these experiments belong to the “ScenarioMIP” activity_id.
So for e.g. the publication path
cmip6/CMIP/CAS/FGOALS-g3/ssp126/r1i1p1f1/Amon/tas/gn/v20190818/
Should be
cmip6/ScenarioMIP/CAS/FGOALS-g3/ssp126/r1i1p1f1/Amon/tas/gn/v20190818/
Could you please raise an errata at https://errata.es-doc.org/static/index.html noting that these path
issues exist, and how you plan to resolve such problems, and unpublish erroneously published data and paths.
Many thanks in advance.
P
One thing I do not understand is that, in the comments from @sashakames and @painter1, it seems like this path issue was due to the modeling center's publication choices and that this mistake should be propagated to the esgf_publish directories.
The point of my comment is that this does not appear to be the case. For the same version, the scratch and publish paths are different (in the mip part of the path). Are we fixing the paths as they are published, or was this a strange series of events that led to this difference in the paths?
@pochedls I think this was an issue that CAS is aware of, they noted it, caught it and it's now fixed, but in between those times Jeff replicated the data, and the problem. What should have happened, is that the version was incremented when the fix was made, but it looks like it wasn't. If we could get creation dates off the host filesystem it would likely tell us exactly when the fix was implemented but looks like that wasn't the case
I suppose Jeff corrected our copies. I checked, and CAS deleted (not retracted) the data, so there is no longer any record in ESGF.
@sashakames and @pochedls I did have the same question. Or rather, did the CAS fix also get replicated/duplicated, which means we have data that resides in esgf_publish that has the right activity_id?
In the example of @pochedls, I did not make any corrections. It's all automated. To understand it, I just looked at one of the files, tas_Amon_FGOALS-g3_ssp126_r1i1p1f1_gn_202001-202912.nc.
This file was first downloaded in the 'CMIP' activity, on October 21. It never got published because not all the files in its dataset were downloaded. Just one of them is missing; the database lists it as 'retracted' now, so it must have been deleted or retracted before it could be downloaded. Retraction would have made it disappear from ESGF, but I don't have a script to automatically delete files from LLNL, so the file remains at LLNL and in the Synda database.
The second copy was published again with the same version number in the 'ScenarioMIP' activity. Then it was downloaded on October 26. It happens to be the last file needed to complete its dataset, so the dataset was probably moved to esgf_publish/ that night (I haven't checked for that).
In short, the system saw the two copies of the files as completely different files because they were in different places. You would have to compare the checksums, or be an intelligent human, to suspect that they are the same.
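A minimal sketch of the checksum comparison mentioned above, assuming SHA-256 over the full file contents is an acceptable fingerprint (the thread does not say which checksum Synda or ESGF records). The helper names are hypothetical, and the demo runs on throwaway files rather than real netCDF data.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large netCDF files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_groups(root, pattern="*.nc"):
    """Group files under root by checksum; keep groups with more than one path."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]


# Demo: the same bytes stored under two different activity directories.
with tempfile.TemporaryDirectory() as root:
    for activity in ("CMIP", "ScenarioMIP"):
        target = Path(root, activity, "ssp126")
        target.mkdir(parents=True)
        (target / "tas_example.nc").write_bytes(b"identical payload")
    (Path(root, "CMIP", "ssp126") / "other_example.nc").write_bytes(b"different payload")
    found = duplicate_groups(root)
    relative = [[p.relative_to(root).as_posix() for p in group] for group in found]
print(relative)
```

Files with identical contents end up in the same group regardless of which activity directory they sit in, which is the signal the path-based bookkeeping misses.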
Looking through these files in our Synda database, I see that only the dataset and one of its files have been marked as 'retracted'. In the near future I will have to see why; it looks like I have a bug. But even if all the files had been correctly marked, you could still see them in the file system.
Thanks @painter1, this is good intel, as the publication version/timestamp is v20190818 whereas this change/correction occurred sometime between the 21st and 26th October 2019, so clearly they haven't followed the protocol as specified.
As an aside their global atts in https://github.com/pochedls/xagg/issues/29#issuecomment-666739607 show "2019-08-21T02:01:46Z ;rewrote data to be consistent with ScenarioMIP for variable mrfso found in table Lmon." ; which jibes with this publish/retract/move/publish timeline that you defined.
I believe this is a case closed situation, however we do have one (or was it 5) dataset(s) that are missing still as noted in https://github.com/pochedls/xagg/issues/29#issuecomment-666855060
I just looked at the five datasets in @pochedls's comment in #29.
For the first one, the problem is that /p/css03/scratch/cmip6/ and CMIP6/ point to the same place; but /p/css03/esgf_publish has only CMIP6/. So the dataset is published, but at /p/css03/esgf_publish/CMIP6/ScenarioMIP/CAS/FGOALS-g3/ssp585/r1i1p1f1/Amon/mc/gn/v20190818/
For the remaining four, the problem is that ICHEC hasn't published the corrected dataset yet. At least, they aren't shown on the ESGF search page (with table 'day' rather than 'Amon', you can find them).
So I agree that we have a "case closed" situation.
Oops: my comment on the first of the five datasets is really the same as Steve's comment right after he listed the five problem datasets.
The official name is CMIP6, but I added a link 'cmip6' in scratch/ to save the trouble of hitting the shift key. Maybe that was a mistake.
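One way to keep such an alias from producing two apparent locations is to resolve symlinks before comparing paths. The sketch below recreates the `cmip6` -> `CMIP6` link in a temporary directory; it assumes a case-sensitive filesystem (where both names can coexist, as in scratch/), and `same_directory` is a hypothetical helper, not an xagg function.

```python
import os
import tempfile


def same_directory(a, b):
    """True when two paths resolve to the same location after following symlinks."""
    return os.path.realpath(a) == os.path.realpath(b)


with tempfile.TemporaryDirectory() as scratch:
    # Official directory plus a lowercase convenience link pointing at it.
    os.makedirs(os.path.join(scratch, "CMIP6", "CMIP"))
    os.symlink("CMIP6", os.path.join(scratch, "cmip6"))
    via_alias = os.path.join(scratch, "cmip6", "CMIP")
    via_official = os.path.join(scratch, "CMIP6", "CMIP")
    literal_match = via_alias == via_official
    aliased = same_directory(via_alias, via_official)
print(literal_match, aliased)
```

The two strings differ, but both resolve to the same directory, so a scan that canonicalizes paths first would treat them as one location.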
The issue is closed, but I did have one point to clarify. @pochedls has noted we are missing 5 "fixed" datasets:
CMIP6.CMIP.ssp585.CAS.FGOALS-g3.r1i1p1f1.mon.mc.atmos.glb-l-gn.v20190818.0100000.0.xml
CMIP6.CMIP.ssp126.CAS.FGOALS-g3.r1i1p1f1.mon.hur.atmos.glb-p19-gn.v20190818.0000000.0.xml
CMIP6.CMIP.ssp126.CAS.FGOALS-g3.r1i1p1f1.mon.hus.atmos.glb-p19-gn.v20190818.0000000.0.xml
CMIP6.CMIP.ssp126.CAS.FGOALS-g3.r1i1p1f1.mon.clt.atmos.glb-2d-gn.v20190818.0000000.0.xml
CMIP6.CMIP.ssp126.CAS.FGOALS-g3.r1i1p1f1.mon.huss.atmos.glb-z1-gn.v20190818.0000000.0.xml
It seems that the 4 datasets exist for ssp126 and mc for ssp585 but we don't have these locally.
@durack1 , the "missing" datasets were ScenarioMIP copies of the CMIP datasets you listed. All of them exist locally as part of the CMIP activity, but the correct activity for the ssp* experiments is ScenarioMIP. As I pointed out yesterday, we actually have the ScenarioMIP copy of the ssp585 mc dataset - but the high-level directory name is the usual CMIP6, not cmip6. The other four datasets do not exist on ESGF. ICHEC, or whoever it was that originally published the data as CMIP, never finished making the correction.
@painter1 sorry if I am missing something, but if you click the links above (which points to ESGF) you'll see that all the datasets exist under the ScenarioMIP activity, as they should, so CAS (or ICHEC) has done the cleanup. It's just our local files don't reflect this, and we don't have the ScenarioMIP directory bound files, only the incorrect path files
After clicking on that ssp126 link, specify a time frequency of 'mon' and realm of 'atmos'. Then look for the variable names hur, hus, clt, huss. I can't find them. One cause of confusion here may be that a CMIP6 dataset has only one variable, unlike a CMIP5 dataset.
@painter1 apologies for wasting your time, it seems at least for clt this is daily data only. Case closed.
@pochedls I was just starting to take a look at the various MIP datasets, and stumbled upon
Is the duplication of ssp126 -> ssp585 intentional?