Fantastic @charlesbluca. Thanks so much for getting this started. I think we'll want @naomi-henderson to review.
One open question is whether we need any array type columns.
@charlesbluca, all very good questions! Here are my attempts at answers:

`dcpp_start_year` is always an integer - but my guess is that @andersy005 defined it as a float because the missing value is `np.nan`, which a plain integer column cannot hold, so the whole column gets cast to float. I think we can let it be an integer.
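For what it's worth, pandas' nullable integer dtype can represent missing values without the float cast - a minimal sketch with made-up values:

import numpy as np
import pandas as pd

# Minimal sketch: pandas' nullable "Int64" dtype can hold missing values,
# so the column does not have to be stored as float just because of np.nan.
years = pd.Series([1961.0, np.nan, 1989.0], name="dcpp_start_year")
print(years.astype("Int64"))  # 1961, <NA>, 1989 with dtype Int64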
'serious' is just jargon for whatever we want it to mean. We actually use `status`, `severity`, and `issue_uid` to decide what to allow in the regular CSV file. Here is my current code to illustrate:
dz_exclude = dz_exclude[dz_exclude.status != 'resolved']
dz_exclude = dz_exclude[dz_exclude.severity != 'low']
dz_exclude = dz_exclude[dz_exclude.issue_uid != 'b6302400-3620-c8f1-999b-d192c0349084']
dz_exclude = dz_exclude[dz_exclude.issue_uid != '45f9e7b9-1844-7a92-8b54-10d954e621db']
dz_exclude = dz_exclude[dz_exclude.issue_uid != '4aa40e49-b2d4-0b29-a6b1-c80ee8dce11a']
dz_exclude = dz_exclude[dz_exclude.issue_uid != '2f6b5963-f87e-b2df-a5b0-2f12b6b68d32']
dz_exclude = dz_exclude[dz_exclude.issue_uid != '61fb170e-91bb-4c64-8f1d-6f5e342ee421']
dz_exclude = dz_exclude[dz_exclude.issue_uid != '90cac29b-eaff-c450-8621-ea31e305a40e']
dz_exclude = dz_exclude[dz_exclude.issue_uid != '8fbd8df5-c349-315b-9ec3-5a2f4ec4ec63']
dz_exclude = dz_exclude[~((dz_exclude.issue_uid == 'eb69632c-a6e2-7667-a112-a98b7745e2ea')
                          & (dz_exclude.member_id == 'r4i1p1f1')
                          & (dz_exclude.experiment_id == 'ssp370')
                          & (dz_exclude.table_id == 'Amon'))]
Please use 'string' for the version. The controlled vocabulary actually prepends a 'v', as in 'v20190204' (though at one point one model used 'v1').
Two array-type columns we might want are for `tracking_id` (which is currently a concatenated list of the tracking_ids from all data combined into the zarr store) and the related column suggested by @rabernat, `source_URLs`.
Oh, there is another column that would be useful, perhaps called `dataset_id`. Each dataset - which corresponds to one 8-tuple + version - has a unique dataset `tracking_id`, which is different from a netcdf `tracking_id`. In a publication one would use this dataset `tracking_id` to reference all years of the same variable of a particular version of a run, rather than a list of all of the composite netcdf tracking_ids (which can be ridiculously long).
> Each dataset - which corresponds to one 8-tuple + version - has a unique dataset `tracking_id`, which is different from a netcdf `tracking_id`.
So how about:

{
  "description": "Unique ID to reference all years of the same variable of a particular version of a run",
  "mode": "REQUIRED",
  "name": "dataset_tracking_id",
  "type": "STRING"
},
{
  "description": "All of the netCDF tracking IDs that have gone into this record",
  "mode": "REPEATED",
  "name": "netcdf_tracking_ids",
  "type": "RECORD",
  "fields": [
    {
      "name": "netcdf_tracking_id",
      "type": "STRING",
      "mode": "NULLABLE"
    }
  ]
}
I based this guess on this example.
I hesitate to introduce any more special considerations, but better to get it correct from the beginning. It has to do with the `dcppA-hindcast` experiments, which are prediction experiments with each run starting at many different initialization years.
Prior to the hackathon, the intake-esm folks decided to split the `member_id` into two parts ('member_id' and 'dcpp_init_year') to aid in the parsing of `dcppA-hindcast` datasets. Unfortunately they used the same name for the column as for the original `member_id` keyword. For example, the following dataset has a `member_id = 's1961-r1i1p1f1'`, but it is given in the CSV file with the values:
zstore = `gs://cmip6/DCPP/BCC/BCC-CSM2-MR/dcppA-hindcast/s1961-r1i1p1f1/Amon/tas/gn/`
member_id = 'r1i1p1f1'
dcpp_init_year = 1961
Those interested in this experiment will probably want to use this extra intake-esm functionality and take advantage of `dcpp_init_year`, so we probably shouldn't change this now. However, we should rethink this for the BigQuery database. I would rather do it correctly.
Here is the CV description of `member_id`:
if sub_experiment_id = “none”
member_id = <variant_label>
else
member_id = <sub_experiment_id>-<variant_label>
endif
To do this correctly, the BigQuery database would use the original `member_id` with optional `sub_experiment_id` and `variant_label` columns. The CSV file can then be parsed from this, using the intake-esm re-naming. Note that `variant_label` becomes `member_id` in the CSV file, and `sub_experiment_id` is either "none" or "s????", where "????" is the `dcpp_init_year` in the CSV.
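A rough sketch of that parsing, based only on the CV rule above (the function name and return layout are just illustrative):

# Split a CV-style member_id into the pieces discussed above.
def split_member_id(member_id):
    if "-" in member_id:
        sub_experiment_id, variant_label = member_id.split("-", 1)
    else:
        sub_experiment_id, variant_label = "none", member_id
    # For dcppA-hindcast, sub_experiment_id looks like "s1961"; intake-esm
    # keeps the numeric part as dcpp_init_year.
    dcpp_init_year = int(sub_experiment_id[1:]) if sub_experiment_id != "none" else None
    return sub_experiment_id, variant_label, dcpp_init_year

split_member_id("s1961-r1i1p1f1")  # ('s1961', 'r1i1p1f1', 1961)
split_member_id("r1i1p1f1")        # ('none', 'r1i1p1f1', None)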
@charlesbluca - are you clear on how to revise your schema based on @naomi-henderson's feedback? If so, could you update your PR?
I'd like to get this moving forward.
I believe so - I've switched from the required `member_id` to a nullable `sub_experiment_id` and required `variant_label`, and added `dataset_tracking_id` and `netcdf_tracking_ids` as additional fields.

I think we can get away with defining `netcdf_tracking_ids` as a simple repeated string field:
{
  "description": "All of the netCDF tracking IDs that have gone into this record",
  "mode": "REPEATED",
  "name": "netcdf_tracking_ids",
  "type": "STRING"
}
Unless we plan to have associated metadata for each individual tracking ID, in which case an array of records would be more fitting.
I was thinking of adding the two new columns, `sub_experiment_id` and `variant_label`, in addition to `member_id`, but perhaps this is better? Are we trying to avoid redundancy by eliminating the `member_id` column? (`member_id` is what is used in the paths/names of the zarr stores.)
@charlesbluca, I don't know of any associated metadata for `netcdf_tracking_ids` - so that looks fine to me.
That was a misunderstanding on my part - I wasn't sure if you wanted to be more true to the CV description by eliminating it in favor of the two fields. If the prime concern is compatibility with intake-esm, would a suitable solution be to rename the `variant_label` column to `member_id`, and clarify in its description that it is the `variant_label` from the CV description?
Ah, the true CV description has all three keywords. No, please do not rename `variant_label` to `member_id` - that is exactly the confusion we are trying to avoid (created by intake-esm). Intake-esm will only use the CSV file - so when we generate the CSV from BigQuery, we will put `variant_label` in the `member_id` column and `sub_experiment_id` in the `dcpp_init_year` column.
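That export step could look something like this sketch (column names follow the discussion above; the handling of `sub_experiment_id` assumes the "s????" form noted earlier and is not final):

# Map a BigQuery-style row onto the intake-esm CSV columns described above.
def to_csv_row(row):
    csv_row = dict(row)
    csv_row["member_id"] = row["variant_label"]
    sub = row.get("sub_experiment_id") or "none"
    csv_row["dcpp_init_year"] = int(sub[1:]) if sub != "none" else None
    return csv_row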
Thanks for the clarification, Naomi! I added `member_id` back in as a required field, and specified that it is a compound of the `sub_experiment_id` and `variant_label`.
What's the status here? Is this done? Have you exported the catalog to BigQuery?
Not yet - I'm not sure how to proceed with populating the `dataset_tracking_id` or `netcdf_tracking_ids` columns. Are these preexisting values already associated with the datasets, or something to be generated based on the location of the datasets/their underlying netCDF files?
@naomi-henderson - can you help @charlesbluca answer this question?
@charlesbluca, the `netcdf_tracking_ids` were added to the metadata when saving the zarr stores, so they will now need to be parsed from the zarr metadata. You can see how to do it by looking at the `get_version` function in my 'cmip6collect' repository: identify.py

The function returns both the `version` and the `dataset_tracking_id` of any `zstore` url by reading the `netcdf_tracking_ids` from the metadata and looking up the `dataset_tracking_id` from a data handle site (http://hdl.handle.net/). Although I use this function to determine the `version` of the dataset, you can see how it works.
Note that there should be a 1-1 correspondence between `version` and `dataset_tracking_id`. Unfortunately, the modelers did not always follow directions, and there is not always a unique `dataset_tracking_id` and/or `version` for each set of `netcdf_tracking_ids`. In this case we will need to put a flag in the BigQuery database indicating that it is ambiguous.
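For reference, a rough sketch of that lookup - this is not Naomi's identify.py, and the "IS_PART_OF" handle record type is an assumption about how ESGF links a file handle to its parent dataset handle:

import requests

# Resolve a store's netCDF tracking ids to dataset tracking ids via the
# Handle System REST API, and report whether the result is ambiguous.
def resolve_dataset_ids(netcdf_tracking_ids):
    dataset_ids = set()
    for hdl in netcdf_tracking_ids:
        r = requests.get("https://hdl.handle.net/api/handles/" + hdl.replace("hdl:", ""))
        r.raise_for_status()
        for value in r.json()["values"]:
            if value["type"] == "IS_PART_OF":
                dataset_ids.add(value["data"]["value"])
    ambiguous = len(dataset_ids) != 1
    return dataset_ids, ambiguous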
I sent you an email today about my Zoom open office hours - just stop by and ask if you have any questions about this.
Thanks Naomi! So far the pipeline for importing the CSV to BigQuery includes flagging records whose `dataset_tracking_ids` are ambiguous (which could be represented as a boolean added to the schema?).

For now I'll work with the scripts to begin populating a table with the relevant tracking IDs. For clarity @naomi-henderson, would a `dataset_tracking_id` qualify as ambiguous if it were repeated for multiple `version`s? Or if a set of `netcdf_tracking_ids` corresponds to more than one `dataset_tracking_id`?
This pipeline looks good. As far as marking a `dataset_tracking_id` as ambiguous - both of your scenarios occur in CMIP6. For our purposes, they could both cause confusion, so both should be marked as ambiguous.
For a faster experience, you can bypass xarray when opening the stores and just call `group = zarr.open_consolidated(mapper)`. All the metadata are still available via `group.attrs`.
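For example - the attribute name "tracking_id" and the newline separator are assumptions about how the concatenated netCDF ids were written into the zarr metadata:

import gcsfs
import zarr

# Open just the consolidated metadata of a store and read the tracking ids.
fs = gcsfs.GCSFileSystem(token="anon")
mapper = fs.get_mapper("gs://cmip6/DCPP/BCC/BCC-CSM2-MR/dcppA-hindcast/s1961-r1i1p1f1/Amon/tas/gn/")
group = zarr.open_consolidated(mapper)
netcdf_tracking_ids = group.attrs["tracking_id"].split("\n")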
Also, once we make this change, we will want to switch Naomi's workflow to update the bigquery table, rather than the csv file, when she creates a new dataset. The bigquery table will now be the master source of truth. We can make a cron job to export a nightly csv copy.
@rabernat, sounds good - always more fun to learn something new.
Hopefully Charles can share some python code that inserts a record into the table to help you get started.
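Something along these lines, as a hedged sketch - the project, dataset, and table names are placeholders, and the row shows only a subset of the columns:

from google.cloud import bigquery

# Stream one new zarr store's record into the table.
client = bigquery.Client(project="my-project")
table = client.get_table("my-project.cmip6.cmip6_zarr_consolidated_stores_noQC")

row = {
    "activity_id": "DCPP",
    "institution_id": "BCC",
    "source_id": "BCC-CSM2-MR",
    "experiment_id": "dcppA-hindcast",
    "member_id": "s1961-r1i1p1f1",
    "table_id": "Amon",
    "variable_id": "tas",
    "grid_label": "gn",
    "zstore": "gs://cmip6/DCPP/BCC/BCC-CSM2-MR/dcppA-hindcast/s1961-r1i1p1f1/Amon/tas/gn/",
    "version": "v20190204",
}
errors = client.insert_rows_json(table, [row])
assert not errors, errors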
Hey @naomi-henderson, a question about this section of your code (which is a big help!):
if ';' in dataset_tracking_id:
    # multiple dataset_ids erroneously reported
    dtracks = dataset_tracking_id.split(';')
    vs = []
    for dtrack in dtracks:
        url2 = baseurl + dtrack[4:] + query2
        r = client.get(url2)
        r.raise_for_status()
        r.json()['values'][0]['data']['value']
        vs += [r.json()['values'][0]['data']['value']]
    v = sorted(vs)[-1]
In a case where we get multiple dataset IDs reported, would we want to list all of them in the table or just go with whichever one corresponds to the most recent version?
Ah, we should not be getting multiple dataset IDs, so we need to make a decision about how to flag this. In this snippet of code, I am just using the latest version because this function was written for a notebook which checks to see if there is a new version of an old dataset that should be replaced. If the version I already have is the latest available at ESGF, it does not need to be downloaded again. But this is not necessarily what we want here.
I was thinking that in the BigQuery database we should probably list the dataset ID as ambiguous and then stop - so the loop over `dtracks` can be omitted (unless you are putting the version in the new database, in which case it should also go in as ambiguous).
That makes sense - in that case, maybe a keyword such as 'ambiguous' would work in place of the `dataset_tracking_id` for these cases, instead of a flag to denote the ambiguity of a known tracking ID.
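Roughly like this, as a sketch rather than the final code:

def flag_ambiguous(dataset_tracking_id):
    # Multiple dataset ids erroneously reported: mark the record as ambiguous
    # and stop, rather than resolving each id against the handle server.
    if ';' in dataset_tracking_id:
        return 'ambiguous'
    return dataset_tracking_id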
Also, is the intent of this BigQuery migration to move all of the CMIP6 csv files into their own corresponding tables in a database, or just a select few? The schema we've been drafting reflects the columns of pangeo-cmip6.csv, but I notice the much larger cmip6.csv has other columns added in. If we wish to import those as well, I'm happy to draft out additional schemas for those tables (once I can reliably import one into BigQuery!).
@charlesbluca, it would be great if you could generate the other csv files with additional columns! But a good first step would be to work with pangeo-cmip6.csv as you are doing. Once we can actually re-generate this one from BigQuery, then we could try to add the columns needed for the other csv files. I mostly use https://cmip6.storage.googleapis.com/cmip6-zarr-consolidated-stores-noQC.csv since it has all of the information about which datasets have ES-DOC errata issues. The cmip6.csv hasn't been updated in a while because it is much more time-consuming to update - but it is also very useful, with start and stop times as well as the size of the zarr store.
Great progress, thanks!
cmip6-zarr-consolidated-stores-noQC.csv has been imported and should now be viewable through the BigQuery console. As a side note, since the JSON Lines file used to create this table was too large for a local file import, I uploaded it to gs://cmip6, where it can reside alongside its corresponding CSV. Other issues to tackle from here include cmip6.csv.

Thanks for all the help! I'm going to rename the schema in question to match up with its table name (cmip6_zarr_consolidated_stores_noQC) and move it to a dedicated directory so we can open up the possibility of additional schemas.
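For the record, the GCS-based load can also be scripted from Python - a sketch with placeholder project, path, and schema file names:

from google.cloud import bigquery

# Load a newline-delimited JSON file straight from GCS into the table,
# using the JSON schema file drafted in this PR.
client = bigquery.Client(project="my-project")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=client.schema_from_json("cmip6_zarr_consolidated_stores_noQC.json"),
)
load_job = client.load_table_from_uri(
    "gs://cmip6/cmip6-zarr-consolidated-stores-noQC.jsonl",
    "my-project.cmip6.cmip6_zarr_consolidated_stores_noQC",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish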
Thanks for all the work on this. It seems like great progress.
As a reminder of what I think the overall path is here: we want the BigQuery table to become our master source of truth about all the CMIP6 data in GCS. Ideally we don't need multiple tables. Instead, we can generate the different existing csv files as queries against the master table.
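As a sketch of that idea - the query and filter here are illustrative, not the agreed-on definition of the public csv:

from google.cloud import bigquery

# Sketch: regenerate a public csv as a query against the master table,
# e.g. keeping only stores without unresolved, non-low-severity issues.
client = bigquery.Client(project="my-project")
query = """
    SELECT * EXCEPT (status, severity, issue_uid)
    FROM `my-project.cmip6.cmip6_zarr_consolidated_stores_noQC`
    WHERE status IS NULL OR status = 'resolved' OR severity = 'low'
"""
client.query(query).to_dataframe().to_csv("pangeo-cmip6.csv", index=False)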
Some specific steps that need to happen are:
First attempt at defining a BigQuery table schema for our CMIP6 data, based on cmip6-zarr-consolidated-stores-noQC.csv. Some observations:

- Defined `dcpp_start_year` as an integer field, but noticed all the dates were formatted as floats (i.e. `1989.0`); is there a potential need for float-based values here?
- How are "serious" issues defined: by `severity`? By `good` or `resolved` for `status`?
- Defined `version` as an integer field, as all values for this field corresponded to dates - would it be worthwhile to redefine this field as a BigQuery `DATE`, keep it as is, or even redefine it as a `STRING` to broaden the field?