pangeo-forge / cmip6-pipeline

Pipeline for cloud-based CMIP6 data ingestion
Apache License 2.0

First pass at BigQuery table schema #3

Closed charlesbluca closed 4 years ago

charlesbluca commented 4 years ago

First attempt at defining a BigQuery table schema for our CMIP6 data, based on cmip6-zarr-consolidated-stores-noQC.csv. Some observations:

rabernat commented 4 years ago

Fantastic @charlesbluca. Thanks so much for getting this started. I think we'll want @naomi-henderson to review.

One open question is whether we need any array type columns.

naomi-henderson commented 4 years ago

@charlesbluca , all very good questions! Here are my attempts at answers:

Put datasets back as they are fixed in GCS

dz_exclude = dz_exclude[~((dz_exclude.issue_uid == 'eb69632c-a6e2-7667-a112-a98b7745e2ea')
                          & (dz_exclude.member_id == 'r4i1p1f1')
                          & (dz_exclude.experiment_id == 'ssp370')
                          & (dz_exclude.table_id == 'Amon'))]


- Please use 'string' for the version.  The controlled vocabulary actually prepends a 'v', as in 'v20190204' (though at one point one model used 'v1').
naomi-henderson commented 4 years ago

Two array type columns we might want are for tracking_id (which is currently a concatenated list of the tracking_ids from all data combined into the zarr store) and the related column suggested by @rabernat, source_URLs.

naomi-henderson commented 4 years ago

Oh, there is another column that would be useful, perhaps called dataset_id. Each dataset - which corresponds to one 8-tuple + version - has a unique dataset tracking_id, which is different from a netcdf tracking_id. In a publication one would use this dataset tracking_id to reference all years of the same variable of a particular version of a run rather than a list of all of the composite netcdf tracking_ids (which can be ridiculously long).

rabernat commented 4 years ago

Each dataset - which corresponds to one 8-tuple + version - has a unique dataset tracking_id, which is different from a netcdf tracking_id

So how about

  {
    "description": "Unique ID to reference all years of the same variable of a particular version of a run",
    "mode": "REQUIRED",
    "name": "dataset_tracking_id",
    "type": "STRING"
  },
  {
    "description": "All of the netCDF tracking IDs that have gone into this record",
    "mode": "REPEATED",
    "name": "netcdf_tracking_ids",
    "type": "RECORD",
    "fields": [
      {
        "name": "netcdf_tracking_id",
        "type": "STRING",
        "mode": "NULLABLE"
      }
    ]
  }

I based this guess on this example.

naomi-henderson commented 4 years ago

I hesitate to introduce any more special considerations, but it is better to get this correct from the beginning. It has to do with the dcppA-hindcast experiments, which are prediction experiments in which each run is started from many different initialization years.

Prior to the hackathon, the intake-esm folks decided to split the member_id into two parts ('member_id' and 'dcpp_init_year') to aid in the parsing of dcppA-hindcast datasets. Unfortunately they used the same name for the column as for the original member_id keyword. For example, the following dataset has a member_id = 's1961-r1i1p1f1', but it is given in the CSV file with the values:

zstore = `gs://cmip6/DCPP/BCC/BCC-CSM2-MR/dcppA-hindcast/s1961-r1i1p1f1/Amon/tas/gn/` 
member_id = 'r1i1p1f1'
dcpp_init_year = 1961

Those interested in this experiment will probably want to use this extra intake-esm functionality and take advantage of dcpp_init_year, so we probably shouldn't change this now. However, we should rethink this for the BigQuery database. I would rather do it correctly.

Here is the CV description of member_id:

if sub_experiment_id = “none”
      member_id = <variant_label>
else
      member_id = <sub_experiment_id>-<variant_label>
endif

To do this correctly, the BigQuery database would use the original member_id with optional sub_experiment_id and variant_label columns. The CSV file can then be derived from this using the intake-esm re-naming. Note that variant_label becomes member_id in the CSV file, and sub_experiment_id is either "none" or "s????", where "????" is the dcpp_init_year in the CSV.
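For illustration, here is a minimal Python sketch of that CV rule and its inverse (the helper names are hypothetical, not part of the schema):

def build_member_id(variant_label, sub_experiment_id="none"):
    # Compose the CMIP6 member_id from its CV components
    if sub_experiment_id == "none":
        return variant_label
    return f"{sub_experiment_id}-{variant_label}"

def split_member_id(member_id):
    # Inverse: recover (sub_experiment_id, variant_label) from a member_id
    if "-" in member_id:
        sub_experiment_id, variant_label = member_id.split("-", 1)
        return sub_experiment_id, variant_label
    return "none", member_id

# e.g. the dcppA-hindcast example above:
# build_member_id("r1i1p1f1", "s1961")  ->  "s1961-r1i1p1f1"
# split_member_id("s1961-r1i1p1f1")     ->  ("s1961", "r1i1p1f1")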

rabernat commented 4 years ago

@charlesbluca - are you clear on how to revise your schema based on @naomi-henderson's feedback? If so, could you update your PR?

I'd like to get this moving forward.

charlesbluca commented 4 years ago

I believe so - I've switched from the required member_id to a nullable sub_experiment_id and required variant_label, and added dataset_tracking_id and netcdf_tracking_ids as additional fields.

I think we can get away with defining netcdf_tracking_ids as a simple repeated string field:

  {
    "description": "All of the netCDF tracking IDs that have gone into this record",
    "mode": "REPEATED",
    "name": "netcdf_tracking_ids",
    "type": "STRING"
  }

Unless we plan to have associated metadata for each individual tracking ID, in which case an array of records would be more fitting.
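For comparison, a sketch of how those two tracking-ID fields would look if the table were built programmatically with the google-cloud-bigquery client (rather than loaded from the JSON schema file):

from google.cloud import bigquery

tracking_id_fields = [
    # one dataset-level ID per 8-tuple + version
    bigquery.SchemaField(
        "dataset_tracking_id", "STRING", mode="REQUIRED",
        description="Unique ID to reference all years of the same variable of a particular version of a run",
    ),
    # simple repeated string column, no nested record needed
    bigquery.SchemaField(
        "netcdf_tracking_ids", "STRING", mode="REPEATED",
        description="All of the netCDF tracking IDs that have gone into this record",
    ),
]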

naomi-henderson commented 4 years ago

I was thinking of adding the two new columns, sub_experiment_id and variant_label, in addition to member_id, but perhaps this is better? Are we trying to avoid redundancy by eliminating the member_id column (member_id = sub_experiment_id-variant_label)? To construct the paths, both columns would have to be read and concatenated to form the member_id, which is used in the paths/names of the zarr stores.

@charlesbluca , I don't know of any associated metadata for netcdf_tracking_ids - so that looks fine to me.

charlesbluca commented 4 years ago

That was a misunderstanding on my part - I wasn't sure if you wanted to be more true to the CV description by eliminating it in favor of the two fields. If the prime concern is compatibility with intake-esm, would a suitable solution be to rename the variant_label column to member_id and clarify in its description that it is the variant_label per the CV description?

naomi-henderson commented 4 years ago

Ah, the true CV description has all three keywords. No, please do not rename variant_label to member_id - that is exactly the confusion we are trying to avoid (created by intake-esm). Intake-esm will only use the CSV file - so when we generate the CSV from BigQuery, we will put variant_label in the member_id column and sub_experiment_id in the dcpp_init_year column.

charlesbluca commented 4 years ago

Thanks for the clarification, Naomi! I added member_id back in as a required field, and specified that it is a compound of the sub_experiment_id and variant_label.

rabernat commented 4 years ago

What's the status here? Is this done? Have you exported the catalog to BigQuery?

charlesbluca commented 4 years ago

Not yet - I'm not sure how to proceed with populating the dataset_tracking_id or netcdf_tracking_ids columns. Are these preexisting values already associated with the datasets, or something to be generated based on the location of the datasets/their underlying netCDF files?

rabernat commented 4 years ago

@naomi-henderson - can you help @charlesbluca answer this question?

naomi-henderson commented 4 years ago

@charlesbluca , the netcdf_tracking_ids were added to the metadata when saving the zarr stores, so they will now need to be parsed from the zarr metadata. You can see how to do it by looking at the get_version function in my 'cmip6collect' repository: identify.py

The function returns both the version and the dataset_tracking_id of any zstore url by reading the netcdf_tracking_ids from the metadata and looking up the dataset_tracking_id from a data handle site (http://hdl.handle.net/). Although I use this function to determine the version of the dataset, you can see how it works.

Note that there should be a 1-1 correspondence between version and dataset_tracking_id. Unfortunately, the modelers did not always follow directions and there is not always a unique dataset_tracking_id and/or version for each set of netcdf_tracking_ids. In this case we will need to put a flag in the bigquery database indicating that it is ambiguous.
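As a rough sketch of that lookup against the handle service (this is not the code from identify.py; the type filter and response shape used here are assumptions, so check get_version for the authoritative query):

import requests

def lookup_dataset_tracking_id(netcdf_tracking_id):
    # tracking_ids look like "hdl:21.14100/<uuid>"; the handle API wants the bare handle
    handle = netcdf_tracking_id.split("hdl:")[-1]
    # assumed type filter for the parent dataset record
    url = f"http://hdl.handle.net/api/handles/{handle}?type=IS_PART_OF"
    r = requests.get(url)
    r.raise_for_status()
    return r.json()["values"][0]["data"]["value"]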

I sent you an email today about my Zoom open office hours - just stop by and ask if you have any questions about this.

charlesbluca commented 4 years ago

Thanks Naomi! So far, the pipeline for importing the CSV into BigQuery is:

For now I'll work with the scripts to begin populating a table with the relevant tracking IDs - for clarity @naomi-henderson, would a dataset_tracking_id qualify as ambiguous if it were repeated for multiple versions? Or if a set of netcdf_tracking_ids corresponds to more than one dataset_tracking_id?

naomi-henderson commented 4 years ago

This pipeline looks good. As far as marking a dataset_tracking_id as ambiguous - both of your scenarios occur in CMIP6. For our purposes, they could both cause confusion, so should be marked as ambiguous.

rabernat commented 4 years ago

For a faster experience, you can bypass xarray when opening the stores and just call group = zarr.open_consolidated(mapper). All the metadata are still available via group.attrs
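For example, something along these lines (assuming gcsfs for the mapper; the store path is the dcppA-hindcast example quoted earlier in this thread):

import gcsfs
import zarr

fs = gcsfs.GCSFileSystem(token="anon")
mapper = fs.get_mapper("gs://cmip6/DCPP/BCC/BCC-CSM2-MR/dcppA-hindcast/s1961-r1i1p1f1/Amon/tas/gn/")

# no xarray decoding step -- just read the consolidated metadata
group = zarr.open_consolidated(mapper)
tracking_ids = group.attrs["tracking_id"]  # concatenated netCDF tracking_ids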

rabernat commented 4 years ago

Also, once we make this change, we will want to switch Naomi's workflow to update the bigquery table, rather than the csv file, when she creates a new dataset. The bigquery table will now be the master source of truth. We can make a cron job to export a nightly csv copy.
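A minimal sketch of such a nightly export with the BigQuery client (the project, dataset, and destination names below are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# placeholder names; the real project/dataset/bucket will differ
table_ref = "my-project.cmip6.cmip6_zarr_consolidated_stores_noQC"
destination_uri = "gs://my-bucket/pangeo-cmip6.csv"

# the default extract format is CSV
extract_job = client.extract_table(table_ref, destination_uri)
extract_job.result()  # wait for the job to finish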

naomi-henderson commented 4 years ago

@rabernat , sounds good - always more fun to learn something new

rabernat commented 4 years ago

Hopefully Charles can share some python code that inserts a record into the table to help you get started.
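As a starting point, a row insert might look roughly like this sketch (the table ID and row values are placeholders, with column names guessed from the pangeo-cmip6.csv layout and the schema discussed above):

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.cmip6.cmip6_zarr_consolidated_stores_noQC"  # placeholder

row = {
    "activity_id": "DCPP",
    "institution_id": "BCC",
    "source_id": "BCC-CSM2-MR",
    "experiment_id": "dcppA-hindcast",
    "member_id": "s1961-r1i1p1f1",
    "sub_experiment_id": "s1961",
    "variant_label": "r1i1p1f1",
    "table_id": "Amon",
    "variable_id": "tas",
    "grid_label": "gn",
    "version": "v20190204",
    "zstore": "gs://cmip6/DCPP/BCC/BCC-CSM2-MR/dcppA-hindcast/s1961-r1i1p1f1/Amon/tas/gn/",
    "dataset_tracking_id": "hdl:21.14100/example-dataset-uuid",  # placeholder value
    "netcdf_tracking_ids": ["hdl:21.14100/example-file-uuid"],   # repeated STRING column
}

errors = client.insert_rows_json(table_id, [row])
if errors:
    raise RuntimeError(f"insert failed: {errors}")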

charlesbluca commented 4 years ago

Hey @naomi-henderson, a question about this section of your code (which is a big help!):

if ';' in dataset_tracking_id:
    # multiple dataset_ids erroneously reported
    dtracks = dataset_tracking_id.split(';')
    vs = []
    for dtrack in dtracks:
        # look up the version for each reported dataset ID
        url2 = baseurl + dtrack[4:] + query2
        r = client.get(url2)
        r.raise_for_status()
        vs += [r.json()['values'][0]['data']['value']]
    # keep the most recent version
    v = sorted(vs)[-1]

In a case where we get multiple dataset IDs reported, would we want to list all of them in the table or just go with whichever one corresponds to the most recent version?

naomi-henderson commented 4 years ago

Ah, we should not be getting multiple dataset IDs, so we need to make a decision about how to flag this. In this snippet of code I am just using the latest version, because this function was written for a notebook which checks whether there is a new version of an old dataset that should be replaced. If the version I am already using is the latest one available at ESGF, it does not need to be downloaded. But this is not necessarily what we want here.

I was thinking that in the BigQuery database we should probably list the dataset ID as ambiguous and then stop - so the loop over dtracks can be omitted (unless you are putting the version in the new database, in which case it should also go in as ambiguous).
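In other words, the check in the snippet above could collapse to something like this sketch (variable names follow the quoted code):

if ';' in dataset_tracking_id:
    # multiple dataset IDs erroneously reported -- flag and stop,
    # rather than resolving a version for each one
    dataset_tracking_id = 'ambiguous'
    version = 'ambiguous'  # only if version is also stored in the new table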

charlesbluca commented 4 years ago

That makes sense - in that case, maybe a keyword such as 'ambiguous' could be stored in place of the dataset_tracking_id, instead of a separate flag denoting the ambiguity of a known tracking ID.

Also, is the intent of this BigQuery migration to move all of the CMIP6 CSV files into their own corresponding tables in the database, or just a select few? The schema we've been drafting reflects the columns of pangeo-cmip6.csv, but I notice that the much larger cmip6.csv has other columns added in. If we wish to import those as well, I'm happy to draft additional schemas for those tables (once I can reliably import one into BigQuery!).

naomi-henderson commented 4 years ago

@charlesbluca , it would be great if you could generate the other CSV files with additional columns! But a good first step would be to work with pangeo-cmip6.csv as you are doing. Once we can actually re-generate this one from BigQuery, then we can try to add the columns needed for the other CSV files. I mostly use https://cmip6.storage.googleapis.com/cmip6-zarr-consolidated-stores-noQC.csv since it has all of the information about which datasets have ES-DOC errata issues. The cmip6.csv hasn't been updated in a while because it is much more time-consuming to update - but it is also very useful, with start and stop times as well as the size of each zarr store.

Great progress, thanks!

charlesbluca commented 4 years ago

cmip6-zarr-consolidated-stores-noQC.csv has been imported and should now be viewable through the BigQuery console. As a side note, since the JSON Lines file used to create this table was too large for a local file import, I uploaded it to gs://cmip6, where it can reside alongside its corresponding CSV. Other issues to tackle from here:

Thanks for all the help! I'm going to rename the schema in question to match up with its table name (cmip6_zarr_consolidated_stores_noQC) and move it to a dedicated directory so we can open up the possibility of additional schemas.
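For reference, the GCS-based load can be reproduced with a load job along these lines (the table ID, bucket path, and schema filename below are placeholders, and the schema file is assumed to be in BigQuery's JSON schema format):

import json
from google.cloud import bigquery

client = bigquery.Client()

# placeholder names/paths
table_id = "my-project.cmip6.cmip6_zarr_consolidated_stores_noQC"
uri = "gs://cmip6/cmip6-zarr-consolidated-stores-noQC.jsonl"

# load the drafted schema from its JSON file
with open("cmip6_zarr_consolidated_stores_noQC.json") as f:
    schema = [bigquery.SchemaField.from_api_repr(field) for field in json.load(f)]

job_config = bigquery.LoadJobConfig(
    schema=schema,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the job to finish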

rabernat commented 4 years ago

Thanks for all the work on this. It seems like great progress.

As a reminder, here is what I think the overall path is: we want the BigQuery table to become our master source of truth about all the CMIP6 data in GCS. Ideally we don't need multiple tables; instead, we can generate the existing CSV files as queries against the master table.

Some specific steps that need to happen are:

  1. Refactor @naomi-henderson's data processing routines to query/update BigQuery rather than CSVs.
  2. Create a cron job that generates the CSVs nightly from BigQuery. (We need to keep the CSVs up to date for backwards compatibility and interoperability--not everyone can read BigQuery.)
  3. Once the BigQuery table is the official source of truth, we need to move it out of our project and into the cmip6 project. This will require talking to Google and setting up the appropriate service accounts.