mintproject / MINT-DataCatalog-Public

Public-facing aspects of data catalog, such as documentation, demos, tracking issues, and feature requests
Apache License 2.0

Consolidate dataset, standard name, and variable data #4

Open brandomr opened 5 years ago

brandomr commented 5 years ago

Overview

This issue proposes a consolidated structure for DCAT query results, so that each dataset DCAT returns carries a full picture of itself: all available information about each variable, plus each variable's associated standard variable information.

Current State

In order to gain a full picture of a dataset, you currently must:

  1. Query for the dataset by ID to obtain its variables
  2. Query for the standard variables related to the dataset
  3. For each variable, query for the associated standard variable
  4. Consolidate results

Example of Current State Process

For example, we can first query for a dataset:

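import requests

# `url` (the DCAT API base URL) and `request_headers` are assumed to be defined elsewhere.
q = {
  "dataset_id": "00ee1157-bbff-4625-852d-e010f44679e4"
}

# Step 1: fetch the dataset along with its variables.
resp = requests.post(f"{url}/datasets/dataset_variables",
                     headers=request_headers,
                     json=q).json()

dataset = resp['dataset']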

This returns the dataset and, for each of its variables, the variable name, metadata, and variable ID, but no standard variable information.

Next, we must find the standard variables associated with the dataset:

resp = requests.post(f"{url}/datasets/dataset_standard_variables",
                     headers=request_headers,
                     json=q).json()
std_vars = resp['dataset']['standard_variables']
std_vars_dict = {}
for var in std_vars:
    std_vars_dict[var['standard_variable_id']] = var

This query returns the standard variables associated with the dataset, but not which variables each standard variable relates to. We store each standard variable's information (name and URI) in a lookup dictionary keyed by its ID, so it can be retrieved later.
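The lookup step can also be written as a dictionary comprehension. This is a minimal sketch that stands in for a live response: the two sample entries are copied from the consolidated output shown later in this issue, and the names `std_vars` and `std_vars_dict` follow the snippet above.

```python
# Sample entries shaped like resp['dataset']['standard_variables'];
# the values are copied from the consolidated output in this issue.
std_vars = [
    {
        "standard_variable_id": "f7d62db8-a470-503a-80d3-c987181c6ca8",
        "standard_variable_name": "moisture",
        "standard_variable_uri": "http://www.geoscienceontology.org/svo/svl/attribute#moisture",
    },
    {
        "standard_variable_id": "df1daca4-d727-5dc8-bfa4-fb20c717a32b",
        "standard_variable_name": "year",
        "standard_variable_uri": "http://www.geoscienceontology.org/svo/svl/property#year",
    },
]

# Key each standard variable record by its ID.
std_vars_dict = {var["standard_variable_id"]: var for var in std_vars}

# Look up a record by ID and print its name.
print(std_vars_dict["df1daca4-d727-5dc8-bfa4-fb20c717a32b"]["standard_variable_name"])
```

This prints `year`. Note that a single standard variable can be shared by several dataset variables, which is why the lookup is keyed by standard variable ID rather than by variable.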

Now, for each variable within the dataset, we have to find its associated standard variables:

for v in dataset['variables']:
    # Obtain standard name information for the variable  

    q = {
      "variable_ids__in": [v['variable_id']]
    }

    resp = requests.post(f"{url}/variables/variables_standard_variables",
                         headers=request_headers,
                         json=q).json()

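    # The query holds a single ID, so take the first returned variable.
    v_ = resp['variables'][0]
    v['standard_names'] = []
    for std in v_['standard_variables']:
        # Resolve each linked ID to the full record stored in the lookup.
        std_var = std_vars_dict[std['standard_variable_id']]
        v['standard_names'].append(std_var)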

After the loop, `dataset` is a consolidated structure that includes variable information and each variable's associated standard variable information:

{'dataset_id': '00ee1157-bbff-4625-852d-e010f44679e4',
 'dataset_name': 'DSSAT Simplified Input Data',
 'variables': [{'variable_id': '0b7d0908-ff14-4b8a-a082-0cbf0eca27c4',
   'variable_name': 'fractionalAW',
   'variable_metadata': {},
   'standard_names': [{'standard_variable_id': 'f7d62db8-a470-503a-80d3-c987181c6ca8',
     'standard_variable_name': 'moisture',
     'standard_variable_uri': 'http://www.geoscienceontology.org/svo/svl/attribute#moisture'}]},
  {'variable_id': '0fd06920-e19f-49ad-bdf4-b278bf246bd6',
   'variable_name': 'startYear',
   'variable_metadata': {},
   'standard_names': [{'standard_variable_id': 'df1daca4-d727-5dc8-bfa4-fb20c717a32b',
     'standard_variable_name': 'year',
     'standard_variable_uri': 'http://www.geoscienceontology.org/svo/svl/property#year'}]},
  {'variable_id': '3b2d79be-56b7-40f7-85fa-11997fce8720',
   'variable_name': 'plantingDayOfMonth',
   'variable_metadata': {},
   'standard_names': [{'standard_variable_id': '3276f43e-82a1-5caf-a627-598e6bc04503',
     'standard_variable_name': 'planting_date',
     'standard_variable_uri': 'http://www.geoscienceontology.org/svo/svl/property#planting_date'}]},
  {'variable_id': '7c3a410d-17b5-48a1-99da-d51877c319cf',
   'variable_name': 'incorporationDepth',
   'variable_metadata': {},
   'standard_names': [{'standard_variable_id': '69a7996f-e953-56cb-bb9f-213456b1efab',
     'standard_variable_name': 'crop_planting__planting_depth',
     'standard_variable_uri': 'http://www.geoscienceontology.org/svo/svl/variable#crop_planting__planting_depth'}]},
  {'variable_id': '8b6d913f-5475-4006-b32e-c8b60d16d5b1',
   'variable_name': 'incorporationRate',
   'variable_metadata': {},
   'standard_names': [{'standard_variable_id': '18d2bc6c-c4cd-51a2-a03e-5cffe4dcef02',
     'standard_variable_name': '__planting_separation_distance',
     'standard_variable_uri': 'http://www.geoscienceontology.org/svo/svl/variable#__planting_separation_distance'}]},
  {'variable_id': '8bfff40c-ba75-4a70-87f4-8aa254c8227c',
   'variable_name': 'runYears',
   'variable_metadata': {},
   'standard_names': [{'standard_variable_id': 'df1daca4-d727-5dc8-bfa4-fb20c717a32b',
     'standard_variable_name': 'year',
     'standard_variable_uri': 'http://www.geoscienceontology.org/svo/svl/property#year'}]},
  {'variable_id': 'b35db5a5-ff8d-4819-969f-77779387fa90',
   'variable_name': 'plantingWindow',
   'variable_metadata': {},
   'standard_names': [{'standard_variable_id': '3276f43e-82a1-5caf-a627-598e6bc04503',
     'standard_variable_name': 'planting_date',
     'standard_variable_uri': 'http://www.geoscienceontology.org/svo/svl/property#planting_date'}]}]}
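
Steps 3 and 4 of the current process can be separated from the HTTP calls so that the merge logic is easy to check on its own. The sketch below is an illustration under stated assumptions, not part of DCAT: `consolidate` is a hypothetical helper name, and `variable_links` is assumed to be a list of entries shaped like the items of `resp['variables']` from the `variables_standard_variables` query, each carrying a `variable_id` key. Whether that endpoint returns several variables when `variable_ids__in` holds several IDs is not confirmed here; if it does, the links could be fetched in one request instead of one per variable.

```python
from copy import deepcopy


def consolidate(dataset, std_vars, variable_links):
    """Attach standard variable records to each variable of a dataset.

    Hypothetical helper; argument shapes follow the responses shown above.
    """
    # Standard variable records keyed by ID.
    std_by_id = {s["standard_variable_id"]: s for s in std_vars}
    # For each variable ID, the IDs of its linked standard variables.
    # Assumption: each link entry includes a 'variable_id' key.
    links_by_var = {
        link["variable_id"]: [s["standard_variable_id"] for s in link["standard_variables"]]
        for link in variable_links
    }

    result = deepcopy(dataset)  # leave the caller's dataset untouched
    for v in result["variables"]:
        ids = links_by_var.get(v["variable_id"], [])
        # Skip IDs missing from the lookup instead of raising KeyError.
        v["standard_names"] = [std_by_id[i] for i in ids if i in std_by_id]
    return result


# Sample payloads; IDs and names are copied from the output above.
dataset = {
    "dataset_id": "00ee1157-bbff-4625-852d-e010f44679e4",
    "dataset_name": "DSSAT Simplified Input Data",
    "variables": [
        {"variable_id": "0fd06920-e19f-49ad-bdf4-b278bf246bd6",
         "variable_name": "startYear",
         "variable_metadata": {}},
    ],
}
std_vars = [
    {"standard_variable_id": "df1daca4-d727-5dc8-bfa4-fb20c717a32b",
     "standard_variable_name": "year",
     "standard_variable_uri": "http://www.geoscienceontology.org/svo/svl/property#year"},
]
variable_links = [
    {"variable_id": "0fd06920-e19f-49ad-bdf4-b278bf246bd6",
     "standard_variables": [
         {"standard_variable_id": "df1daca4-d727-5dc8-bfa4-fb20c717a32b"}]},
]

merged = consolidate(dataset, std_vars, variable_links)
print(merged["variables"][0]["standard_names"][0]["standard_variable_name"])
```

With the sample payloads this prints `year`. Working on a deep copy keeps the original `dataset` response unchanged, which differs from the loop above that mutates it in place; either choice works, and the copy is only there to make the helper side-effect free.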