mahmoud / glom

☄️ Python's nested data operator (and CLI), for all your declarative restructuring needs. Got data? Glom it! ☄️
https://glom.readthedocs.io

Global options for missing values #239

Open deeenes opened 2 years ago

deeenes commented 2 years ago

Hi,

Thank you for developing this great library!

I have a question about dealing with missing values. See below an example:

import json
from urllib import request
from glom import glom, Coalesce

url = 'https://www.ebi.ac.uk/ols/api/ontologies/efo/terms?size=200'

with request.urlopen(url) as r:
    data = json.loads(r.read())

# 1: no missing values, result is a dict of lists, each 200 long
spec = {
    'label': ('_embedded.terms', ['label']),
    'obo_id': ('_embedded.terms', ['obo_id']),
}

result = glom(data, spec)

# 2: a few values are missing in "children"; the whole result becomes a single None
spec = {
    'label': ('_embedded.terms', ['label']),
    'obo_id': ('_embedded.terms', ['obo_id']),
    'parents': ('_embedded.terms', ['_links.parents.href']),
    'children': ('_embedded.terms', ['_links.children.href']),
}

result = glom(data, spec, default = None)

# 3: the desired result: each missing value in "children" is replaced by None
spec = {
    'label': ('_embedded.terms', ['label']),
    'obo_id': ('_embedded.terms', ['obo_id']),
    'parents': ('_embedded.terms', ['_links.parents.href']),
    'children': (
        '_embedded.terms',
        [Coalesce('_links.children.href', default = None)]
    ),
}

result = glom(data, spec)

The third version is what I want: all lists in the result have the same length, no records are dropped, and None stands in for the missing values. However, this interface is inconvenient, because I would need to wrap every path in Coalesce(..., default = None). Is there a better solution, where a single parameter sets the missing-value handling globally?

kurtbrose commented 2 years ago

Sorry for the slow reply!

One general readability point: you can move the cursor down to '_embedded.terms' once, outside the dict, rather than repeating it while deriving each value:

spec = (
  '_embedded.terms',
  {
    'label': ['label'],
    'obo_id': ['obo_id'],
    'parents': ['_links.parents.href'],
    'children': [Coalesce('_links.children.href', default = None)],
  }
)

One approach you could take is to stay explicit, and save typing on Coalesce by using Or, which has the same defaulting behavior.

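from glom import Coalesce, Or

# Optional helper: wrap a path so that a missing value becomes None.
def _or_none(path):
    return Coalesce(path, default=None)

# The same idea spelled out inline with Or; for a missing path,
# Or(path, default=None) yields None just like _or_none(path).
spec = (
  '_embedded.terms',
  {
    'label': [Or('label', default=None)],
    'obo_id': [Or('obo_id', default=None)],
    'parents': [Or('_links.parents.href', default=None)],
    'children': [Or('_links.children.href', default=None)],
  }
)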

Another approach you could take is to embrace that specs are basic Python data structures, and write a helper function to do the "boring stuff".

def get_paths_in_list(path_dict, default=None):
    '''given a dict of {key: path}, returns a spec that fetches that path with a default from each child'''
    # .items() must be called; iterating the bare method raises a TypeError
    return {key: [Or(val, default=default)] for key, val in path_dict.items()}

spec = (
  '_embedded.terms',
  get_paths_in_list({
    'label': 'label',
    'obo_id': 'obo_id',
    'parents': '_links.parents.href',
    'children': '_links.children.href',
  })
)