ropensci / unconf18

http://unconf18.ropensci.org/

Toward a general-purpose API response data tidier #24

Open aedobbyn opened 6 years ago

aedobbyn commented 6 years ago

I think the goal for the unconf would be to lay the foundation for an eventual package that is meant to sit in a user’s pipeline directly after a jsonlite::fromJSON() call, in place of the initial bespoke munging of the nested list, and turn that nested response into tidy, rectangular data.

The first step here (which is the goal for the unconf -- many thanks to @jennybc for working through the initial idea with me) seems to me to be defining patterns in 1) API requests and their resulting responses and 2) the most common/successful strategies people use in their tidying process. I think it would be useful to query a few more-or-less representative RESTful APIs and note the commonalities in the solutions for tidying them. The idea would be to extract the intersection of these solutions into general-purpose verbs and also to identify where these approaches fail.

I could see this package being useful not just for one-off data grabbing and tidying jobs but also for developing packages that interface with an arbitrary API. Could of course be used on any nested list, but I think it makes sense to keep the scope of the package focused to API data.
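As a concrete (and entirely made-up) illustration of the kind of bespoke munging the package would replace -- a small API-style response and the hand-written purrr code it currently takes to rectangle it:

library(jsonlite)
library(purrr)

## a made-up, API-style response: nested, ragged, with an empty array
resp <- fromJSON('{
  "results": [
    {"id": 1, "user": {"name": "a", "tags": ["x", "y"]}},
    {"id": 2, "user": {"name": "b", "tags": []}}
  ]
}', simplifyVector = FALSE)

## today this usually means writing code like this for every new API
map_dfr(resp$results, ~ tibble::tibble(
  id     = .x$id,
  name   = .x$user$name,
  n_tags = length(.x$user$tags)
))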

For a name, I’m thinking roomba[^1] but definitely not married to it.

[^1]: Provided that’s cool with the relevant trademark attorneys :satisfied:

jimhester commented 6 years ago

The GitHub GraphQL API is one common source of these wonderfully nested lists; likely any GraphQL-based API would produce similar nested outputs.

I have found the data pretty cumbersome to work with as nested lists, so definitely think this is an area where additional tooling or a different data representation like you describe would help.
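For context, a minimal sketch (not from this thread; the query and token handling are placeholders) of how a GraphQL response ends up as one of those nested lists:

library(httr)

query <- '{ viewer { repositories(first: 2) { nodes { name } } } }'
res <- POST(
  "https://api.github.com/graphql",
  add_headers(Authorization = paste("bearer", Sys.getenv("GITHUB_PAT"))),
  body = list(query = query),
  encode = "json"
)

## content() parses the JSON body into a deeply nested R list
str(content(res), max.level = 6)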

aedobbyn commented 6 years ago

Awesome, thanks for pointing to that @jimhester!

cboettig commented 6 years ago

This sounds fantastic; I'm definitely interested in strategies for working with these deeply nested lists. @sckott, @maelle and I have recently been playing with jq in this context, e.g. https://ropensci.org/blog/2018/04/26/rectangling-onboarding/. jq is cool and all, but I think it would be great to have a more native R interface for doing these kinds of queries (analogous to dplyr instead of SQL).

I'm also curious about other query strategies for deeply nested JSON (i.e. other than jq). I've found some cool things can be done using the functions from @jeroen's jsonld package, and have also played around with using SPARQL to do graph queries on JSON files (e.g. https://ropensci.github.io/rdflib/articles/rdf_intro.html). Again, both of these approaches help do things that can be cumbersome with purrr alone, but they could use some nicer interfaces I think. SPARQL queries can be tricky to write (at least for me; I struggle with pure SQL queries too), but the nice thing is the return type is essentially always a data.frame, so no additional rectangling is needed once you know how to ask for what you want.

I've also been meaning to dive more into @hrbrmstr's excellent little book, https://github.com/hrbrmstr/drill-sergeant-rstats, on using Apache Drill from R to query large numbers of deeply nested JSON files.

Both the SPARQL and Drill approaches can optionally communicate over ODBC protocols, which should make some interesting potential for integration with dplyr / dbplyr and the great work folks at RStudio have been doing with database connections. Not really sure what is possible in that regard but would love to explore more.
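Not sure either, but the generic shape of that DBI/odbc + dbplyr integration would presumably be something like the sketch below (the DSN and table reference are invented; Drill and SPARQL endpoints each have their own wiring):

library(DBI)
library(dplyr)

con <- dbConnect(odbc::odbc(), dsn = "DrillDSN")  # hypothetical DSN
tbl(con, "responses") %>%                         # hypothetical table/view over JSON
  filter(secret_power > 4) %>%
  collect()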

jimhester commented 6 years ago

Agreed, it would be great to have something you wouldn't need to learn a new DSL for, particularly because there doesn't seem to be one consensus standard DSL used to query JSON like there is for XML (XQuery).

sckott commented 6 years ago

It's possible we could build something on top of jqr -- since it is very fast, and jeroen is working on streaming, it could be used for those large-data problems as well. I think we could build in operations like replacing empty/NULL values with NA, and handle similar nasty problems with nested data.
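For the NULL/empty part specifically, even a small recursive helper gets part of the way there; a rough sketch (the function name is made up):

## replace NULLs and zero-length elements with NA throughout a nested list
na_ify <- function(x) {
  if (is.null(x) || length(x) == 0) {
    NA
  } else if (is.list(x)) {
    lapply(x, na_ify)
  } else {
    x
  }
}

str(na_ify(list(a = NULL, b = list(c = integer(0), d = 1))))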

hrbrmstr commented 6 years ago

I like the jqr suggestion (Drill et al. is a pretty heavyweight dep), esp if we build filters / filter helpers (I'm not really a fan of the syntax b/c it messes with my own mental model of how it ought to work, but that's not really a big issue and it's a great tool).

aedobbyn commented 6 years ago

Building something on top of jqr sounds useful, rather than doing all of the munging in an R object. Even if we only get as far as sniffing out and NA-ifying NULLs and making it easier to flatten nested data in a way that usually makes sense, I think that's a win.

+1 for abstracting away potentially multiple DSLs for querying JSON.

@hrbrmstr I don't know anything about Drill but fwiw the Preface of your book is fantastic.

cboettig commented 6 years ago

Agree with all of the above! @sckott and @aedobbyn both identify the nuisance of empty values, and I agree with @hrbrmstr's observation that the syntax isn't super intuitive. Would be nice to have a more dplyr-esque DSL, e.g. something more like

jsonblob %>% filter(repo.owner.name == "hrbrmstr")

Maybe it's already obvious, but just wanted to note that I think this would provide a nice way to query R lists in general (since there's a pretty tight JSON<->list mapping); such an interface isn't specific to just data already in JSON or just REST APIs.
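Purely as a sketch of what that translation layer might look like (the helper name is invented and the path handling is naive -- string values only):

library(jqr)

## hypothetical: turn a `path == value` condition into a jq program and run it
filter_json <- function(json, path, value) {
  jq(json, sprintf('recurse(.[]?) | objects | select(.%s == "%s")', path, value))
}

## e.g. something like filter_json(jsonblob, "repo.owner.name", "hrbrmstr")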

aedobbyn commented 6 years ago

Exactly!

Originally I'd been thinking of this at the level of something that's already an R object but riddled with empty vectors etc., so the workflow would look like

jsonblob %>% 
  jsonlite::fromJSON() %>% 
  roomba::clean()

or something to that effect. But if doing the cleaning or the filtering or whatever it may be on the jsonblob itself makes more sense, then I'm all for that.

I still think a useful first step will be identifying common patterns and stumbling blocks in nested structures like API response objects, and constructing techniques for cleaning them in ways that are safe and make sense in the majority of cases.

And totally agreed @cboettig I think we can tackle a lot of nasty nested list cases for free by addressing the API data specifically. (Maybe it's also worth identifying other common list annoyances that don't manifest in API response data, though.)

cboettig commented 6 years ago

Just spitballing a bit further on this. Not sure how common this case is for others, but often there's a JSON object/blob with some pattern I understand and want to extract without having to make explicit references to just how deep it is buried. For instance, something like the example below:

{
  "stuff": [
    {
      "buried": {
        "deep": [
          {
            "goodstuff": "here",
            "name": "Bob Rudis",
            "secret_power": ...
          },
          {
            "goodstuff": "here",
            "name": "Amanda Dobbyn",
            "secret_power": ...,
            "more_nested_stuff": { ... }
          }
        ]
      }
    },
    {
      ...
    }
  ]
}

Maybe I just want the value of the names of anything found in JSON objects that also contain the key-value pair "goodstuff": "here". More generally, I'd like to just jump to that point in the "graph" and navigate from there, e.g. maybe I want to go up to the parent object and get some additional property, etc.

Maybe this can be done in jq already, but it's not intuitive to me how to express it, particularly if I want to start "walking the graph" from that point without specifying just how I descended to that node. My ideal interface would let me have some notion of variables, and let me return all matches to the requested variables as columns of a data.frame.

(Not sure if the above is at all helpful as illustration, but will bring some real examples of highly nested json blobs I struggle with for fodder for an API design)

hrbrmstr commented 6 years ago

That made me think of something fairly obscure: "GraphGrep" (https://cs.nyu.edu/shasha/papers/graphgrep/icpr2002.pdf) and a thing that kinda built on it, "Closure Trees" (https://www.cs.ucsb.edu/~dbl/papers/he_icde_2006.pdf). It does require building a graph and index, but for sufficiently large structures it might be worth it. For smaller ones the index build time might be small enough. @yonicd has been layering in some recursion on tidyverse bits, but we could also likely scaffold a query language onto rapply() if jq cannot do it.
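On the rapply() point, base R already gets part of the way there for flat "give me all the leaves of a given type" queries; a small illustration:

lst <- list(stuff = list(buried = list(deep = list(
  list(goodstuff = "here", name = "Bob Rudis"),
  list(goodstuff = "here", name = "Amanda Dobbyn")
))))

## all character leaves, with names that roughly encode the nesting path
rapply(lst, identity, classes = "character", how = "unlist")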

sckott commented 6 years ago

Just having a play, @cboettig, with your example above cleaned up:

x <- '{
"stuff": {
   "buried": {
      "deep": [
       {
        "goodstuff": "here",
        "name": "Bob Rudis",
        "secret_power": 5
       },
       {
        "goodstuff": "here",
        "name": "Amanda Dobbyn",
        "secret_power": 4, 
        "more_nested_stuff": 4
        }
      ],
      "alsodeep": 2342423234,
      "deeper": {
        "foo": [
          {
            "goodstuff": 5,
            "name": "barb"
          }
        ]
      }
    }
}}'
jqr::jq(x, 'recurse(.[]?) | objects | select(has("goodstuff"))')
[
    {
        "goodstuff": "here",
        "name": "Bob Rudis",
        "secret_power": 5
    },
    {
        "goodstuff": "here",
        "name": "Amanda Dobbyn",
        "secret_power": 4,
        "more_nested_stuff": 4
    },
    {
        "goodstuff": 5,
        "name": "barb"
    }
]

The jq command 'recurse(.[]?) | objects | select(has("goodstuff"))' is clearly not something we'd want an end user to have to deal with, but it could be hidden under some nicer syntax 😼
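For instance (purely a sketch; the function name is invented):

library(jqr)

## hypothetical wrapper that hides the recurse/select incantation
roomba_find <- function(json, key) {
  jq(json, sprintf('recurse(.[]?) | objects | select(has("%s"))', key))
}

roomba_find(x, "goodstuff")  # same result as the raw jq call above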

cboettig commented 6 years ago

Thanks @sckott , that's awesome ✨ . And thanks for getting us to a reproducible example.

Minor question first: that matches all blobs with the property goodstuff; it would be nice to match conditionally, so we only get the blob if "goodstuff": "here" (i.e. barb should not be included).

One thing that I struggle with in jq is that it's not clear how to walk up and down the graph -- imagine I want the number from the alsodeep field as an id for everything in that blob. And maybe I also want to descend deeper from the goodstuff and pull out some other element, like more_nested_stuff.b (if more_nested_stuff actually had more nesting in it, as in my modified toy example).

The syntax is super awkward, but it's possible to do this kind of conditional filtering and walking with SPARQL variables. The query:

q <- '
PREFIX x: <x:>
SELECT ?name ?power ?b ?id
WHERE { 
  ?object x:goodstuff "here" .
  ?object x:name ?name .
  OPTIONAL { ?object x:secret_power      ?power } .
  OPTIONAL { ?object x:more_nested_stuff ?y .
             ?y      x:b                 ?b} .
  OPTIONAL {
             ?parent x:deep ?object .
             ?parent x:alsodeep ?id
  }
}'
rdf_query(blob, q)

Returns a nice compact tibble containing only the columns named in SELECT. (Note that anything preceded by a ? is a variable, and we can call it whatever we want. Variables can be used to indicate the object (?object, ?parent), the key (not shown), or the value (?power, ?id, ?y).)

# A tibble: 3 x 4
  name          power     b          id
  <chr>         <int> <int>       <dbl>
1 scott            NA    NA         NA 
2 Amanda Dobbyn     4     2 2342423234.
3 Bob Rudis         5    NA 2342423234.

Note that barb is absent (because she doesn't have "goodstuff": "here") and scott is included. Dropping the OPTIONAL bits would cause rows missing those elements to be omitted.

I know the SPARQL syntax is basically 💩 (particularly with the ugly prefix stuff), but it is perhaps amenable to being broken down into a dplyr-like syntax, much the way dplyr already does for SQL? I'm not sure that we really want to use rdflib, but I am intrigued by the ability to walk the graph with variables and to select and filter with SQL-like statements...
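(For completeness, a rough sketch -- with invented details -- of one way the blob could be loaded for rdf_query(): bolt a minimal JSON-LD @context onto the JSON so bare keys expand under the x: prefix used in the query, then parse it as JSON-LD. This is an assumption about the setup, not necessarily how the real example was built.)

library(rdflib)

## assumes `x` (the JSON string from the jqr example above) and `q` (the query above)
doc <- paste0('{"@context": {"@vocab": "x:"},', substring(x, 2))
tmp <- tempfile(fileext = ".jsonld")
writeLines(doc, tmp)

blob <- rdf_parse(tmp, format = "jsonld")
rdf_query(blob, q)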

cboettig commented 6 years ago

(p.s. meant to include reproducible code link as gist)

jimhester commented 6 years ago

Alternatively we could just use R to walk the list. In tidyversedashboard I wrote a simple function to do a depth-first search of a nested list, returning the index of the first location that matches a predicate function. Extending this to return the indexes of all the matches rather than only the first would solve this problem (a sketch of that extension follows the example); here is the current implementation, which returns only the first match.

dfs_idx <- function(x, f) {
  res <- integer()
  walk <- function(x, depth) {
    for (i in seq_along(x)) {
      # does this element satisfy the predicate? (errors count as no match)
      if (isTRUE(tryCatch(f(x[[i]]), error = function(e) FALSE))) {
        res[[depth]] <<- i
        return(TRUE)
      } else {
        # otherwise recurse into list elements, recording the path on the way back up
        if (is.list(x[[i]]) && isTRUE(walk(x[[i]], depth + 1))) {
          res[[depth]] <<- i
          return(TRUE)
        }
      }
    }
  }
  walk(x, 1)
  res
}

x <- jsonlite::fromJSON('{
"stuff": {
   "buried": {
      "deep": [
       {
        "goodstuff": "here",
        "name": "Bob Rudis",
        "secret_power": 5
       },
       {
        "goodstuff": "here",
        "name": "Amanda Dobbyn",
        "secret_power": 4, 
        "more_nested_stuff": 4
        }
      ],
      "alsodeep": 2342423234,
      "deeper": {
        "foo": [
          {
            "goodstuff": 5,
            "name": "barb"
          }
        ]
      }
    }
  }
}', simplifyVector = FALSE)

x[[dfs_idx(x, function(x) { x$goodstuff == "here" })]]$name
#> [1] "Bob Rudis"

Created on 2018-05-19 by the reprex package (v0.2.0).
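A rough sketch of that extension -- returning the full index path of every match rather than just the first -- might look like this (dfs_idx_all() is a hypothetical name):

dfs_idx_all <- function(x, f) {
  out <- list()
  walk <- function(x, path) {
    for (i in seq_along(x)) {
      # record the full index path of every element that satisfies the predicate
      if (isTRUE(tryCatch(f(x[[i]]), error = function(e) FALSE))) {
        out[[length(out) + 1]] <<- c(path, i)
      }
      if (is.list(x[[i]])) {
        walk(x[[i]], c(path, i))
      }
    }
  }
  walk(x, integer())
  out
}

hits <- dfs_idx_all(x, function(x) { x$goodstuff == "here" })
vapply(hits, function(idx) x[[idx]]$name, character(1))
#> [1] "Bob Rudis"     "Amanda Dobbyn"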

cboettig commented 6 years ago

@jimhester 🎉 yup, a native R solution sounds much more sensible; I really like this example. A library of such functions sounds like it could be super useful. Obviously we have purrr, but it almost seems surprising that we don't have any built-in recursive tree/search algorithms like this, isn't it? Would it make sense to have C implementations of these? (I never really learned how to do recursive searches properly and hate writing my own versions of these; I usually try to just steal from stuff @jimhester already has in xml2, e.g. as_list.)

I think this would still leave open the issue of an expressive syntax to go from ugly list to nice data.frame. In particular, one thing I still struggle with is being able to assemble data from different 'depths' in the tree, e.g. go back up from where the good stuff is to get an identifier, e.g.

obj <- x[[dfs_idx(x, function(x) { x$goodstuff == "here" })]]

## works:
obj$name 
obj$secret_power

## but what if we could also 'bump up the stack'
##
obj$$alsodeep

(imagine $$ is to .. as $ is to .) Or for another metaphor, like the way we can refer to a parent node as well as a child node in XPATH, or the graph-based query in SPARQL (or maybe graphql etc).

jimhester commented 6 years ago

So one nice thing is that the index is actually a vector of indices, so to access the parents you just remove the last one from the index, e.g.

idxs <- dfs_idx(x, function(x) { x$goodstuff == "here" })
idxs
#> [1] 1 1 1 1
parent_idx <- head(idxs, n = -1)  # drop the deepest index to step up to the parent

x[[parent_idx]]

I agree it would be nice to have these functions in purrr, ideally with a C implementation. If we do want to go down this route we can discuss a nice API / helper functions to make everything smoother; I think in combination with purrr::map_dfr() and friends we could make something pretty nice that would reuse a lot of the existing idioms we already have.
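As a very rough sketch of that combination (reusing the hypothetical dfs_idx_all() from above and the nested x from the earlier reprex):

library(purrr)

hits <- dfs_idx_all(x, function(x) { x$goodstuff == "here" })
map_dfr(hits, function(idx) {
  obj <- x[[idx]]
  tibble::tibble(
    name         = obj$name,
    secret_power = obj$secret_power,
    ## step two levels up the index path to pull an id off the parent blob
    id           = x[[head(idx, n = -2)]]$alsodeep
  )
})
#> a 2-row tibble: Bob Rudis and Amanda Dobbyn, each with id 2342423234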

cboettig commented 6 years ago

@jimhester I had missed that; that is indeed very nice! It would be fantastic to get some wider input while we're together and hash out a nice API based around familiar purrr or dplyr idioms.

Could you see such a C implementation getting onto the dev roadmap of the tidyverse team? Guessing that part would be beyond the scope of what we could get done in Seattle...

jimhester commented 6 years ago

We could do the C implementation at the unconf, but I think our time would be better spent on generating more good problem cases (like you already did) and hashing out the API.

But I know Jenny and I care about better ways of dealing with these ragged arrays, so I am sure once we have the API we will be able to work on an efficient implementation.

stefaniebutland commented 6 years ago

Unconf project repo: https://github.com/ropenscilabs/roomba