aedobbyn opened 6 years ago
The GitHub GraphQL API is one common source of these wonderfully nested lists; likely any GraphQL-based API would produce similarly nested output.
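For instance, a minimal sketch of what I mean (the specific query here is just an illustration; it assumes a token in the `GITHUB_PAT` environment variable):

```r
library(httr)

# Hypothetical GraphQL query against the GitHub API
res <- POST(
  "https://api.github.com/graphql",
  add_headers(Authorization = paste("bearer", Sys.getenv("GITHUB_PAT"))),
  body = list(query = '{ viewer { repositories(first: 3) { nodes { name } } } }'),
  encode = "json"
)

# The parsed response is a deeply nested list:
# data -> viewer -> repositories -> nodes -> [[i]] -> name
str(content(res, as = "parsed"), max.level = 5)
```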
I have found the data pretty cumbersome to work with as nested lists, so I definitely think this is an area where additional tooling, or a different data representation like you describe, would help.
Awesome thanks for pointing to that @jimhester!
This sounds fantastic, I'm definitely interested in strategies for working with these deeply nested lists. @sckott, @maelle and I have recently been playing with `jq` in this context; e.g. https://ropensci.org/blog/2018/04/26/rectangling-onboarding/. `jq` is cool and all, but I think it would be great to have a more native R interface for doing these kinds of queries (analogous to `dplyr` instead of SQL).
I'm also curious about other query strategies for deeply nested JSON (i.e. other than `jq`). I've found some cool things can be done using the functions from @jeroen's `jsonld` package, and have also played around with using SPARQL to do graph queries on JSON files (e.g. https://ropensci.github.io/rdflib/articles/rdf_intro.html). Again, both of these approaches help do things that can be cumbersome with `purrr` alone, but could use some nicer interfaces, I think. SPARQL queries can be tricky to write (at least for me; I struggle with pure SQL queries too), but the nice thing is that the return type is essentially always a `data.frame`, so no additional rectangling is needed once you know how to ask for what you want.
I've also been meaning to dive more into @hrbrmstr's excellent little book, https://github.com/hrbrmstr/drill-sergeant-rstats, on using Apache Drill from R to query large numbers of deeply nested JSON files.
Both the SPARQL and Drill approaches can optionally communicate over ODBC protocols, which should create some interesting potential for integration with `dplyr`/`dbplyr` and the great work folks at RStudio have been doing with database connections. Not really sure what is possible in that regard, but I would love to explore more.
Agreed, it would be great to have something you wouldn't need to learn a new DSL for, particularly because there doesn't seem to be one consensus standard DSL used to query JSON like there is for XML (XQuery).
It's possible we could build something on top of `jqr`, since it is very fast, and Jeroen is working on streaming, so it could be used for those large-data problems as well. I think we could build in operations like replacing empty/`NULL` values with `NA`, and handle similar nasty problems with nested data.
I like the `jqr` suggestion (Drill et al. is a pretty heavyweight dependency), especially if we build filters / filter helpers (I'm not really a fan of the syntax, because it messes with my own mental model of how it ought to work, but that's not really a big issue and it's a great tool).
Building something on top of `jqr` sounds useful, rather than doing all of the munging in an R object. Even if we only get as far as sniffing out and `NA`-ifying `NULL`s and making it easier to flatten nested data in a way that usually makes sense, I think that's a win.
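E.g., a bare-bones sketch of what that `NA`-ification might look like (`na_ify` is a made-up name, just to illustrate the idea):

```r
# Recursively replace NULLs and empty vectors in a nested list with NA,
# so the result can be flattened/unnested more predictably
na_ify <- function(x) {
  if (is.list(x)) {
    lapply(x, na_ify)
  } else if (is.null(x) || length(x) == 0) {
    NA
  } else {
    x
  }
}

str(na_ify(list(a = 1, b = NULL, c = list(d = character(0)))))
#> List of 3
#>  $ a: num 1
#>  $ b: logi NA
#>  $ c:List of 1
#>   ..$ d: logi NA
```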
+1 for abstracting away potentially multiple DSLs for querying JSON.
@hrbrmstr I don't know anything about Drill but fwiw the Preface of your book is fantastic.
Agree with all of the above! @sckott and @aedobbyn both identify the nuisance of empty values, and I agree with @hrbrmstr's observation that the syntax isn't super intuitive. It would be nice to have a more `dplyr`-esque DSL, e.g. something more like

```r
jsonblob %>% filter(repo.owner.name == "hrbrmstr")
```
Maybe it's already obvious, but just wanted to note that I think this would provide a nice way to query R lists in general (since there's a pretty tight JSON<->list mapping); such an interface isn't specific to just data already in JSON or just REST APIs.
Exactly!
Originally I'd been thinking of this at the level of something that's already an R object but riddled with empty vectors etc., so the workflow would look like

```r
jsonblob %>%
  jsonlite::fromJSON() %>%
  roomba::clean()
```

or something to that effect. But if doing the `clean`ing or the `filter`ing or whatever it may be on the `jsonblob` itself makes more sense, then I'm all for that.
I still think a useful first step will be identifying common patterns and stumbling blocks in nested structures like API response objects, and constructing techniques for cleaning them in ways that are safe and make sense in the majority of cases.
And totally agreed @cboettig, I think we can tackle a lot of nasty nested-list cases for free by addressing the API data specifically. (Maybe it's also worth identifying other common list annoyances that don't manifest in API response data, though.)
Just spitballing a bit further on this. Not sure how common this case is for others, but often there's a JSON object/blob with some pattern I understand and want to extract without having to make explicit references to just how deep it is buried. For instance, something like the example below:
```
{
  "stuff": [
    {
      "buried": [
        "deep": [
          {
            "goodstuff": "here",
            "name": "Bob Rudis",
            "secret_power": ...,
          },
          {
            "goodstuff": "here",
            "name": "Amanda Dobbyn",
            "secret_power": ...,
            "more_nested_stuff": { ... }
          },
        ]
      ]
    },
    {
      ...
    }
  ]
}
```
Maybe I just want the value of the `name`s of anything found in JSON objects that also contain the key-value pair `"goodstuff": "here"`. More generally, I'd like to just jump to that point in the "graph" and navigate from there, e.g. maybe I want to go up to the parent object and get some additional property, etc.

Maybe this can be done in `jq` already, but it's not intuitive to me how to express it (particularly if I want to start "walking the graph" from that point without specifying just how I descended down to that node). My ideal interface would let me have some notion of variables, and let me return all matches to the requested variables in columns of a `data.frame`.
(Not sure if the above is at all helpful as illustration, but I will bring some real examples of highly nested JSON blobs I struggle with as fodder for an API design.)
That made me think of something fairly obscure: "GraphGrep" (https://cs.nyu.edu/shasha/papers/graphgrep/icpr2002.pdf) and a thing that kinda built on it, "Closure Trees" (https://www.cs.ucsb.edu/~dbl/papers/he_icde_2006.pdf). It does require building a graph and an index, but for sufficiently large structures it might be worth it; for smaller ones the index build time might be small enough. @yonicd has been layering in some recursion on tidyverse bits, but we could also likely scaffold a query language on to `rapply()` if `jq` cannot do it.
just having a play @cboettig with your eg above cleaned up:
```r
x <- '{
  "stuff": {
    "buried": {
      "deep": [
        {
          "goodstuff": "here",
          "name": "Bob Rudis",
          "secret_power": 5
        },
        {
          "goodstuff": "here",
          "name": "Amanda Dobbyn",
          "secret_power": 4,
          "more_nested_stuff": 4
        }
      ],
      "alsodeep": 2342423234,
      "deeper": {
        "foo": [
          {
            "goodstuff": 5,
            "name": "barb"
          }
        ]
      }
    }
  }
}'
```
```r
jqr::jq(x, 'recurse(.[]?) | objects | select(has("goodstuff"))')
```

```json
[
  {
    "goodstuff": "here",
    "name": "Bob Rudis",
    "secret_power": 5
  },
  {
    "goodstuff": "here",
    "name": "Amanda Dobbyn",
    "secret_power": 4,
    "more_nested_stuff": 4
  },
  {
    "goodstuff": 5,
    "name": "barb"
  }
]
```
The jq command `recurse(.[]?) | objects | select(has("goodstuff"))` is clearly not something we'd want an end user to have to deal with, but it could be hidden under some nicer syntax 😼
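e.g. a hypothetical wrapper (made-up name, just a sketch) along the lines of:

```r
# Hypothetical helper: find all objects anywhere in the document that
# have a given key, without the user writing any jq themselves
find_objects_with <- function(json, key) {
  jqr::jq(json, sprintf('recurse(.[]?) | objects | select(has("%s"))', key))
}

find_objects_with(x, "goodstuff")
```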
Thanks @sckott , that's awesome ✨ . And thanks for getting us to a reproducible example.
Minor question first: that matches all blobs with the property `goodstuff`; it would be nice to match conditionally, so we only get the blob if `"goodstuff": "here"` (i.e. `barb` should not be included).
One thing that I struggle with in jq is that it's not clear how to walk up and down the graph, though -- imagine I want the number from the `alsodeep` field as an `id` for everything in that blob? And maybe I also want to descend deeper from the `goodstuff` and pull out some other element, like `more_nested_stuff.b` (if `more_nested_stuff` actually had more nesting in it, as in my modified toy example).
The syntax is super awkward, but it's possible to do this kind of conditional filtering and walking with SPARQL variables. The query:
```r
q <- '
PREFIX x: <x:>
SELECT ?name ?power ?b ?id
WHERE {
  ?object x:goodstuff "here" .
  ?object x:name ?name .
  OPTIONAL { ?object x:secret_power ?power } .
  OPTIONAL { ?object x:more_nested_stuff ?y .
             ?y x:b ?b } .
  OPTIONAL {
    ?parent x:deep ?object .
    ?parent x:alsodeep ?id
  }
}'

rdf_query(blob, q)
```
This returns a nice compact `tibble` containing only the columns named in `SELECT`. (Note that anything preceded by a `?` is a variable, and we can call it whatever we want. Variables can be used to indicate the object (`?object`, `?parent`), the key (not shown), or the value (`?power`, `?id`, `?y`).)
```
# A tibble: 3 x 4
  name           power     b          id
  <chr>          <int> <int>       <dbl>
1 scott             NA    NA          NA
2 Amanda Dobbyn      4     2 2342423234.
3 Bob Rudis          5    NA 2342423234.
```
Note that `barb` is absent (because she doesn't have `"goodstuff": "here"`), and `scott` is included. Dropping the `OPTIONAL` bits would cause rows missing those elements to be omitted.
I know the SPARQL syntax is basically 💩 (particularly the ugly prefix stuff), but it is perhaps amenable to being broken down into a `dplyr`-like syntax, much the way `dplyr` already does for SQL? I'm not sure that we really want to use `rdflib`, but I am intrigued by the ability to walk the graph with variables and to select and filter with SQL-like statements...
Alternatively, we could just use R to walk the list. In tidyversedashboard I wrote a simple function to do a depth-first search of a nested list, returning the index of the first location that matches a predicate function. Extending this to return the indexes of all the matches rather than only the first would solve this problem (a sketch of that extension follows the reprex below); here is the example with the current implementation, which returns only the first match.
```r
# Depth-first search of a nested list; returns the index path (one index
# per level of nesting) of the first element for which the predicate `f`
# returns TRUE. Errors in the predicate are treated as non-matches.
dfs_idx <- function(x, f) {
  res <- integer()
  walk <- function(x, depth) {
    for (i in seq_along(x)) {
      if (isTRUE(tryCatch(f(x[[i]]), error = function(e) FALSE))) {
        res[[depth]] <<- i
        return(TRUE)
      } else {
        if (is.list(x[[i]]) && isTRUE(walk(x[[i]], depth + 1))) {
          res[[depth]] <<- i
          return(TRUE)
        }
      }
    }
  }
  walk(x, 1)
  res
}
```
```r
x <- jsonlite::fromJSON('{
  "stuff": {
    "buried": {
      "deep": [
        {
          "goodstuff": "here",
          "name": "Bob Rudis",
          "secret_power": 5
        },
        {
          "goodstuff": "here",
          "name": "Amanda Dobbyn",
          "secret_power": 4,
          "more_nested_stuff": 4
        }
      ],
      "alsodeep": 2342423234,
      "deeper": {
        "foo": [
          {
            "goodstuff": 5,
            "name": "barb"
          }
        ]
      }
    }
  }
}', simplifyVector = FALSE)

x[[dfs_idx(x, function(x) { x$goodstuff == "here" })]]$name
#> [1] "Bob Rudis"
```
Created on 2018-05-19 by the reprex package (v0.2.0).
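And a quick sketch (untested beyond this example) of the all-matches variant mentioned above:

```r
# Hypothetical extension of dfs_idx() that collects the index paths of
# *all* elements matching the predicate, not just the first
dfs_idx_all <- function(x, f) {
  res <- list()
  walk <- function(x, path) {
    for (i in seq_along(x)) {
      if (isTRUE(tryCatch(f(x[[i]]), error = function(e) FALSE))) {
        res[[length(res) + 1]] <<- c(path, i)
      } else if (is.list(x[[i]])) {
        walk(x[[i]], c(path, i))
      }
    }
  }
  walk(x, integer())
  res
}

vapply(
  dfs_idx_all(x, function(x) { x$goodstuff == "here" }),
  function(idx) x[[idx]]$name,
  character(1)
)
#> [1] "Bob Rudis"     "Amanda Dobbyn"
```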
@jimhester 🎉 yup, a native R solution sounds much more sensible; I really like this example. A library of such functions sounds like it could be super useful. Obviously we have `purrr`, but it almost seems surprising that we don't have any built-in recursive tree/search algorithms like this, doesn't it? Would it make sense to have C implementations of these? (I never really learned how to do recursive searches properly and hate writing my own versions; I usually just try to steal from stuff @jimhester already has in `xml2`, e.g. `as_list`.)
I think this would still leave open the issue of an expressive syntax to go from ugly list to nice `data.frame`. In particular, one thing I still struggle with is being able to assemble data from different 'depths' in the tree, e.g. going back up from where the good stuff is to get an identifier:
```r
obj <- x[[dfs_idx(x, function(x) { x$goodstuff == "here" })]]

## works:
obj$name
obj$secret_power

## but what if we could also 'bump up the stack':
obj$$alsodeep
```
(Imagine `$$` is to `..` as `$` is to `.`.) Or, for another metaphor, like the way we can refer to a parent node as well as a child node in XPath, or the graph-based query in SPARQL (or maybe GraphQL, etc.).
So one nice thing is that the index is actually a vector of indices, so to access the parent you just remove the last one from the index, e.g.

```r
idxs <- dfs_idx(x, function(x) { x$goodstuff == "here" })
idxs
#> [1] 1 1 1 1

parent_idx <- head(idxs, n = -1)  # drop the last index to go up one level
x[[parent_idx]]
```
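And for @cboettig's `alsodeep` question, the object holding that field sits two levels up from each match, so (sketch, reusing `idxs` from above):

```r
# `alsodeep` is a sibling of `deep`, so go up two levels from the match
x[[head(idxs, n = -2)]]$alsodeep
#> [1] 2342423234
```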
I agree it would be nice to have these functions in purrr, ideally with a C implementation. If we do want to go down this route, we can discuss a nice API / helper functions to make everything smoother; I think in combination with `purrr::map_dfr()` and friends we could make something pretty nice that would reuse a lot of the existing idioms we already have.
@jimhester I had missed that; that is indeed very nice! It would be good to get some wider input while we're together, and hashing out a nice API based around familiar `purrr` or `dplyr` idioms would be fantastic.

Could you see such a C implementation getting onto the dev roadmap of the tidyverse team? Guessing that part would be beyond the scope of what we could get done in Seattle...
We could do the C implementation at the unconf, but I think our time would be better spent on generating more good problem cases (like you already did) and hashing out the API.

But I know Jenny and I care about better ways of dealing with these ragged arrays, so I am sure once we have the API we will be able to work on an efficient implementation.
Unconf project repo: https://github.com/ropenscilabs/roomba
I think the goal for the unconf would be to lay the foundation for an eventual package that is meant to sit in a user's pipeline directly after a `jsonlite::fromJSON()` call, in place of initial bespoke munging of the nested list. The package would:

- Replace empty values at each level of the list (i.e., convert empty elements and `NULL`s to `NA`s or a user-defined value) so the tibble can be easily unnested

The first step here (which is the goal for the unconf -- many thanks to @jennybc for working through the initial idea with me) seems to me to be defining patterns in 1) API requests and their resulting responses and 2) the most common/successful strategies people use in their tidying process. I think it would be useful to query a few more-or-less representative RESTful APIs and note the commonalities in the solutions for tidying them. The idea would be to extract the intersection of these solutions into general-purpose verbs, and also to identify where these approaches fail.
I could see this package being useful not just for one-off data-grabbing and tidying jobs, but also for developing packages that interface with an arbitrary API. It could of course be used on any nested list, but I think it makes sense to keep the scope of the package focused on API data.
For a name, I'm thinking `roomba`[^1], but I'm definitely not married to it.

[^1]: Provided that's cool with the relevant trademark attorneys :satisfied: