ropensci / unconf17

Website for 2017 rOpenSci Unconf
http://unconf17.ropensci.org
64 stars 12 forks source link

package (or expanded functionality to tidyjson?) for creating nested data frame from JSON data #81

Open sctyner opened 7 years ago

sctyner commented 7 years ago

I've recently become a fan of the purrr package, so much so that when I was working with a bunch of JSON data at a hackathon this past weekend, I kept wishing that I could use purrr and tidyr functionality to just magically turn all the JSON data into a tidy nested data frame.

I used the function jsonlite::fromJSON() to convert the JSON data to a data frame, but it did not have the nice tidyverse structure I've come to know and love.

Is there a way to get nice nested tidy data frames from JSON? I've looked into tidyjson a bit but it's not exactly what I would imagine if I had to pick the best way to get JSON data in to tidy, nested data frames. (I'm not sure if it does nesting TBH; I haven't looked into it that much.) This seems like something that would be extremely helpful for Shiny App developers who depend on API data, or anyone using JSON data in R with the tidyverse packages.

jennybc commented 7 years ago

@hadley and I talk about this regularly. It is a real need! I have some ideas about verbs for such a package. In some sense jsonlite::flatten() does this, but then you end up needing more control. However, you don't want to do every last thing by hand, either. I definitely know what you mean!

To be more of a Debbie Downer, I'm not sure how amenable this project is to the Unconf format. I think it requires development of a very principled worldview and then a very disciplined implementation. It feels really hard to do in this setting.

cboettig commented 7 years ago

@sctyner I think this is a great idea. While I agree that actually implementing this in a robust package might be better saved for later, I'd love to see a unconf discussion group synthesize some of the knowledge on how best to go about this (and how not to!) Almost every time I run into this issue I think "if only I could mind-meld with Jenny right now..."

One thing I've been playing with recently in this area: json-ld provides a method called "Flatten", which is often coupled to a second method called "Frame" which basically lets you write a little JSON template which specifies what data you want and how you want it structured. Note that JSON-LD methods can be applied to any JSON file, and yes @jeroen has an awesome little jsonld package providing convenient access to these methods. Probably not the right hammer for every task, but one might imagine a package implementing a similar notion of flatten & frame if not directly calling these existing methods.

jeroen commented 7 years ago

What sort of data json data structures do you encounter for which a logical mapping into a data frame would exist, but jsonlite::toJSON(..., flatten = TRUE) isn't well suited?

jennybc commented 7 years ago

@jeroen I will try to note down specific examples as I re-encounter. But in the abstract, here are some of the reasons I set flatten = FALSE in fromJSON():

Basically it feels like there are a lot of actions that must be part of flatten(). And I want access to them myself.

I also have a more general discomfort that the structure of the thing I get back depends on the data. I'm happy to use flatten = TRUE when first getting to know an API but then I gradually creep towards doing the list processing myself.

sctyner commented 7 years ago

@jennybc That's exactly it. I wish I had said it as elegantly! 😄

cboettig commented 7 years ago

Here's a toy example of some JSON that doesn't flatten to a data.frame. Very nested structures are usually the source of issues for me, even with purrr, because often I want to pull data found at various different levels of nesting into a single row for the data.frame I care about.

Below I also give an example of how one might address this using the idea of a frame.

library("jsonlite")
library("jsonld")
library("magrittr")

json <-'{
      "@id": "http://example.org/library",
      "@type": "ex:Library",
      "ex:contains": {
        "@id": "http://example.org/library/the-republic",
        "@type": "ex:Book",
        "ex:contains": {
          "@id": "http://example.org/library/the-republic#introduction",
          "@type": "ex:Chapter",
          "dc:description": "An introductory chapter on The Republic.",
          "dc:title": "The Introduction"
        },
        "dc:creator": "Plato",
        "dc:title": "The Republic"
      }
    }
  '
df <-fromJSON(json, flatten = TRUE)

## not a df
class(df)

Note that df is still a (rather cumbersome!) list. Like @jennybc says, this is particularly annoying because the type/structure is unpredictable (depends on how much a nesting a given element might have), so hard to program around, so we usually wind not flattening the data (but having to iterate over some often ugly nesting).

framing solution

Let's imagine I just want to pull out book titles from the middle of that nested structure. Here's a frame for that:

frame <-
'{
  "@explicit": "true",
  "@type": "ex:Book",
  "dc:title": {}
}'
jsonld_frame(json, frame) %>% fromJSON()

gives us:

                                      @id   @type     dc:title
1 http://example.org/library/the-republic ex:Book The Republic

How about a data frame with the title and creator for all objects, regardless of nesting depth:

frame <-
'{
  "@explicit": "true",
  "@id": {},
  "dc:title": {"@default": "NA"},
  "dc:creator": {"@default": "NA"}
}'

jsonld_frame(json, frame) %>% fromJSON()
                                                 @id      @type dc:creator         dc:title
1                           http://example.org/library ex:Library         NA               NA
2              http://example.org/library/the-republic    ex:Book      Plato     The Republic
3 http://example.org/library/the-republic#introduction ex:Chapter         NA The Introduction

As an aside, I think this is potentially relevant to the schema issue @stephlocke mentions in https://github.com/ropensci/unconf17/issues/85#issuecomment-302045099 . Framing was designed explicitly for the problem Steph mentions, where you want your tool to get a consistent data object back even if the data provider is doing things like adding more data.

stephlocke commented 7 years ago

Thanks @cboettig, I'll definitely see if I can get this schema framing going in the context of some of my API packages.

cboettig commented 7 years ago

@stephlocke cool, I'd be curious to hear how it goes. In case it's helpful, here's official docs on writing a frame. the json-ld playground is also useful for testing & sharing frames.

Maybe it would just be re-inventing the wheel, but I could also see a tidyjson package merely borrowing the framing notion as as a way of saying "give me these elements, where-ever you find them, and give them back to me in this structure". (Just realizing in my examples above, they all request an unnested structure, and then piping that through fromJSON is resulting in simple data.frames, but as you'll see you could say "I want a list nested just like so" instead.)

sctyner commented 7 years ago

We are over by the stage at a small round table. There's three of us: Katie, Kelly, and myself!

sfirke commented 7 years ago

I am planning to try to revive the SurveyMonkey -> R API package (RMonkey, which broke when SurveyMonkey changed their API and is not currently maintained) in the near future and would love to use whatever you come up with to parse the (very nested) JSON if at all applicable.

stefaniebutland commented 7 years ago

Repo: https://github.com/ropenscilabs/tidyerjson