jeroen / jsonlite

A Robust, High Performance JSON Parser and Generator for R
http://arxiv.org/abs/1403.2805
Other
376 stars 40 forks source link

simplifyDataFrame can produce columns that don't line up with the rest of the data.frame #225

Open mattpollock opened 6 years ago

mattpollock commented 6 years ago

I need to parse json that has the following form:

[
  {
    "key": "item 1",
    "points": [
      {
        "nested_key": "item 1,1",
        "nested_value": 11,
        "nested_string": "point 1 of item 1"
      },
      {
        "nested_key": "item 1,2",
        "nested_value": 12,
        "nested_string": "point 2 of item 1"
      },
      {
        "nested_key": "item 1,3",
        "nested_value": 13,
        "nested_string": "point 3 of item 1"
      }
      ],
    "attribute_counts": {
      "color": {"blue": 57, "red": 3012},
      "size": {"12": 2354, "15": 103},
      "name": {"Linda": 234, "Julie": 1}
    }
  },
  {
    "key": "item 2",
    "points": [
      {
        "nested_key": "item 2,1",
        "nested_value": 21,
        "nested_string": "point 1 of item 2"
      },
      {
        "nested_key": "item 2,2",
        "nested_value": 22,
        "nested_string": "point 2 of item 2"
      },
      {
        "nested_key": "item 2,3",
        "nested_value": 23,
        "nested_string": "point 3 of item 2"
      },
      {
        "nested_key": "item 2,4",
        "nested_value": 24,
        "nested_string": "point 4 of item 2"
      }
      ],
    "attribute_counts": {
      "color": {"blue": 123, "red": 232},
      "size": {"12": 623, "15": 143},
      "name": {"Bob": 234, "Sam": 1}
    }
  }
]

The problem is with the attribute_counts field. It seems that since this is an object, but not an array of objects (like points is, which fromJSON handles with ease) it does not get lined up according to row (i.e., in the data.frame row corresponding to the top level object it comes from) but rather the attribute counts from all top level fields are rbinded together in a way that does not seem intentional.

All is well if I use simplifyVector=FALSE (except that this data would be much easier to work with as a data.frame - but from a "correctness" perspective)

lst <- fromJSON("path/to/pasted/json/above", simplifyVector = FALSE)
length(lst)
# [1] 2
lapply(lst, length)
# [[1]]
# [1] 3
# 
# [[2]]
# [1] 3

The transpose of this also looks good

tlst <- jsonlite:::transpose_list(lst, names(lst[[1]]))
length(tlst)
# [1] 3
lapply(tlst, length)
# [[1]]
# [1] 2
# 
# [[2]]
# [1] 2
# 
# [[3]]
# [1] 2

But things go wrong when I use simplifyVector=TRUE

df <- fromJSON("path/to/pasted/json/above", simplifyVector = TRUE)
colnames(df)
# [1] "key"              "points"           "attribute_counts"

lapply(df, length)
# $key
# [1] 2
# 
# $points
# [1] 2
# 
# $attribute_counts
# [1] 3                        # <- yikes!

lapply(df, class)
# $key
# [1] "character"
# 
# $points
# [1] "list"
# 
# $attribute_counts
# [1] "data.frame"

tibble::as.tibble(df)
# Error: Column `attribute_counts` must be a 1d atomic vector or a list

The attribute_counts for both items 1 and 2 are lumped together in their own data.frame. The column breaks what a data.frame is since the column lengths do not match.

Isolating the attribute_counts data.frame does not help either because each field contains a data.frame instead of a list of nested data.frames. Base data.frame will print it in an unhelpful way, but tibble::as.tibble and data.table::as.data.table won't work with it.

df$attribute_counts
#   color.blue color.red size.12 size.15 name.Linda name.Julie name.Bob name.Sam
# 1         57      3012    2354     103        234          1       NA       NA
# 2        123       232     623     143         NA         NA      234        1

lapply(df$attribute_counts, class)
# $color
# [1] "data.frame"
# 
# $size
# [1] "data.frame"
# 
# $name
# [1] "data.frame"

tibble::as.tibble(df$attribute_counts)
# Error: Columns `color`, `size`, `name` must be 1d atomic vectors or lists

data.table::as.data.table(df$attribute_counts)
# Error in FUN(X[[i]], ...) : 
#   Invalid column: it has dimensions. Can't format it. If it's the result of data.table(table()), use as.data.table(table()) instead.

I'm not sure what the correct behavior ought to be, but either parsing the attribute_counts for each item into a 1-row data.frame or leaving it as a nested list seem like fine options to me.

Thanks

jeroen commented 6 years ago

This is expected. You cannot convert attribute_counts to a tibble because it has nested data. There is no obvious way to flatten it into a rectangle.

One thing you can do is convert the nested fields:

out <- jsonlite::fromJSON(json)
tibble::as.tibble(out$attribute_counts$color)
tibble::as.tibble(out$attribute_counts$size)
tibble::as.tibble(out$attribute_counts$name)

But for such complex data you may be better of using simplifyVector=FALSE and manually writing code to convert the data into the required structure.

mattpollock commented 6 years ago

Thanks for the quick response. Unfortunately I'm after a more general solution (I have lots of different data types that combine nested arrays and non-array objects at the top level which I need to work with).

I see your point about not being able to coerce everything into a rectangle, but given the relationship between data.frame rows and top level json objects, it seems to me that leaving attribute_counts (and similar structures) as list columns would be more usable than the current return value. Thus, df$attribute_counts would be a list of length nrow(df) and df$attribute_counts[[1]] would be a list (not a data.frame) containing the attribute counts from the original json corresponding to item 1.

Perhaps I should ask, is the behavior here expected because it is useful in another context or is it expected because jamming all json objects into data.frames doesn't always work?