sailthru / tidyjson

Tools for using dplyr with JSON data

iteratively constructing a json data frame makes problems on 'later' mutation #48

Open behrica opened 9 years ago

behrica commented 9 years ago

I construct a json data frame in a loop, like this:

papers <- data.frame()

for (....) {
  current_papers <- content %>% as.tbl_json %>%
    enter_object("items") %>% gather_array() %>%
    spread_values(title = jstring("title"),
                  snippet = jstring("snippet"),
                  link = jstring("link"),
                  displayLink = jstring("displayLink")) %>%
    enter_object("pagemap", "metatags") %>% gather_array() %>%
    spread_values(creationdate = jstring(creationdate_field),
                  moddate = jstring(moddate_field))

  papers <- rbind(papers, current_papers)
}

This seems to work fine (the data frame looks good), but using "mutate" from dplyr on it, like this:

papers <-
    papers %>%
    mutate(clickLink=paste0('=HYPERLINK("',link,'","link")'))

gives a very strange error message:

Error in `$<-.data.frame`(`*tmp*`, "..JSON", value = list(list(author = "Microsoft Office User",  : 
  replacement has 9 rows, data has 40

Converting it to a data frame first does work:

  papers %>% data.frame() %>%
    mutate(clickLink=paste0('=HYPERLINK("',link,'","link")'))

Is this a bug in tidyjson, or am I doing something wrong?

vats-div commented 9 years ago

I think tidyjson does not support rbind. That is, if we bind two tbl_json objects, we lose the structure of the JSON object. You can probably see this by typing attr(papers, 'JSON'): you'll only see the first JSON object. When you call data.frame on it first, mutate runs on a plain data.frame object and does not need to do any JSON object manipulation.
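
The mismatch described above can be reproduced without tidyjson. The following is only a base-R sketch of the mechanism, not tidyjson's actual internals: it assumes the per-row JSON is stored in an attribute named "JSON" (as in the attr() call above), and the data frames `a` and `b` are made-up stand-ins for two parsed results. rbind keeps the attributes of its first argument, so the combined frame has more rows than stored JSON entries.

```r
# Two plain data frames, each carrying one "JSON" entry per row
# (a stand-in for what a tbl_json object would hold).
a <- data.frame(id = 1:2)
attr(a, "JSON") <- list("doc-a1", "doc-a2")

b <- data.frame(id = 3:5)
attr(b, "JSON") <- list("doc-b1", "doc-b2", "doc-b3")

# rbind stacks the rows but keeps only the first argument's attributes.
combined <- rbind(a, b)

rows_combined <- nrow(combined)                  # rows from both frames
json_entries  <- length(attr(combined, "JSON"))  # entries from `a` only

print(c(rows = rows_combined, json = json_entries))
```

Any later operation that tries to line the JSON entries up with the rows again would then see two different lengths, which is consistent with the "replacement has 9 rows, data has 40" error reported above.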

behrica commented 9 years ago

I thought about this. So in my loop I should probably convert to a normal data frame before doing rbind.

Is there a way tidyjson could fail when rbind is called?

It seems that the tidyjson object is a data.frame (and so should support all operations on one), but in practice it does not.

Maybe the documentation could mention that it is a good idea to convert to a data frame after having finished the JSON parsing.

Maybe even better: could you add a specific method, "toDataFrame" or similar, which does the conversion and removes the specific index columns (which you never care about after having finished the JSON handling)?
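
A minimal sketch of the workaround proposed above, in base R only. Both `to_plain_df` and `parse_page` are hypothetical names and not part of tidyjson: `parse_page` stands in for the parsing pipeline from the first post, and the index column names `document.id` and `array.index` and the "JSON" attribute are assumptions about what the parsed object carries.

```r
# Hypothetical helper (not part of tidyjson): reduce a parsed result to a
# plain data.frame, drop the stored JSON, and remove the index columns.
to_plain_df <- function(x, index_cols = c("document.id", "array.index")) {
  x <- as.data.frame(x, stringsAsFactors = FALSE)
  attr(x, "JSON") <- NULL  # assumed attribute name; a no-op if it is absent
  x[, setdiff(names(x), index_cols), drop = FALSE]
}

# Stand-in for the tidyjson pipeline: returns two rows per "page".
parse_page <- function(i) {
  out <- data.frame(document.id = 1L,
                    array.index = 1:2,
                    title = paste0("paper-", i, "-", 1:2),
                    stringsAsFactors = FALSE)
  attr(out, "JSON") <- list("raw-1", "raw-2")
  out
}

# Convert each piece first, collect the pieces in a list, then bind once.
pieces <- lapply(1:3, function(i) to_plain_df(parse_page(i)))
papers <- do.call(rbind, pieces)

print(papers)
```

Collecting the pieces in a list and binding once also avoids growing `papers` inside the loop, and because every piece is already a plain data frame, no JSON bookkeeping is left to fall out of sync.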

colearendt commented 7 years ago

Support for bind_rows has been added to the development version here. Use devtools::install_github('jeremystan/tidyjson') to explore - I find this version superior to the CRAN version.

Further, tbl_df can be used to discard the JSON components of the tbl_json object:

parsed <- my_json %>% ... ## parse the JSON
more_munging <- parsed %>% tbl_df %>% bind_rows... ## Other manipulation