ropensci / nodbi

Document DBI connector for R
https://docs.ropensci.org/nodbi

Splitting a dataframe and serialising rows #7

Closed richfitz closed 1 year ago

richfitz commented 9 years ago

I'm going to implement this in C++ for rrlite, but it might be really common. The problem: for me, reading diamonds out row by row using `[.data.frame` is too slow to be practically useful. While I'm at it, I might as well use RApiSerialize for the serialisation.

Incredibly, `jsonlite:::asJSON(diamonds, collapse = FALSE)` is faster than `split(diamonds, seq_len(nrow(diamonds)))` (0.446 s vs 21 s), despite doing much more work.
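The comparison above can be reproduced on a smaller frame (a sketch, assuming jsonlite is installed; `asJSON` is an unexported internal, so it may change between versions, and timings will vary by machine):

```r
library(jsonlite)

df <- data.frame(x = runif(1e4), y = sample(letters, 1e4, replace = TRUE))

# Row-wise split via [.data.frame -- the slow path discussed above
t_split <- system.time(s1 <- split(df, seq_len(nrow(df))))

# jsonlite's internal row-wise serialiser -- does more work, runs faster
t_json <- system.time(s2 <- jsonlite:::asJSON(df, collapse = FALSE))

length(s1)  # one list element per row
```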

As soon as this looks to be common functionality, I guess we should move it into here?

jeroen commented 9 years ago

I mentioned this to @hadley at the unconf. It is unclear to me why selecting some rows from a data frame is so terribly slow. There should probably be a special function for this in base R.

richfitz commented 9 years ago

I totally agree. It shouldn't actually be that hard, though: `apply` is heaps faster: `apply(diamonds, 1, list)` (not quite equivalent) runs in about 0.5 s. Exposing whatever underlying code `apply` uses to address rows would probably be useful.

jeroen commented 9 years ago

`apply` coerces the data frame to a matrix, so you lose the types of the columns.
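A minimal illustration of the coercion described here: `apply()` goes through `as.matrix()`, which forces every value to a common type, so numeric columns come back as character.

```r
df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)

rows <- apply(df, 1, list)

# Each "row" is now a named character vector -- the integer column x
# has been coerced to character by the matrix conversion.
class(rows[[1]][[1]])  # "character"
rows[[1]][[1]]["x"]    # "1", not 1L
```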

richfitz commented 9 years ago

:cry: I didn't even check, but you are of course right. So sad.

sckott commented 9 years ago

:+1: agree this should be much faster.

richfitz commented 9 years ago

A small bit of Rcpp can split diamonds into a list of lists (which is good enough) in about 0.07 s, so that's going to be much more practical :rocket:
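The Rcpp code isn't shown in the thread; a pure-R sketch producing the same list-of-lists shape (slower than C++, but type-preserving, unlike `apply()`) might look like this. `split_rows` is a hypothetical helper, not the actual rrlite implementation:

```r
# Split a data frame into a list of one-row lists without matrix coercion.
# A data frame is a list of columns, so indexing each column with [[ keeps
# the original column types intact.
split_rows <- function(df) {
  lapply(seq_len(nrow(df)), function(i) lapply(df, "[[", i))
}

df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
rows <- split_rows(df)
rows[[2]]$x  # 2L -- integer type preserved
```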

karthik commented 9 years ago

I hadn't heard of https://github.com/eddelbuettel/rapiserialize

so useful! I should learn to search more before implementing anything.

richfitz commented 9 years ago

Ouch, dude :stuck_out_tongue:

hadley commented 9 years ago

This is what we do in dplyr: https://github.com/hadley/dplyr/blob/master/R/tbl-df.r#L114-L146.

It's a little tricky to do it only in C++ because you may want to do S3 dispatch. If that's needed, you might have a preprocessing step like:

```r
is_object <- vapply(df, is.object, logical(1))
df[is_object] <- lapply(df[is_object], as.character)
```

This is what I do in `readr::write_csv()`.
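A quick demonstration of that preprocessing step on a frame with an S3 column (a factor here; dates would behave the same way):

```r
df <- data.frame(x = 1:2, f = factor(c("lo", "hi")))

# Find columns carrying an S3 class and flatten them to character,
# so downstream C++ code never needs to do S3 dispatch itself.
is_object <- vapply(df, is.object, logical(1))
df[is_object] <- lapply(df[is_object], as.character)

class(df$f)  # "character" -- the factor levels are gone, which is why
             # a round trip no longer preserves the original column type
```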

richfitz commented 9 years ago

Thanks - that's very useful and saves poking around. I've implemented almost the same coercion code for factors alone, but you're right that it'd need doing for other types too. The downside is that the round trip would then not preserve everything. I'm looking forward to using readr.

rfhb commented 1 year ago

Closing: it is unclear whether this is still relevant, and streaming plus serialisation as NDJSON is now used for some of the database backends.