Closed richfitz closed 1 year ago
I mentioned this to @hadley at the unconf. It is unclear to me why selecting some rows from a data frame is so terribly slow. There should probably be a special function for this in base R.
I totally agree. It should not actually be that hard, though: `apply` is heaps faster: `apply(diamonds, 1, list)` (not quite equivalent) runs in about 0.5s. Exposing whatever underlying code `apply` uses to address rows would probably be useful.
`apply` coerces the data frame to a matrix, so you lose the types of the columns
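A quick base-R illustration of the coercion (a sketch on a small toy data frame rather than diamonds):

```r
# apply() goes through as.matrix(), so mixed columns collapse
# to a common type (here character), losing the numeric column's type.
df <- data.frame(x = 1:3, y = letters[1:3])
rows <- apply(df, 1, list)
str(rows[[1]][[1]])
# Each "row" is a named character vector, not a typed list.
```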
:cry: I didn't even check, but you are of course right. So sad.
:+1: agree this should be much faster.
A small bit of Rcpp can split diamonds into a list of lists, which is good enough, in about 0.07s, so that's going to be much more practical :rocket:
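A minimal Rcpp sketch of that idea (hypothetical; the actual code isn't shown in this thread). It only handles numeric, integer, and character columns and does no S3 dispatch, which is exactly the limitation discussed below:

```cpp
#include <Rcpp.h>
using namespace Rcpp;

// Split a data frame into a list of per-row lists, indexing the
// column vectors directly instead of going through [.data.frame.
// [[Rcpp::export]]
List df_to_rows(List df) {
  int ncol = df.size();
  int nrow = Rf_length(df[0]);
  CharacterVector names = df.attr("names");
  List out(nrow);
  for (int i = 0; i < nrow; ++i) {
    List row(ncol);
    for (int j = 0; j < ncol; ++j) {
      SEXP col = df[j];
      switch (TYPEOF(col)) {
        case REALSXP: row[j] = REAL(col)[i]; break;
        case INTSXP:  row[j] = INTEGER(col)[i]; break;  // factors become codes!
        case STRSXP:  row[j] = String(STRING_ELT(col, i)); break;
        default:      row[j] = R_NilValue;  // unsupported types dropped
      }
    }
    row.attr("names") = names;
    out[i] = row;
  }
  return out;
}
```

Note the `INTSXP` branch: factor columns are integer vectors underneath, so without a preprocessing step they come back as bare codes.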
I hadn't heard of https://github.com/eddelbuettel/rapiserialize, so useful! I should learn to search more before implementing anything.
Ouch, dude :stuck_out_tongue:
This is what we do in dplyr: https://github.com/hadley/dplyr/blob/master/R/tbl-df.r#L114-L146.
It's a little tricky to do it only in C++ because you may want to do S3 dispatch. If that's needed, you might have a preprocessing step like:

```r
is_object <- vapply(df, is.object, logical(1))
df[is_object] <- lapply(df[is_object], as.character)
```
This is what I do in readr::write_csv()
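The reason dispatch matters: `as.character()` does the right thing per class, which a plain type switch in C++ would not. A toy illustration:

```r
f <- factor(c("lo", "hi"))
as.character(f)
# "lo" "hi"  -- the labels, not the underlying integer codes 2 1

d <- as.Date("2015-06-01")
as.character(d)
# "2015-06-01"  -- an ISO string, not the underlying day count
```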
Thanks - that's very useful and saves poking around. I've implemented almost the same coercion code for factors alone, but you're right that it'd need doing for other types. The downside is that the roundtrip would not preserve everything. I'm looking forward to using readr.
Closing, as it's unclear whether this is still relevant, and streaming and serialisation as NDJSON are now used for some of the database backends.
I'm going to implement this in C++ for rrlite, but it might be really common. The problem: for me, reading out of diamonds row by row using `[.data.frame` is too slow to be practically useful. At the same time I might as well use RApiSerialize to do the serialization.

Incredibly, `jsonlite:::asJSON(diamonds, collapse = FALSE)` is faster than `split(diamonds, seq_len(nrow(diamonds)))` (0.446s vs 21s) despite doing so much more.

As soon as this looks to be common functionality, I guess we should move it into here?
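For reference, a pure-R row split that avoids `[.data.frame` entirely (a sketch, not from this thread) already preserves column types and sidesteps the worst of the overhead:

```r
# Each element is a named list of length ncol(df); types are kept,
# since each column is indexed with [[ rather than coerced.
row_list <- function(df) {
  lapply(seq_len(nrow(df)), function(i) lapply(df, `[[`, i))
}
```

This still pays per-row R overhead for large data, which is why a C++ split remains attractive.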