tc / RMongo

R client to interface with MongoDB

dbGetQuery() should optionally skip the attempt to create a data frame #23

Open mgoeker opened 11 years ago

mgoeker commented 11 years ago

Dear RMongo team,

thanks a lot for creating this useful R package. We are currently trying to use RMongo in conjunction with our opm package for database I/O. In our case we are using S4 objects that can be converted to and from JSON or YAML via lists. These are nested lists with a partially undefined structure, which fits the MongoDB concept well.

I do admit that yielding data frames is normally an appropriate approach in R because users want to work with rectangular data. But in our case the underlying objects are non-rectangular, and we already have our own customized conversion functions to get the S4 objects back from a list.

When trying RMongo with our kind of data, database input was fast and coding was a pleasure (except for the need to convert all dots in R names, but this is apparently a restriction of MongoDB itself). Database queries were more problematic because they were much slower, even though the database contained only the few previously inserted objects. When running R CMD Rprof on dbGetQuery() I noticed that most of the running time is spent within scan(), which seems to be called via read.csv(). So most of the time may be spent on unnecessarily trying to create a data frame. The data frames were indeed convertible because each field was a JSON string:

mongo2opm <- function(x) {
  x <- split(x, seq_len(nrow(x)))
  x <- rapply(x, rjson::fromJSON, "character", NULL, "replace")
  opms(x, precomputed = FALSE, skip = FALSE, group = TRUE)
}

...but there might be some unnecessary code here. (Note that opms() is our own function to obtain the objects we want.) So my question is whether dbGetQuery() could optionally skip all attempts to create a data frame and either return a list or a JSON character string, or whether there are any other solutions to the problem.
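The profiling observation above can be approximated without a database. The following is a minimal sketch, assuming that the query result arrives as CSV text that is then parsed with read.csv(); the in-memory CSV lines and all variable names are made up for illustration and are not RMongo code:

```r
# Stand-in for the CSV text a query might return (assumption: the real
# dbGetQuery() hands a comparable CSV string to read.csv()).
n <- 100000L
csv_lines <- c("a,b", sprintf("%d,%d", seq_len(n), rev(seq_len(n))))

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)                        # start the sampling profiler
for (i in 1:5) {
  df <- read.csv(text = csv_lines)      # parse the text into a data frame
}
Rprof(NULL)                             # stop profiling

prof <- summaryRprof(prof_file)         # summarise the collected samples
print(head(prof$by.total, 10))          # functions ranked by total time
```

If the data frame construction is the bottleneck, scan() and read.table() would be expected near the top of the by.total table, which would match the behaviour reported for dbGetQuery().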

Yours Markus

tc commented 11 years ago

This sounds like a good feature. Maybe it can be introduced as dbGetQueryJson or dbGetQueryRaw.

Would you like to take a shot at it? I'll be happy to review it.

I'm the only dedicated RMongo dev at the moment, so I could use your help.

Tommy Chheng

mgoeker commented 11 years ago

Dear Tommy,

Sorry for the delayed response. Last week I had a very tight schedule.

I am not familiar with Scala, hence I made a first attempt in R (file attached). It might be more efficient, however, to avoid the Scala dbGetQuery method (and the call of toCsvOutput), and to use the JSON parser that comes with the Java libraries that are loaded anyway. But at least the return value is the kind of object I had in mind.
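The attached file is not reproduced in this thread. Purely as an illustration of the kind of return value discussed here (one nested list per document instead of a data frame), the sketch below converts a data frame whose fields are JSON strings into a list of lists. The helper name rows_to_lists and the toy data are hypothetical, and the rjson package is assumed to be installed:

```r
# Toy stand-in for a query result in which every field is a JSON string.
df <- data.frame(
  meta = c('{"strain":"A","tags":["x","y"]}', '{"strain":"B"}'),
  data = c("[1,2,3]", "[4,5]"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one list per row, with each field parsed from JSON.
rows_to_lists <- function(x) {
  lapply(seq_len(nrow(x)), function(i) {
    row <- x[i, , drop = FALSE]         # one-row data frame, names preserved
    lapply(row, rjson::fromJSON)        # parse each JSON string in that row
  })
}

docs <- rows_to_lists(df)
str(docs, max.level = 2)
```

A function returning such lists directly would avoid building the data frame at all, which is the point of the request.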

Yours Markus



tc commented 11 years ago

Hi, can you attach the modifications as a pull request?