jeroen / mongolite

Fast and Simple MongoDB Client for R
https://jeroen.github.io/mongolite/

Add an opt-in parser tailored to rectangular data to improve the performance of large queries #148

Open atheriel opened 6 years ago

atheriel commented 6 years ago

This is an attempt to address #37. As I mentioned in that issue, I decided to add a new C-level parser that produces data frames instead of nested lists. This means that the results can simply be rbind()-ed together or, for better performance, combined with dplyr::bind_rows() or data.table::rbindlist().
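
For instance (illustrative only -- page1, page2, and pages below stand in for the per-batch data frames the new parser would produce; the names are not from the PR):

# Stand-ins for two batches of parsed results:
page1 <- data.frame(x = 1:2, y = c("a", "b"))
page2 <- data.frame(x = 3:4, y = c("c", "d"))
pages <- list(page1, page2)

df <- do.call(rbind, pages)         # base R
df <- dplyr::bind_rows(pages)       # typically faster for many pages
df <- data.table::rbindlist(pages)  # likewise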

The approach has one obvious downside: it introduces a large amount of new code.

In its current incarnation the new parser is opt-in -- I haven't changed any of the existing C code -- and is enabled by passing flat = TRUE to the find() method. This also allowed me to verify that it produces the same output as the current parser and to run the test suite with few modifications.

The new approach appears to be much faster for larger data sets. As a quick example:

library(mongolite)
library(nycflights13)  # assumed source of the `flights` dataset (~337k rows)

m <- mongo("test_flights", verbose = FALSE)
if (m$count() == 0) {
  # Insert three copies of flights, for roughly a million documents.
  m$insert(flights)
  m$insert(flights)
  m$insert(flights)
}

m$count()  # About 1 million.

microbenchmark::microbenchmark(
  before = m$find("{}", flat = FALSE),
  after = m$find("{}", flat = TRUE),
  times = 5
)
Unit: seconds
   expr       min        lq      mean    median        uq       max neval
 before 18.964424 21.575127 23.314231 23.164931 23.584352 29.282319     5
  after  3.394779  3.531145  4.390083  4.251457  4.310999  6.462035     5

As you can see, the new parser is about 5x faster here (mean of 4.4 s vs. 23.3 s).

I have two outstanding design questions:

1. Does an opt-in approach with a parameter (to find()) make sense?
2. Does the C code look OK?

I am happy to refactor things or redesign the API at your suggestion. I recognize that this is an enormous PR to review, and am willing to help you out as much as you'd like on that front. The code also contains some detailed comments explaining the approach.

codecov-io commented 6 years ago

Codecov Report

:exclamation: No coverage uploaded for pull request base (master@32147b3). The diff coverage is 4.53%.


@@           Coverage Diff            @@
##             master    #148   +/-   ##
========================================
  Coverage          ?   36.5%           
========================================
  Files             ?     130           
  Lines             ?   22295           
  Branches          ?       0           
========================================
  Hits              ?    8139           
  Misses            ?   14156           
  Partials          ?       0
Impacted Files   Coverage Δ
src/cursor.c     4% <0%> (ø)
R/client.R       37.37% <0%> (ø)
R/mongo.R        49.7% <100%> (ø)
R/stream.R       38.93% <57.14%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 32147b3...98832aa.

jimhester commented 3 years ago

@jeroen is the main holdup for this PR the dplyr / data.table dependencies?

If so, I think we could just write a basic concatenation function in C that would give comparable performance to the dplyr and data.table implementations and avoid the extra dependencies.

All the column types in the pages should be consistent, right? So it would just be a matter of allocating each output column at the total number of rows and copying the data into place.
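
For illustration, here is a minimal R sketch of that strategy; a C implementation would follow the same shape. The bind_pages helper and its assumptions (every page is a data frame with identical columns of atomic types) are hypothetical, not part of this PR.

# Hypothetical sketch: preallocate each output column at the total row
# count, then copy each page's data into place. Assumes consistent,
# atomic column types across pages.
bind_pages <- function(pages) {
  stopifnot(length(pages) > 0)
  total <- sum(vapply(pages, nrow, integer(1)))
  cols <- names(pages[[1]])
  out <- lapply(cols, function(col) {
    column <- vector(typeof(pages[[1]][[col]]), total)
    offset <- 0L
    for (page in pages) {
      n <- nrow(page)
      column[offset + seq_len(n)] <- page[[col]]
      offset <- offset + n
    }
    column
  })
  names(out) <- cols
  as.data.frame(out, stringsAsFactors = FALSE)
}

bind_pages(list(
  data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE),
  data.frame(x = 3:4, y = c("c", "d"), stringsAsFactors = FALSE)
))

Growing the result with repeated rbind() instead would reallocate the accumulated data frame on every step, which is what makes the single-allocation approach faster.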

atheriel commented 3 years ago

@jimhester I think that would be an easy solution to that issue. But I suspect the main roadblock for this is the size of the new code. From the issue thread:

I would be interested if we can make it in a way that doesn't introduce a hard dependency on data.table and also doesn't introduce too much new complexity in the code.

The other outstanding issue (in my view) is whether this should just become the default parser or if it needs to be hidden behind a parameter (called flat at present) at all.

jeroen commented 3 years ago

My main worry is consistency. Collections in mongolite can contain arbitrarily deeply nested JSON objects. Some applications use very complex, variable compound structures for each record.

The jsonlite::simplify function is slow because it does a lot of complicated, messy reshaping to convert lists into vectors and data frames where possible. This process has many edge cases, for which jsonlite may behave differently than other packages. So I don't think this is just a drop-in replacement; it will change the results.
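
To illustrate the kind of edge case involved (a small standalone example, not from the PR):

# Consistent records simplify to a plain data frame:
jsonlite::fromJSON('[{"a": 1}, {"a": 2}]')
#>   a
#> 1 1
#> 2 2

# Inconsistent records still simplify, but yield a list column, which a
# purely rectangular parser cannot represent the same way:
jsonlite::fromJSON('[{"a": 1}, {"a": [1, 2]}]')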

I also don't feel comfortable taking on maintenance of 700 lines of C code. I would rather design an extensible solution that allows users or packages to provide their own simplifier function than adopt all the dependencies and maintenance work in mongolite.
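
One possible shape for such an extension point (purely hypothetical -- the simplifier argument below does not exist in mongolite):

# Hypothetical API sketch: the caller supplies a function that turns each
# page of parsed records into whatever structure they prefer, so e.g.
# data.table stays an optional dependency of the caller, not of mongolite.
m <- mongolite::mongo("test_flights")
out <- m$find("{}", simplifier = function(records) {
  data.table::rbindlist(records, fill = TRUE)
})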

atheriel commented 3 years ago

My main worry is consistency. Collections in mongolite can contain arbitrarily deeply nested JSON objects. Some applications use very complex, variable compound structures for each record.

I think this parser falls back to using the existing code whenever there are nested objects, so it should handle them the same way.

I don't feel comfortable taking on maintenance of 700 lines of C code. I would rather design an extensible solution that allows users or packages to provide their own simplifier function than adopt all the dependencies and maintenance work in mongolite.

I respect that. But I'm not sure how it would be possible to make this extensible at the right layer -- the performance improvements here come from handling the C-level mongo cursor directly.