Open atheriel opened 6 years ago
:exclamation: No coverage uploaded for pull request base (master@32147b3). Click here to learn what that means. The diff coverage is 4.53%.
```
@@           Coverage Diff            @@
##             master    #148   +/-   ##
========================================
  Coverage         ?    36.5%
========================================
  Files            ?      130
  Lines            ?    22295
  Branches         ?        0
========================================
  Hits             ?     8139
  Misses           ?    14156
  Partials         ?        0
========================================
```
Impacted Files | Coverage Δ | |
---|---|---|
src/cursor.c | 4% <0%> (ø) | |
R/client.R | 37.37% <0%> (ø) | |
R/mongo.R | 49.7% <100%> (ø) | |
R/stream.R | 38.93% <57.14%> (ø) | |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 32147b3...98832aa. Read the comment docs.
@jeroen is the main holdup for this PR the dplyr / data.table dependencies?
If so I think we could just write a basic concatenation function in C that will give you comparable performance to the dplyr and data.table implementations and avoid the extra dependencies.
All the column types in the pages should be consistent, right? So it would just be a matter of allocating each of the output columns with the total number of rows and copying the data.
@jimhester I think that would be an easy solution to that issue. But I suspect the main roadblock for this is the size of the new code. From the issue thread:
> I would be interested if we can make it in a way that doesn't introduce a hard dependency on data.table and also doesn't introduce too much new complexity in the code.
The other outstanding issue (in my view) is whether this should just become the default parser or if it needs to be hidden behind a parameter (called flat at present) at all.
My main worry is consistency. Collections in mongolite can contain arbitrarily deeply nested JSON objects. Some applications use very complex, variable compound structures for each record.
The jsonlite::simplify function is slow because it does a lot of complicated, messy reshaping on the structures to convert lists into vectors and data frames where possible. This process has many edge cases, for which jsonlite might have different behavior than other packages. So I don't think this is just a drop-in replacement; it will change the results.
And also I don't feel comfortable taking on maintenance of 700 lines of C code. I would rather design an extensible solution that allows users or packages to provide their own simplifier function, rather than adopting all the dependencies and maintenance work in mongolite.
> My main worry is consistency. Collections in mongolite can contain arbitrarily deeply nested JSON objects. Some applications use very complex, variable compound structures for each record.
I think this parser falls back to using the existing code whenever there are nested objects, so it should handle them the same way.
> I don't feel comfortable taking on maintenance of 700 lines of C code. I would rather design an extensible solution that allows users or packages to provide their own simplifier function, rather than adopting all the dependencies and maintenance work in mongolite.
I respect that. But I'm not sure how it would be possible to make this extensible at the right layer -- the performance improvements here come from handling the C-level mongo cursor directly.
This is an attempt to address #37. As I mentioned in that issue, I decided to add a new C-level parser that produces data frames instead of nested lists. This means that the results can simply be rbind()-ed together, or, for improved performance, combined using dplyr::bind_rows() or data.table::rbindlist(). The approach has one obvious downside: it introduces a large amount of new code.
In its current incarnation, the new parser is opt-in -- I haven't changed any of the existing C code -- requiring you to pass a flat = TRUE parameter to the find() method. This also allowed me to test that it produces the same output as the current parser, and run the test suite with few modifications.

The new approach appears to be much faster for larger data sets. As a quick example:
As you can see, it's about 5x faster here.
I have a few outstanding design questions, namely:
1) Does an opt-in approach with a parameter (to find()) make sense? and
2) Does the C code look OK?

I am happy to refactor things or redesign the API at your suggestion. I recognize that this is an enormous PR to review, and am willing to help you out as much as you'd like on that front. The code also contains some detailed comments explaining the approach.