jeroen / mongolite

Fast and Simple MongoDB Client for R
https://jeroen.github.io/mongolite/

Memory issues when reading a large collection #217

Closed: demirev closed this issue 3 years ago

demirev commented 3 years ago

I have an issue where reading a big collection from mongo causes the R process to take up additional memory that is never released.

E.g. in the snippet below I create some data with 1.6M rows and 16 columns and insert it into mongo:

library(dplyr)
library(purrr)
library(mongolite)

fakeData <- map_dfr(1:50000, function(x) bind_cols(iris[1:32, ], mtcars))

connection <- mongo(
  db = "test",
  collection = "test0",
  url = "mongodb://localhost:27017"
)

connection$insert(fakeData)

Then, after restarting the R process, I run:

library(dplyr)
library(purrr)
library(mongolite)

connection <- mongo(
  db = "test",
  collection = "test0",
  url = "mongodb://localhost:27017"
)

fakeReadout <- connection$find('{}')
rm(fakeReadout)
connection$disconnect()
rm(connection)
gc(full = TRUE)

After reading and then removing the data (as well as disconnecting the connection and running gc just in case), I now have an empty workspace, but R still takes up ~18% of my RAM as indicated by top.

Is this some issue related to mongolite's memory management, or am I missing something?

PS: the output of sessionInfo:

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=bg_BG.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=bg_BG.UTF-8
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=bg_BG.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=bg_BG.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] mongolite_2.3.0 purrr_0.3.4     dplyr_1.0.4

loaded via a namespace (and not attached):
 [1] assertthat_0.2.1 crayon_1.3.4     R6_2.5.0         jsonlite_1.7.1   DBI_1.1.0        lifecycle_0.2.0  magrittr_2.0.1   pillar_1.4.6
 [9] rlang_0.4.10     rstudioapi_0.13  blob_1.2.1       vctrs_0.3.6      generics_0.1.0   ellipsis_0.3.1   tools_4.0.3      glue_1.4.2
[17] compiler_4.0.3   askpass_1.1      pkgconfig_2.0.3  openssl_1.4.3    tidyselect_1.1.0 tibble_3.0.4

jeroen commented 3 years ago

I think R keeps some memory in its pool even after the objects have been collected. Did you rule out that the same thing happens without mongolite?
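
One rough way to check (a sketch reusing the connection from above): gc() reports both what R is currently using and the peak since the last reset, while top also counts memory that R has already freed internally but not yet returned to the OS.

gc(reset = TRUE)                      # reset the "max used" counters
fakeReadout <- connection$find('{}')
rm(fakeReadout)
gc(full = TRUE)                       # "used" should drop back down; "max used" shows the peak hit during find()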

demirev commented 3 years ago

Yes - reading the same dataset from an RDS file results in only ~4% memory allocation to R (out of 16G), so a 4-5x difference compared to using find().

jeroen commented 3 years ago

That doesn't prove that it is leaking though. If you run your original experiment several times, does it keep adding more and more memory, or does it stabilize?

The simplification step in connection$find() just needs a lot of memory, which R might keep around for future allocations.

demirev commented 3 years ago

I think it stabilizes, so it could be just the normal behavior. I'll confirm tomorrow after I try it a couple more times though.

If this turns out to be the case, do you have any advice on how to reduce the memory usage of find? Or maybe just a rule of thumb for how much memory I need relative to the size of the data I'm trying to read?

I originally ran into this in a workflow where the data I want to read is small enough to fit in memory, but running the query takes up all the resources and the OOM killer steps in.

jeroen commented 3 years ago

The most memory-expensive part is the final operation, which has to combine all of the JSON documents into one big data frame.

If you set a handler function in your find() to incrementally process chunks of data, then it will probably need much less memory.
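
For example, a minimal sketch along these lines (reusing the test collection from above) should need far less memory, because each chunk is processed and discarded as it arrives instead of being accumulated:

library(mongolite)

connection <- mongo(
  db = "test",
  collection = "test0",
  url = "mongodb://localhost:27017"
)

rowsSeen <- 0
connection$find('{}', handler = function(page) {
  # each page arrives already simplified to a data frame of up to 'pagesize' rows
  rowsSeen <<- rowsSeen + nrow(page)  # summarise/process the chunk, then let it go
}, pagesize = 5000)
rowsSeen  # total number of documents read, without ever holding them all at once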

demirev commented 3 years ago

Thanks, I'll try that. Meanwhile I'll mark this as closed, since it doesn't seem to be a leak.

demirev commented 3 years ago

Just for completeness:

I solved my memory issues by using a custom handler as suggested. Specifically:

findWithHandler <- function(connection, ...) {
  handler <- R6::R6Class(
    classname = "mongo_handler",
    public = list(
      data = list(),
      handle = function(x) {
        # store each chunk keyed by a running index so the original order can be restored later
        count <- parent.frame(n = 2)$count # count comes from within the mongo_stream_in function
        self$data[[as.character(count + length(x))]] <- x
      },
      reduce = function() {
        # bind all stored chunks back together in their original order
        dplyr::bind_rows(self$data[order(as.numeric(names(self$data)))])
      }
    )
  )

  thisHandler <- handler$new()

  connection$find(handler = thisHandler$handle, ...)

  thisHandler$reduce()
}
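
Calling it is then a drop-in replacement for a plain find(), e.g. against the test collection from above:

library(mongolite)

connection <- mongo(db = "test", collection = "test0", url = "mongodb://localhost:27017")
fakeReadout <- findWithHandler(connection, query = '{}')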

This is almost the same thing that happens inside the mongo_stream_in function when no handler is supplied.

The only difference is in what gets accumulated at each step. mongo_stream_in stores the page object (the output of mongo_cursor_next_page, i.e. a direct conversion of the mongo JSON documents to lists) in the out environment and converts all pages to a single data frame at the end, while the handler above stores a data frame per page (the argument passed to the handler is as.data.frame(simplify(page))) and binds them together to return the final data.

It turns out that as.data.frame(page) is much, much smaller than page. For the data I was working with it was something like a 5x+ difference in object.size, so avoiding the intermediate lists and working with data frames throughout let me load the data in memory.
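
A rough way to see that difference for yourself (approximating the page structure with a plain list of documents, not mongolite's exact internals):

page <- lapply(1:1000, function(i) as.list(mtcars[(i %% 32) + 1, ]))  # roughly the list-of-documents shape of a page
df <- dplyr::bind_rows(page)                                          # the same data as a single data frame
object.size(page)  # every document carries its own list and names overhead
object.size(df)    # just a handful of column vectors, typically several times smaller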