rstudio / plumber

Turn your R code into a web API.
https://www.rplumber.io

R process not releasing memory; Want aggressive garbage collection #927

Closed schloerke closed 10 months ago

schloerke commented 10 months ago

Example application or steps to reproduce the problem

Router:

library(plumber)
library(readr)

#* @apiTitle Plumber Test Memory Leak

#* Download data
#* @get /download_data
#* @serializer csv
function() {
    data <- data.frame(
        COLUMN_1 = rep("COLUMN_1",1e+8),
        COLUMN_2 = rep("COLUMN_2",1e+8),
        COLUMN_3 = rep("COLUMN_3",1e+8),
        COLUMN_4 = rep("COLUMN_4",1e+8),
        COLUMN_5 = rep("COLUMN_5",1e+8)
    )
    return(data)
}

#* Release Memory
#* @get /release_memory
function() {
    gc()
    return("ok")
}

Describe the problem in detail

Using the approach from https://github.com/rstudio/plumber/issues/496#issuecomment-541402503 (where LD_PRELOAD is set to a better malloc library), we see a smaller final footprint after gc() is called. However, gc() still has to be triggered manually before the R process footprint shrinks.

How can we call gc() and not slow down our routes?

schloerke commented 10 months ago

It is possible to add a postserialize hook that will run after the serialization has occurred.

Ex:

library(plumber)
pr() %>%
  pr_hook("postserialize", function(req){
    message("Routing a request for ", req$PATH_INFO)
    # Only run this hook if the request is for the root path
    if (req$PATH_INFO == "/") {
      message("in postserialize")
      later::later(function() {
        message("in postserialize later")
        message("calling gc()!")
        gc()
      }, delay = 0)
      message("exiting postserialize")
    }
  }) %>%
  pr_handle("GET", "/", function(){
    message("in route")
    123
  }) %>%
  pr_run()

Running a request against / gives these messages in the console:

in route
Routing a request for /
in postserialize
exiting postserialize
in postserialize later
calling gc()!

This shows that the gc() call happens after the postserialize hook has exited. It also happens after the response has been sent, although the print statements do not show that.

The logic could be updated to work for every route, but that is a little too aggressive. Try to limit your calls to gc(), as it takes tangible time to run. It is recommended to do this only for routes that are believed or known to need a lot of memory cleanup.
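As a rough sketch of limiting gc() to memory-heavy routes, the hook can check the request path against a list of known offenders. The names `heavy_paths`, `needs_gc`, `gc_hook`, and `build_router`, as well as the `/ping` route, are hypothetical and only for illustration; the small data frame stands in for the large one from the original router.

```r
library(plumber)

# Hypothetical list of routes known to allocate a lot of memory
heavy_paths <- c("/download_data")

# TRUE only when the request path is one of the memory-heavy routes
needs_gc <- function(path, paths = heavy_paths) {
  isTRUE(path %in% paths)
}

# postserialize hook: schedule gc() for heavy routes only
gc_hook <- function(req) {
  if (needs_gc(req$PATH_INFO)) {
    # Defer gc() to the next event-loop tick so the hook itself returns quickly
    later::later(function() gc(), delay = 0)
  }
}

# Build (but do not start) a router with one heavy and one light route
build_router <- function() {
  pr() %>%
    pr_hook("postserialize", gc_hook) %>%
    pr_get("/download_data", function() data.frame(x = 1:10)) %>%
    pr_get("/ping", function() "ok")
}

# To serve the API:
# build_router() %>% pr_run()
```

With this setup, a request to /ping skips gc() entirely, so cheap routes do not pay the collection cost.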

Why wouldn't you add it earlier in the execution of the route?

Your route may use promises or future to keep the main R worker free to execute other requests. We should only run gc() when we need to reduce a large footprint. If we add it earlier and a promise-like route executes, gc() would run before the promise-like route has resolved, which would leave the larger footprint from the route in place (until a follow-up gc() is called).

Ideally, a route is allowed to build up its larger footprint naturally, and once everything for the route has completed, we call gc() to reduce the memory footprint.
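The same idea can be sketched for a promise-based route. This is a minimal illustration, assuming the promises and future packages are installed; `build_async_router` and the `/slow` route are hypothetical names. The cleanup stays in the postserialize hook, so it is scheduled only once the route's result has been serialized rather than when the handler first returns its promise.

```r
library(plumber)

# Build (but do not start) a router with an asynchronous route
build_async_router <- function() {
  pr() %>%
    pr_hook("postserialize", function(req) {
      # Scheduled after serialization, i.e. once the route's work is finished
      later::later(function() gc(), delay = 0)
    }) %>%
    pr_get("/slow", function() {
      # Returns a promise; with a parallel future plan the work runs off the main R process
      promises::future_promise({
        Sys.sleep(1)
        "done"
      })
    })
}

# To serve the API with real concurrency, set a parallel plan first:
# future::plan(future::multisession)
# build_async_router() %>% pr_run()
```

Note that this gc() call only collects memory in the main R process; memory held by future workers is managed by those worker processes.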