rexyai / RestRserve

R web API framework for building high-performance microservices and app backends
https://restrserve.org

Unreasonable memory usage during handling of the multipart requests #151

Closed dselivanov closed 4 years ago

dselivanov commented 4 years ago

From #150 (and I can reproduce this):

I’ve investigated the issue further. As far as I can gather, RestRserve quickly drains the container’s memory when the data passed to it are in multipart form.

I’ve monitored the container’s resources with docker stats and varied the container’s available memory with the --memory option when starting the container. When the container has X amount of RAM available, a file of size Y will cause the OOM killer to kill the process, resulting in an empty reply from the server. When the body is passed as text/plain, the ratio of X to Y is about 2, i.e. when the container has 1 GB of memory available, up to 500 MB can be passed to the endpoint successfully. With a multipart body, the ratio is about 7, i.e. we need to allocate about 7 times as much memory to the container as the largest file we wish to pass (not considering concurrent requests). I got roughly the same results running the container on Windows and Linux.

The multiplier of 7 is prohibitive with large file sizes and possible concurrent requests. Would you consider investigating and possibly optimizing the code?

artemklevtsov commented 4 years ago

Can you provide code to reproduce this?

dselivanov commented 4 years ago

Same as in #150. This takes more than 2 GB of RAM during request processing. Seems related to https://github.com/rexyai/RestRserve/blob/25ca00a784e3d1681c84867954ba93b95466bca1/src/parse_multipart.cpp#L106

library(data.table)
library(httr)

# Generate data
n <- 20000000  # observations
dta <- data.frame(
  numVar = round(rnorm(n), 3),
  charVar = c("foo", "bar")[(runif(n) < 0.5) + 1]
)

# Write to file
tmp <- tempfile()
fwrite(dta, tmp)
utils:::format.object_size(file.info(tmp)$size, "auto")
#> [1] "198.1 Mb"

# POST the file as multipart/form-data to a running /echo endpoint
time <- system.time({
  rs <- POST(
    url = "http://127.0.0.1:8080/echo",
    body = list(dta = upload_file(tmp)),
    encode = "multipart"
  )
})
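
For completeness, a minimal server the snippet above can be run against. The /echo handler body below is a hedged sketch (the actual endpoint from #150 isn't shown in this thread); it just pulls the uploaded file out of the request and echoes back its size:

library(RestRserve)

app <- Application$new()

# hypothetical /echo handler: read the uploaded multipart file and report its size
app$add_post(path = "/echo", FUN = function(request, response) {
  f <- request$get_file("dta")
  response$set_body(paste("received", length(f), "bytes"))
})

backend <- BackendRserve$new()
backend$start(app, http_port = 8080)
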
artemklevtsov commented 4 years ago

If the issue is in cpp_parse_multipart_body, we can call it directly, same as shown in test-parse-multipart.R.
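
A rough sketch of that approach, reusing tmp from the snippet above; the cpp_parse_multipart_body(body, boundary) call and its argument order are assumptions based on how test-parse-multipart.R exercises the parser, and the boundary string is arbitrary:

library(RestRserve)

# assemble a minimal multipart/form-data payload around the file on disk
boundary <- "------------------------boundary123"
file_raw <- readBin(tmp, what = "raw", n = file.info(tmp)$size)
head_txt <- paste0(
  "--", boundary, "\r\n",
  'Content-Disposition: form-data; name="dta"; filename="dta.csv"', "\r\n",
  "Content-Type: text/csv\r\n\r\n"
)
tail_txt <- paste0("\r\n--", boundary, "--\r\n")
body <- c(charToRaw(head_txt), file_raw, charToRaw(tail_txt))

gc(reset = TRUE)
parsed <- RestRserve:::cpp_parse_multipart_body(body, boundary)
gc()  # compare "max used" against length(body) to isolate the parser's overhead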

dselivanov commented 4 years ago

Surprisingly, the issue seems to be in request$get_file().

dselivanov commented 4 years ago

The issue is in the way the raw vector is sliced: https://github.com/rexyai/RestRserve/blob/25ca00a784e3d1681c84867954ba93b95466bca1/R/Request.R#L211.

In the case of a large file (200 MB as in the example), the sequence of indices used to extract the file from the blob is large (200M values, which is 800 MB of RAM as integers or 1600 MB as doubles). Even with R's ALTREP, these indices are materialized during the raw vector subset.
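
A rough illustration of that allocation pattern (a sketch; the exact numbers depend on the R version and how ALTREP sequences are handled during subsetting):

blob <- raw(2e8)        # ~200 MB raw blob, standing in for the request body
idx  <- seq_len(2e8)    # 200M indices, a compact ALTREP sequence at first
gc(reset = TRUE)
chunk <- blob[idx]      # subsetting materializes the indices alongside the ~200 MB result
gc()                    # "max used" shows the extra allocation for the index vector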

An option would be to use R's readBin function, but unfortunately it doesn't support an offset argument... A similar issue is described here.

So I've created a small C++ function to slice raw vectors. @rplati see dev branch.

Now memory usage should be very small (~2x the size of the multipart request, since we essentially double the memory when we read the file out of the blob and allocate it as a separate variable).
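
For reference, a rough sketch of the kind of raw-vector slicer described above, written inline with Rcpp; the actual function on the dev branch may differ, and the name raw_slice and its signature are assumptions:

library(Rcpp)

cppFunction('
RawVector raw_slice(RawVector x, double offset, double n) {
  // offset is 1-based to mimic R indexing; passing it as a double lets
  // callers address positions beyond the 32-bit integer limit
  R_xlen_t start = (R_xlen_t) offset - 1;
  R_xlen_t len   = (R_xlen_t) n;
  RawVector out(len);
  for (R_xlen_t i = 0; i < len; ++i) {
    out[i] = x[start + i];
  }
  return out;
}
')

# equivalent to x[offset:(offset + n - 1)], but no index vector is ever allocated
blob <- as.raw(sample(0:255, 1e6, replace = TRUE))
identical(raw_slice(blob, 11, 100), blob[11:110])
#> [1] TRUE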

rplati commented 4 years ago

I can confirm that parsing the multipart body no longer draws unreasonable amounts of memory when starting from Docker image rexyai/restrserve:dev. Many thanks for rapidly fixing the issue!