Open PietrH opened 1 month ago
OpenCPU supports outputting results as feather and parquet, and you can pass arguments to their respective writing functions via the URL:
https://github.com/opencpu/opencpu/blob/80ea353c14c8601f51ed519744149411d9cc3309/NEWS#L20-L23
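Concretely, that should allow a request along these lines (a sketch only: the `compression` query parameter mapping onto `arrow::write_feather()`'s `compression` argument is my reading of the NEWS entry, and the temp key is a placeholder):

```r
# Sketch: fetch a stored session value as feather instead of rds, forwarding a
# writer argument (compression) via the query string. The temp key is a placeholder.
httr::GET(
  "https://opencpu.lifewatch.be/tmp/x05b85461/R/.val/feather?compression=lz4"
)
```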
The albertkanaal request takes just under 4GB to run on its own:
| expression | min | median | itr/sec | mem_alloc | gc/sec | n_itr | n_gc | total_time | result | memory |
|---|---|---|---|---|---|---|---|---|---|---|
| `albertkanaal <…` | 53.3s | 53.3s | 0.0188 | 3.62GB | 0.131 | 1 | 7 | 53.3s | `<tibble>` | `<Rprofmem>` |
This is without writing it to rds.
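For context, the figure above is `bench::mark()` output, produced roughly along these lines (a sketch; the exact `get_acoustic_detections()` arguments for the albertkanaal query are placeholders):

```r
library(bench)

# Sketch: profile a single run of the albertkanaal query; mem_alloc and the
# Rprofmem column come from bench's allocation tracking. Arguments are placeholders.
bench::mark(
  albertkanaal <- get_acoustic_detections(acoustic_project_code = "albertkanaal"),
  iterations = 1,
  memory = TRUE
)
```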
Feather uses less memory and is faster, both for reading and writing. But in OpenCPU we can only use it for tabular data (because it passes via `arrow::as_arrow_table()`).
However, `get_acoustic_detections()` fails on the POST request, not the GET request. So while this does save memory and speed things up, it doesn't solve our problem.
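To spell out why: in OpenCPU the function is executed during the POST request, and the serialisation format only matters when the stored result is fetched afterwards with a GET, so switching the GET step to feather can't help if the POST itself already dies. Roughly (a sketch; the library endpoint path and the body arguments are assumptions on my part):

```r
# Sketch of the two-step OpenCPU call. The endpoint pattern mirrors the tmp/
# URLs used elsewhere in this issue; body arguments are placeholders.

# 1. POST runs the function server-side; this is the step that dies on large queries.
post_resp <- httr::POST(
  "https://opencpu.lifewatch.be/library/etnservice/R/get_acoustic_detections",
  body = list(animal_project_code = "2014_demer"),
  encode = "json"
)
temp_key <- httr::headers(post_resp)[["x-ocpu-session"]]

# 2. GET fetches the stored result; only here does the output format
#    (rds, feather, ...) come into play.
get_resp <- httr::GET(
  glue::glue("https://opencpu.lifewatch.be/tmp/{temp_key}/R/.val/feather")
)
```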
An alternative: writing to local files on the OpenCPU server, fetching the temp key, somehow listing these files, and then fetching them via GET requests over the OpenCPU files API (not documented). Example request: `curl https://cloud.opencpu.org/ocpu/tmp/x05b85461/files/ch01.pdf`
Realistically, once we have a URL, I could read it directly with `arrow::read_feather()`; however, I'm still not sure where exactly OpenCPU is failing, maybe before it can even store the object in memory?
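On the client side, a feather counterpart to the `get_val_rds()` helper shown further down could look roughly like this (a sketch: `get_val_feather` is a hypothetical name, the `/feather` output format assumes a recent enough OpenCPU, and I'm assuming `arrow::read_feather()` accepts the raw response body; writing to a tempfile first would be the conservative route):

```r
get_val_feather <- function(temp_key, api_domain = "https://opencpu.lifewatch.be") {
  # request the stored session value in feather (Arrow IPC) format
  response <- httr::RETRY(
    verb = "GET",
    url = glue::glue(
      "{api_domain}",
      "tmp/{temp_key}/R/.val/feather",
      .sep = "/"
    ),
    times = 5
  )
  httr::stop_for_status(response)
  # recent arrow versions can read a raw vector directly, so no
  # rawConnection()/gzcon() juggling as in the rds version
  arrow::read_feather(httr::content(response, as = "raw"))
}
```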
Benchmarking feather retrieval vs rds:
Feather is slightly faster and uses way less memory on the client, and also on the service side. It does introduce an extra dependency.
It's probably a good idea to repeat the test with a truly big dataset.
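The comparison itself can be run along these lines (a sketch; `get_val_feather` is the hypothetical helper sketched above, `get_val_rds` is the helper shown further down, and `temp_key` comes from an earlier POST request):

```r
# Sketch: fetch the same stored result ten times per format and compare.
bench::mark(
  feather = get_val_feather(temp_key),
  rds     = get_val_rds(temp_key),
  iterations = 10,
  check = FALSE  # don't require byte-identical results across the two formats
)
```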
$ expression <bch:expr> <get_acoustic_detections(animal_…
$ min <bch:tm> 8.89s
$ median <bch:tm> 9.3s
$ `itr/sec` <dbl> 0.1075611
$ mem_alloc <bch:byt> 31.3MB
$ `gc/sec` <dbl> 0.1398294
$ n_itr <int> 10
$ n_gc <dbl> 13
$ total_time <bch:tm> 1.55m
$ result <list> [<tbl_df[236920 x 20]>]
$ memory <list> [<Rprofmem[123 x 3]>]
$ time <list> <8.89s, 9.02s, 9.67s, 9.34s, 9.22s…
$ gc <list> [<tbl_df[10 x 3]>]
$ expression <bch:expr> <get_acoustic_detections(animal_…
$ min <bch:tm> 9.33s
$ median <bch:tm> 10.5s
$ `itr/sec` <dbl> 0.09537067
$ mem_alloc <bch:byt> 138MB
$ `gc/sec` <dbl> 0.1525931
$ n_itr <int> 10
$ n_gc <dbl> 16
$ total_time <bch:tm> 1.75m
$ result <list> [<tbl_df[236920 x 20]>]
$ memory <list> [<Rprofmem[1834 x 3]>]
$ time <list> <10.27s, 11.19s, 10.92s, 10.52s, 1…
$ gc <list> [<tbl_df[10 x 3]>]
Repeating the test with a truly big dataset:
$ expression <bch:expr> <get_acoustic_detections(animal…
$ min <bch:tm> 1.18m
$ median <bch:tm> 1.2m
$ `itr/sec` <dbl> 0.01372749
$ mem_alloc <bch:byt> 485MB
$ `gc/sec` <dbl> 0.04118246
$ n_itr <int> 10
$ n_gc <dbl> 30
$ total_time <bch:tm> 12.1m
$ result <list> [<tbl_df[3500649 x 20]>]
$ memory <list> [<Rprofmem[2601 x 3]>]
$ time <list> <1.27m, 1.25m, 1.24m, 1.22m, 1.21…
$ gc <list> [<tbl_df[10 x 3]>]
Had a child process die while testing the rds version of the query above:
```r
get_val_rds <- function(temp_key, api_domain = "https://opencpu.lifewatch.be") {
  # request data and open connection
  response_connection <- httr::RETRY(
    verb = "GET",
    url = glue::glue(
      "{api_domain}",
      "tmp/{temp_key}/R/.val/rds",
      .sep = "/"
    ),
    times = 5
  ) %>%
    httr::content(as = "raw") %>%
    rawConnection()
  # read connection
  api_response <- response_connection %>%
    gzcon() %>%
    readRDS()
  # close connection
  close(response_connection)
  # Return OpenCPU return object
  return(api_response)
}
```
error:

```
Error: child process has died
In call:
tryCatch({
    if (length(priority))
        setpriority(priority)
    if (length(rlimits))
        set_rlimits(rlimits)
    if (length(gid))
        setgid(gid)
    if (length(uid))
        setuid(uid)
    if (length(profile))
        aa_change_profile(profile)
    if (length(device))
        options(device = device)
    graphics.off()
    options(menu.graphics = FALSE)
    serialize(withVisible(eval(orig_expr, parent.frame())), NULL)
}, error = function(e) {
    old_class <- attr(e, "class")
    structure(e, class = c(old_class, "eval_fork_error"))
}, finally = substitute(graphics.off()))
```
I was able to get it to run after restarting, but there does seem to be instability. I can't exclude that this instability exists in the feather version; it might just be a coincidence that it happened during rds testing.
Tried Google protobuf as implemented in {protolite}: 45117abd081661f30c091f99a8adbe7ebb3535a2
As explained in the arrow FAQ (https://arrow.apache.org/faq/#how-does-arrow-relate-to-protobuf), protobuf is less suitable for large file transfers. I also noticed it filling my swap file and crashing on large dataset transfers.
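For the record, the client-side fetch tried in that commit presumably boiled down to something like this (a sketch; the `/pb` output format and the helper name are assumptions on my part):

```r
get_val_pb <- function(temp_key, api_domain = "https://opencpu.lifewatch.be") {
  # fetch the stored value serialised as a protobuf message and decode it with
  # protolite; the whole message has to sit in memory as a raw vector
  response <- httr::RETRY(
    verb = "GET",
    url = glue::glue("{api_domain}/tmp/{temp_key}/R/.val/pb"),
    times = 5
  )
  protolite::unserialize_pb(httr::content(response, as = "raw"))
}
```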
I feel this avenue is not worth it. I'll stick to Apache Arrow.
Currently the `get_val()` helper supports fetching from OpenCPU as JSON or RDS. In #323, Stijn found that we are crashing the session by running out of memory, possibly during a serialisation step. I believe `base::saveRDS()` -> `base::serialize()` might be the cause of this memory usage, assuming the crash happens when the object is written as RDS to the output stream. I've not been able to replicate this issue locally or on the RStudio Server. There are a few open issues on opencpu for child processes that died:
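The reasoning behind that suspicion: `serialize(..., NULL)` (visible in the eval_fork traceback above) materialises a raw vector containing a full serialised copy of the result next to the object itself, so peak memory is roughly doubled. A quick local check could look like this (a sketch, with a dummy data frame standing in for a real query result):

```r
# Sketch: serialising to NULL returns a raw vector, i.e. a second full copy of
# the object held in memory at the same time as the original.
x <- data.frame(a = runif(1e7), b = runif(1e7))  # dummy stand-in for a big result
object.size(x)              # size of the object itself
length(serialize(x, NULL))  # bytes of the serialised copy that exists alongside it
```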
I did a quick local test on a deployments table to see if outputting as feather or parquet might help:
It looks like both are faster on my system; I have not benchmarked memory usage yet.
This is using lz4 compression for feather, snappy for parquet and gzip for rds.
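Reconstructed, that local test was roughly of this shape (a sketch; `deployments` stands in for the deployments table and only write timings are compared):

```r
library(arrow)

# Sketch: compare write timings for the three formats, using the compression
# settings mentioned above.
bench::mark(
  feather = write_feather(deployments, "deployments.feather", compression = "lz4"),
  parquet = write_parquet(deployments, "deployments.parquet", compression = "snappy"),
  rds     = saveRDS(deployments, "deployments.rds", compress = "gzip"),
  check = FALSE  # the three expressions return different things
)
```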
Stijn proposes using an alternative fetch method: returning the session id and writing out paged result objects to the session dir, then having the client fetch these objects and serialise them on the client. This ties in to an existing paging branch, but Stijn mentioned this will probably require some optimisation on the database so we have a nice column to sort on.
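Server side, the chunked/paged variant could look roughly like this (purely a sketch: `get_acoustic_detections_paged` and `get_acoustic_detections_page` are hypothetical `etnservice` functions, and it relies on OpenCPU exposing files written to the session working directory via the undocumented files API mentioned above):

```r
# Hypothetical etnservice function: run the query in pages and write each page
# as a feather file into the OpenCPU session working directory, returning only
# the file names. The client can then fetch each file over the files API.
get_acoustic_detections_paged <- function(..., page_size = 500000) {
  offset <- 0
  files <- character(0)
  repeat {
    # hypothetical paged query with limit/offset
    page <- get_acoustic_detections_page(..., limit = page_size, offset = offset)
    if (nrow(page) == 0) break
    file <- sprintf("detections_%06d.feather", length(files) + 1)
    arrow::write_feather(page, file, compression = "lz4")
    files <- c(files, file)
    offset <- offset + page_size
  }
  files
}
```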
to benchmark:
Blockers / Action Points

- [ ] Install `arrow` on Lifewatch RStudio (requires gcc >7, currently 5.4.0)
- [ ] Implement query paging on `etnservice`

Optional:

- [ ] Implement chunked writing to file on `etnservice`
- [ ] Switch to using the File API instead of the Object API for fetching files to the client