inbo / etn

R package to access data from the European Tracking Network
https://inbo.github.io/etn/
MIT License

Downloading data: child process has died #323

Open peterdesmet opened 1 month ago

peterdesmet commented 1 month ago

I got the following error when trying to download the largest dataset I know:

```r
> download_acoustic_dataset(animal_project_code = "2013_albertkanaal")
Downloading data to directory `2013_albertkanaal`:
* (1/6): downloading animals.csv
* (2/6): downloading tags.csv
* (3/6): downloading detections.csv
Error: child process has died
```

In call:
```r
tryCatch({
    if (length(priority))
        setpriority(priority)
    if (length(rlimits))
        set_rlimits(rlimits)
    if (length(gid))
        setgid(gid)
    if (length(uid))
        setuid(uid)
    if (length(profile))
        aa_change_profile(profile)
    if (length(device))
        options(device = device)
    graphics.off()
    options(menu.graphics = FALSE)
    serialize(withVisible(eval(orig_expr, parent.frame())), NULL)
}, error = function(e) {
    old_class <- attr(e, "class")
    structure(e, class = c(old_class, "eval_fork_error"))
}, finally = substitute(graphics.off()))
```

This type of time-out is expected when using the API. Is there an option to catch these errors and suggest something more helpful?
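For example, something along these lines could turn the opaque failure into actionable advice (a rough sketch; whether the `eval_fork_error` class from the traceback survives to the client session is an assumption):

```r
# Sketch: catch the opaque OpenCPU failure and re-raise it with a hint.
# The "eval_fork_error" class is taken from the traceback above; matching
# on the message is a fallback in case the class does not propagate.
safe_get_detections <- function(...) {
  tryCatch(
    get_acoustic_detections(...),
    error = function(e) {
      if (inherits(e, "eval_fork_error") ||
          grepl("child process has died", conditionMessage(e))) {
        stop(
          "The API timed out on a large download. ",
          "Consider using a local database connection (`api = FALSE`).",
          call. = FALSE
        )
      }
      stop(e)
    }
  )
}
```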

PietrH commented 1 month ago

I get exactly the same error, at the same stage. I suspect the failure is actually at:

```r
get_acoustic_detections(animal_project_code = "2013_albertkanaal")
```

I'm getting HTTP 502 (request failed) responses on the above call. This might be fixed with paging, as implemented in https://github.com/inbo/etn/tree/paging:

https://github.com/inbo/etn/blob/801230693f62c660a86c4344708dee55639b4d2b/R/utils.R#L99-L159

Paging comes at a significant cost: not only the extra IO operations, but also having to either rely on readr's parsing or store the column mapping somewhere and reapply it. I would like to avoid having to COUNT the size of a return object before deciding whether or not to page, and I think leaving the choice up to the user is not so friendly either.
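For reference, the client-side loop could look roughly like this (a sketch only; the `limit`/`offset` arguments and the page size are assumptions, not something the current API accepts):

```r
# Hypothetical paged download: fetch fixed-size pages until a short page
# signals the end, then bind the pages together.
get_detections_paged <- function(animal_project_code, page_size = 50000) {
  pages <- list()
  offset <- 0
  repeat {
    page <- get_acoustic_detections(
      animal_project_code = animal_project_code,
      limit = page_size,   # assumed parameter
      offset = offset      # assumed parameter
    )
    pages[[length(pages) + 1]] <- page
    if (nrow(page) < page_size) break
    offset <- offset + page_size
  }
  dplyr::bind_rows(pages)
}
```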

I'm thinking about it. In any case, this might have to be fixed on the etnservice side.


`etn::get_acoustic_detections(animal_project_code = "2013_albertkanaal", api = FALSE)` does work

PietrH commented 1 month ago

Because it's a gateway error, I've contacted Stijn to see what he can see on his side.

I don't think the object is too big to pass over the API, especially compressed. I don't think server-side paging will fix this, but client-side paging might, although with very significant overhead (because we'd need to implement sorting, or maybe use R sessions to fetch from OpenCPU...).
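If we do page on the client, keyset (seek) pagination would keep the sorting cost manageable: each page seeks past the last key seen instead of re-sorting the whole result with OFFSET. A sketch over a DBI connection, assuming a unique `detection_id` key; the table and column names are illustrative, not the real schema:

```r
# Sketch of keyset pagination: pages are selected by "key greater than the
# last seen key", so the database can walk an index instead of sorting and
# skipping rows. Table and column names are assumptions.
fetch_detections_keyset <- function(con, project, page_size = 50000) {
  pages <- list()
  last_id <- 0
  repeat {
    page <- DBI::dbGetQuery(
      con,
      "SELECT * FROM acoustic.detections
       WHERE animal_project_code = $1 AND detection_id > $2
       ORDER BY detection_id
       LIMIT $3",
      params = list(project, last_id, page_size)
    )
    if (nrow(page) == 0) break
    pages[[length(pages) + 1]] <- page
    last_id <- max(page$detection_id)
  }
  dplyr::bind_rows(pages)
}
```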

PietrH commented 1 month ago

502 errors are due to Nginx (opencpu-cache); I've forwarded this information to Stijn. We'll need to look into the admin logs for more info.


Stijn-VLIZ commented 1 month ago

I tried many different things, but my conclusion is that we are running into limits here. The screenshot below shows the memory usage of a local Docker container running etnservice. The function `get_acoustic_detections()` was altered so that no ordering is done and the data frame is emptied before being serialized, so the memory shown is for doing the query only.

[screenshot: memory usage of the local etnservice Docker container]

This also runs for 9 minutes. What I propose is indeed pagination; I will first investigate the possibilities on our side (the database).
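One direction I'm considering on the database side is a cursor-style chunked fetch, so the full data frame is never held in memory at once (a sketch with DBI and readr; the connection, query, and chunk size are illustrative):

```r
# Sketch: stream the result set in chunks and write each chunk to disk,
# keeping peak memory at one chunk instead of the full data frame.
# `con` is an open DBI connection; `detections_query` is the assumed SQL.
res <- DBI::dbSendQuery(con, detections_query)
first <- TRUE
while (!DBI::dbHasCompleted(res)) {
  chunk <- DBI::dbFetch(res, n = 100000)
  readr::write_csv(chunk, "detections.csv", append = !first)
  first <- FALSE
}
DBI::dbClearResult(res)
```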

PietrH commented 1 month ago

If this is the case, why does the query work when using a local database connection? `get_acoustic_detections(animal_project_code = "2013_albertkanaal", api = FALSE)`

Stijn-VLIZ commented 1 month ago

That's a question of how OpenCPU works. OpenCPU starts a new R session on the server and then runs the etn package. It also creates a session of its own where OpenCPU stores information about your request and your result. How and why this impacts memory so heavily, I don't know.

There might be a solution in an async worker doing the query, writing the result to a file, and then returning that. In that case you could use the async endpoint and check when the data is ready.
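From the client, that flow could look roughly like this (purely illustrative; none of these endpoints exist yet, and the URLs and status values are made up):

```r
# Hypothetical async flow: submit the job, poll until the server reports
# the export is ready, then download the finished file.
submit <- httr::POST(
  "https://opencpu.example.org/etnservice/R/start_export",  # assumed endpoint
  body = list(animal_project_code = "2013_albertkanaal")
)
job_url <- httr::headers(submit)$location
repeat {
  status <- httr::content(httr::GET(paste0(job_url, "/status")))
  if (identical(status, "ready")) break  # assumed status value
  Sys.sleep(10)
}
httr::GET(
  paste0(job_url, "/file"),
  httr::write_disk("detections.csv", overwrite = TRUE)
)
```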

PietrH commented 1 month ago

I'm not sure if OpenCPU supports async requests. I agree that async requests would be the best solution for big datasets.

  1. Currently, if I make changes, I'm working directly on a live environment that has some (beta) users. Especially towards the future, how can I experiment with fixes without affecting the live API that people are using?
  2. Is it possible the request succeeds on a local database connection simply because the RStudio Server has much more memory? In that case, optimizing the query might be the answer.