peterdesmet opened 1 month ago
I get exactly the same error, at the same stage. I suspect the failure is actually at:
get_acoustic_detections(animal_project_code = "2013_albertkanaal")
I'm getting HTTP 502 "request failed" errors on the above call. This might be fixed with paging in https://github.com/inbo/etn/tree/paging:
https://github.com/inbo/etn/blob/801230693f62c660a86c4344708dee55639b4d2b/R/utils.R#L99-L159
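For reference, a rough idea of what client-side paging could look like on top of the current function; the `limit`/`offset` arguments are hypothetical and don't exist in the released function:

```r
# Rough sketch of client-side paging; the limit/offset arguments are
# hypothetical and do not exist in the released etn function.
library(dplyr)

get_detections_paged <- function(animal_project_code, page_size = 50000) {
  pages <- list()
  offset <- 0
  repeat {
    page <- etn::get_acoustic_detections(
      animal_project_code = animal_project_code,
      limit = page_size,   # hypothetical argument
      offset = offset      # hypothetical argument
    )
    if (nrow(page) == 0) break
    pages[[length(pages) + 1]] <- page
    offset <- offset + page_size
  }
  bind_rows(pages)
}
```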
Paging comes at a significant cost: not only the extra IO operations, but also having to either rely on readr's parsing or store the column mapping somewhere and reapply it. I would like to avoid having to COUNT the size of a return object before deciding whether to page, and I think leaving the choice up to the user is not very friendly either. A rough illustration of that COUNT-then-decide branching is sketched below.
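```r
# Illustration of the COUNT-then-decide branching I'd rather avoid;
# count_acoustic_detections() is a hypothetical helper, not part of etn.
fetch_detections <- function(animal_project_code, page_threshold = 100000) {
  n <- count_acoustic_detections(animal_project_code)
  if (n <= page_threshold) {
    etn::get_acoustic_detections(animal_project_code = animal_project_code)
  } else {
    get_detections_paged(animal_project_code)  # paged variant, see earlier sketch
  }
}
```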
I'm thinking about it. In any case, this might have to be fixed on the etnservice side.
etn::get_acoustic_detections(animal_project_code = "2013_albertkanaal", api = FALSE)
does work
Because it's a gateway error, I've contacted Stijn to see what he can see on his side.
I don't think the object is too big to pass over the API, especially compressed. I don't think server-side paging will fix this, but client-side paging might, although with a very significant overhead (because we'd need to implement sorting, or maybe use R sessions to fetch from OpenCPU...).
The 502 errors come from Nginx (opencpu-cache); I've forwarded this information to Stijn. We'll need to look into the admin logs for more info.
I tried many different things, but my conclusion is that we are running into limits here. The image shows the memory usage of a local Docker container running etnservice. The function get_acoustic_detections was altered so that no ordering is done and the data frame is emptied before being serialized, so the memory shown is for running the query only. This also runs for 9 minutes. What I propose is indeed pagination; I will first investigate the possibilities on our side (the database).
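For what it's worth, a rough sketch of what keyset pagination against the database could look like; the view name and the `id_pk` column are assumptions, not the actual ETN schema (assuming an RPostgres connection):

```r
# Illustrative keyset pagination against the database; the view and the
# id_pk column are assumptions, not the actual ETN schema.
library(DBI)

fetch_detections_keyset <- function(con, project_code, page_size = 100000) {
  pages <- list()
  last_id <- 0
  repeat {
    page <- dbGetQuery(
      con,
      "SELECT *
         FROM acoustic.detections_view
        WHERE animal_project_code = $1
          AND id_pk > $2
        ORDER BY id_pk
        LIMIT $3",
      params = list(project_code, last_id, page_size)
    )
    if (nrow(page) == 0) break
    last_id <- max(page$id_pk)
    pages[[length(pages) + 1]] <- page
  }
  do.call(rbind, pages)
}
```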
If this is the case, why does the query work when using a local database connection? get_acoustic_detections(animal_project_code = "2013_albertkanaal", api = FALSE)
That's a question of how OpenCPU works. OpenCPU starts a new R session on the server and then runs the etn package. It also creates a session of its own where OpenCPU stores information about your request and your result. How and why it impacts memory so hard, I don't know.
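For context, a plain OpenCPU call does roughly the following under the hood; the base URL below is a placeholder and the exact arguments of the etnservice function may differ:

```r
# Sketch of the OpenCPU call pattern: the POST runs the function in a fresh
# R session on the server, the result is kept in a temporary session and
# fetched with a separate GET. The base URL is a placeholder.
library(httr)

base_url <- "https://opencpu.example.org/ocpu"  # placeholder host

# 1. Run the function; string arguments are passed as quoted R expressions.
res <- POST(
  paste0(base_url, "/library/etnservice/R/get_acoustic_detections"),
  body = list(animal_project_code = '"2013_albertkanaal"')
)
stop_for_status(res)

# 2. The Location header points to the temporary session holding the result.
session_url <- headers(res)$location

# 3. Download the serialized return value and read it back into R.
val <- GET(paste0(session_url, "R/.val/rds"))
tmp <- tempfile(fileext = ".rds")
writeBin(content(val, as = "raw"), tmp)
detections <- readRDS(tmp)
```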
There might be a solution in an async worker doing the query, writing it to a file and then returning that. In that case you can use the async endpoint and check when the data is ready.
I'm not sure if OpenCPU supports async requests. I agree that async requests would be the best solution for big datasets.
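If such an async endpoint were ever added, the client side could look roughly like this; every endpoint and function name below is hypothetical:

```r
# Conceptual sketch of the async pattern: submit the query, poll until the
# server-side worker has written the result to a file, then download it.
# Every endpoint and function name below is hypothetical.
library(httr)

base_url <- "https://opencpu.example.org"  # placeholder host

submit <- POST(
  paste0(base_url, "/ocpu/library/etnservice/R/submit_detections_export"),
  body = list(animal_project_code = '"2013_albertkanaal"')
)
job_id <- content(submit, as = "text")

repeat {
  status <- GET(paste0(base_url, "/exports/", job_id, "/status"))
  if (content(status, as = "text") == "ready") break
  Sys.sleep(30)  # poll every 30 seconds
}

detections <- readr::read_csv(paste0(base_url, "/exports/", job_id, "/data.csv"))
```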
I got the following error when trying to download the largest dataset I know:
This type of time-out is expected when using the API. Is there an option to catch them and suggest something more helpful?
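As a rough idea, a wrapper like the one below could translate such failures into a friendlier message; this is illustrative only, not the package's actual error handling:

```r
# Illustrative wrapper that turns gateway errors/time-outs into a friendlier
# message pointing to the local-database fallback.
safe_get_detections <- function(...) {
  tryCatch(
    etn::get_acoustic_detections(...),
    error = function(e) {
      if (grepl("502|504|timed? ?out", conditionMessage(e), ignore.case = TRUE)) {
        stop(
          "The request to the ETN API failed or timed out, which can happen ",
          "for large queries. Consider narrowing the query or, if you have ",
          "direct database access, retrying with api = FALSE.",
          call. = FALSE
        )
      }
      stop(e)
    }
  )
}
```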