Thanks Barret for posting to GitHub. I'd appreciate any advice on hosting a plumber API on a Windows server (based on my most recent post above) where it can leverage all available cores to allow parallel calls. Thanks for your help!
When running your example script within a machine learning (ML) workflow, the "status" endpoint stops responding immediately after one or two tries, because R is busy running the ML code once the 10-second delay has elapsed. I think this could be fixed if I moved my API into a more scalable hosting environment (instead of running just a single instance on one core).
I've found that staying within R as much as possible works the smoothest.
I would look into using `callr::r_bg()`. Assuming your ML process is in R, `callr` can launch a background process. If you save the result of `proc <- callr::r_bg(...)`, you can inspect it to see if the process has finished (`proc$is_alive()`) and get its result (`proc$get_result()`). The result could be a file or the actual R data returned from the model. I suggest writing the results to disk within the background process to make retrieval possible even if the R process dies.
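A minimal sketch of that lifecycle, assuming the model-fitting code can be wrapped in a self-contained function and the result is also written to an RDS file (the `lm()` call and `result.rds` are placeholders for the real ML code, not part of the original example):

```r
# launch the ML work in a background R process; the function runs in a
# fresh session, so everything it needs is passed in through `args`
proc <- callr::r_bg(
  func = function(train_data, out_file) {
    fit <- lm(mpg ~ wt, data = train_data)  # stand-in for the real ML code
    saveRDS(fit, out_file)                  # persist the result to disk
    fit                                     # also returned to the parent
  },
  args = list(train_data = mtcars, out_file = "result.rds")
)

# later, e.g. from a /status endpoint:
if (proc$is_alive()) {
  "still running"
} else {
  fit <- proc$get_result()  # or readRDS("result.rds") if the parent restarted
}
```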
Having a background R process per job could be a blessing and a curse, as you will launch an independent R process for each new job. This can be bad if you launch too many processes. However, you could add queueing logic to your API to prevent machine overload.
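For example, the launching endpoint could refuse new work once a certain number of background processes are still alive; `jobs`, `max_jobs`, and `run_model` below are illustrative names, not part of the original example:

```r
jobs <- list()   # id -> callr process handle
max_jobs <- 4    # illustrative cap on concurrent background jobs

#' @get /begin
function(req, res) {
  running <- sum(vapply(jobs, function(p) p$is_alive(), logical(1)))
  if (running >= max_jobs) {
    res$status <- 503   # too busy; ask the caller to retry later
    return("server busy, please try again later")
  }
  id <- as.character(length(jobs) + 1)                  # trivial id scheme
  jobs[[id]] <<- callr::r_bg(run_model, args = list())  # run_model is a placeholder
  id
}
```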
Do you have any advice on how to implement the R-based wrapper for Windows?
@shrektan will have more advice here, but I strongly recommend running your plumber instance within Docker. While plumber may work on Windows, we do not actively support it.
Hosting plumber on Windows by using the R session directly is OK, but remember that `fork` is not supported on Windows, so for parallel computing, copying R objects is inevitable. As for scaling up, I'm sure there are ways to do it, but that's beyond my knowledge; for me, it's limited to "async/parallel computing on the local machine".
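For instance, with the base `parallel` package, Windows cannot use fork-based workers, so shared objects have to be copied into each worker of a PSOCK cluster; a small illustration (not plumber-specific):

```r
library(parallel)

big_data <- mtcars   # stands in for a large in-memory object

# fork-based workers (Linux/macOS only) share the parent's memory:
# res <- mclapply(1:4, function(i) nrow(big_data), mc.cores = 4)

# on Windows, a PSOCK cluster is used instead and objects must be copied over:
cl <- makeCluster(2)
clusterExport(cl, "big_data")   # explicit copy into every worker
res <- parLapply(cl, 1:4, function(i) nrow(big_data))
stopCluster(cl)
```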
Docker is natively supported on Windows 10. It may be a better option because Docker containers are easier to manage and scale. In addition, `fork` is supported on Linux. With all the mature tools in the Docker ecosystem, I think it's easier to find whatever you need. Moreover, it's important to be able to duplicate the deployment environment, which is not easy to do directly on Windows, while Docker was born for exactly this.
I don't have enough experience personally with `callr` and `plumber` in long-running processes, so this might be just a matter of "test it and see".
Using `callr::r_bg` seems reasonable. What happens if/when the main R process is interrupted?
Use case: if an API endpoint accepts some unique ID as a parameter (for which only one execution should occur at a time), then we can store that ID within your `work_queue` list. Subsequent calls with the same ID can then self-determine that they don't need to start a new job, optionally redirecting to a "status" endpoint as you've already demonstrated above.
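Roughly like this, reusing the `work_queue` idea from the example further down; the `/begin/<id>` route and the 303 redirect are assumptions for illustration:

```r
work_queue <- list()   # id -> NULL while running, the result once finished

#' @get /begin/<id>
function(id, res) {
  if (id %in% names(work_queue)) {
    # a job with this ID already exists: don't start another one,
    # just send the caller to its status endpoint
    res$status <- 303
    res$setHeader("Location", paste0("/status/", id))
    return(id)
  }
  work_queue[id] <<- list(NULL)   # mark the ID as "in progress"
  # ... launch the actual work here (later::later(), callr::r_bg(), etc.)
  res$status <- 202
  res$setHeader("Location", paste0("/status/", id))
  id
}
```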
Then, if the main R process crashes or is restarted (e.g., an updated deployment), my guess is that the child processes would either be interrupted (killed) or orphaned (output goes nowhere).
I'd think that the "orphaned" outcome depends on what the background process does. If it works by side effect (e.g., inserting data into a database), then it will likely do its thing, but nothing is notified on exit unless/until somebody checks whether the side effect is done (e.g., queries the database). However, if in the meantime another caller hits the API with that same ID, then ... it is started again.
Any thoughts on external IPC? I can see utility in filesystem- or NoSQL-based (Redis?) operations, where we might still be able to use `callr::r_bg` but its information is actually stored elsewhere ... in which case it might be possible for different endpoints to all "know" about that process.
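As a rough filesystem-based sketch (a shared `jobs/` directory is an assumption here; Redis or a database would follow the same pattern): the background process writes its result to a file keyed by the job ID, and any endpoint, in any R process, can look it up:

```r
jobs_dir <- "jobs"   # directory visible to every process involved (assumption)
dir.create(jobs_dir, showWarnings = FALSE)

# launching side: the background process owns the result file for this ID
callr::r_bg(
  function(id, jobs_dir) {
    Sys.sleep(10)   # stand-in for the real work
    saveRDS(list(done = TRUE, value = mtcars),
            file.path(jobs_dir, paste0(id, ".rds")))
  },
  args = list(id = "abc123", jobs_dir = jobs_dir)
)

# any endpoint, even one running in a different R process, can check on it
status_of <- function(id) {
  path <- file.path(jobs_dir, paste0(id, ".rds"))
  if (!file.exists(path)) return(list(done = FALSE))
  readRDS(path)
}
```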
Thoughts?
I believe the original intent was to turn a long-running process into something that can be inspected for a status and a result. The trick is to offload the work somewhere other than the main R thread.
There are definitely many different approaches and considerations to be aware of when offloading processing to somewhere other than the main R thread. (Similar to the communication issues that can occur with distributed databases as compared to accessing a `data.frame()`.)
Going to close the issue for now, as `plumber` will not implement a general solution for this.
Copying an email thread here to continue the discussion:
@mftokic - Oct 7, 2019
@schloerke
@mftokic
@schloerke
`tokic.R`
```r
# run in R session: plumber::plumb("tokic.R")$run(port = 12345)
# visit in browser: 127.0.0.1:12345/begin

work_queue <- list()

#' @get /begin
function(req, res) {
  # get unique id
  while (
    (id <- paste0(sample(letters, 8, replace = TRUE), collapse = "")) %in% names(work_queue)
  ) {
    TRUE
  }

  # initiate work in separate thread
  work_queue[id] <<- list(NULL)
  later::later(
    function() {
      idx <- sample(1:3, 1)
      work_queue[[id]] <<- list(iris, mtcars, Titanic)[[idx]]
    },
    10 # wait 10 seconds
  )

  # redirect to status
  res$status <- 202
  res$setHeader("Location", paste0("/status/", id))
  # res$setHeader("retry-after", 2) # didn't know if it was seconds or milliseconds
  id
}

#' Poll the work queue
#'
#' @html
#' @get /status/<id>
```
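The status handler is cut off above; a minimal sketch of how it might continue, assuming it simply returns the stored result once the `later::later()` callback has filled in `work_queue[[id]]` (the HTML strings and status codes here are illustrative):

```r
function(id, res) {
  if (!(id %in% names(work_queue))) {
    res$status <- 404
    return("<p>Unknown job id.</p>")
  }
  result <- work_queue[[id]]
  if (is.null(result)) {
    res$status <- 202   # accepted, still working
    return("<p>Still working, check back shortly.</p>")
  }
  # finished: show which dataset the fake "job" produced
  paste0("<pre>", paste(capture.output(str(result)), collapse = "\n"), "</pre>")
}
```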