ropensci / ruODK

ruODK: An R Client for the ODK Central API
https://docs.ropensci.org/ruODK/
GNU General Public License v3.0

Selecting one submission only #72

Closed dmenne closed 4 years ago

dmenne commented 4 years ago

Is there a way to select one submission only, e.g. by using a mongo-style query?

dmenne commented 4 years ago

The docs say: "Currently, there are no paging or filtering options, so listing Submissions will get you every Submission in the system, every time."

Sigh... Back to Kobo.

florianm commented 4 years ago

Hang on, there is a way! While the ODK Central docs indeed mention some unfiltered endpoints, these are so fast (compared to the overhead of individual HTTP requests) that you might as well download everything, then filter locally.

The ODK Central API endpoint behind ruODK::submission_list(), which I think you are referring to, returns only submission metadata, not the full submission data. So, not to worry: this endpoint is so fast that you can download all submission IDs (plus metadata) at once!

I'd say for your use case, ruODK offers two options. The third option, the full export as one ZIP (including attachments), is the opposite of what you're asking for here.

Option 1: OData. Being an API wrapper for the ODK Central API, ruODK is limited to the filtering / sorting / limiting options implemented by ODK Central. Getting all submissions without attachments via odata_submission_get(download = FALSE) is pretty quick. Downloading everything at once and then processing / filtering locally is much faster than querying the dataset multiple times or repeatedly downloading subsets of submissions (a sketch follows below the quoted docs). In one of my production use cases, 40k records (incl. attachments) are downloaded, multiple forms are stored as one named list in R, which is then used as packaged data for an RShiny app, filtered programmatically (by capture location), and processed into HTML reports using one parameterised Rmd template. The OData Document API docs back this approach:

> The $top and $skip querystring parameters, specified by OData, apply limit and offset operations to the data, respectively. The $count parameter, also an OData standard, will annotate the response data with the total row count, regardless of the scoping requested by $top and $skip. While paging is possible through these parameters, it will not greatly improve the performance of exporting data. ODK Central prefers to bulk-export all of its data at once if possible.
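A minimal sketch of that download-all-then-filter pattern (assuming defaults are set via ru_setup(); the service URL and the field name patient_id are made up for illustration):

```r
library(dplyr)

# Point ruODK at the form's OData service URL (made-up URL) and credentials
ruODK::ru_setup(
  svc = "https://central.example.org/v1/projects/1/forms/my-form.svc",
  un = Sys.getenv("ODKC_UN"),
  pw = Sys.getenv("ODKC_PW")
)

# Download all submissions in one go, without attachments
subs <- ruODK::odata_submission_get(download = FALSE)

# Filter locally, e.g. down to the single record of interest
one_record <- subs %>% dplyr::filter(patient_id == "P-0042")
```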

Option 2: RESTful API. ruODK::submission_list() gives you the lightweight submission metadata; from there you can pick out and fetch individual submissions, as sketched below.
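A rough sketch of that route, assuming ruODK::submission_list() and ruODK::submission_get() behave as in the current release: list the metadata, pick the instance ID you need, and fetch only that submission.

```r
# List submission metadata only (instance IDs, dates, submitters) - fast
# Column names follow ruODK's cleaned snake_case convention
sl <- ruODK::submission_list()

# Pick one instance ID, e.g. the most recently created submission
iid <- sl$instance_id[which.max(sl$created_at)]

# Fetch just that one submission (returned as a nested list)
one_sub <- ruODK::submission_get(iid = iid)
```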

To help ruODK cover your use case better, can I learn more about it? How many submissions are expected? How would you find the one submission you need to download? Could you download all data and filter/select locally? How many requests need to be sent to how many forms?

dmenne commented 4 years ago

Thanks. Thinking along these lines: keep a local SQLite copy of submissions with indexes on the right columns, and update only those created after the last creation date. Not very nice.
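Roughly something like this, I guess (assuming the rectangled submissions carry their UUID in an id column, the local table already exists, and any list columns are flattened before writing):

```r
library(DBI)
library(RSQLite)
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), "submissions.sqlite")

# Download all submissions (no attachments), then keep only unseen records
subs <- ruODK::odata_submission_get(download = FALSE)
known_ids <- DBI::dbGetQuery(con, "SELECT id FROM submissions")$id
new_subs <- dplyr::filter(subs, !id %in% known_ids)

# Append only the new records to the local copy
DBI::dbWriteTable(con, "submissions", new_subs, append = TRUE)
DBI::dbDisconnect(con)
```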

florianm commented 4 years ago

SQLite - I agree with the split between data clearinghouse (ODK Central) and local db. The reporting job sounds like a complex job with large / long-running components. Possibly a use case for drake? https://docs.ropensci.org/drake/
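A toy drake plan could look roughly like this (target names and the report template are made up; drake only rebuilds targets whose upstream inputs changed):

```r
library(drake)

plan <- drake::drake_plan(
  # The long-running download becomes a cached target
  submissions = ruODK::odata_submission_get(download = FALSE),
  # report.Rmd would loadd(submissions) inside a chunk
  report = rmarkdown::render(
    drake::knitr_in("report.Rmd"),
    output_file = drake::file_out("report.html"),
    quiet = TRUE
  )
)

drake::make(plan)
```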

dmenne commented 4 years ago

I had hoped the JSON would be reasonable enough to use with https://carbone.io/

florianm commented 4 years ago

Carbone looks nice and is even free for up to 100 renders a month! As an old-school R/Rmd monkey I'd download all records via odata_submission_get, create a parameterised Rmd with hard-coded field labels and hints:

```yaml
---
title: "Turtle Nesting Census preview"
date: "`r Sys.time()`"
output:
  html_document:
    toc: true
    toc_depth: 3
    toc_float: true
    fig_width: 10
    fig_height: 6
    code_folding: hide
    theme: lumen
vignette: >
  %\VignetteIndexEntry{turtleviewer}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
params:
  area_name: 'Port Hedland'
---
```

```{r setup, eval=TRUE}
library(turtleviewer)
data("odkc")
x <- wastdr::filter_odkc_turtledata(odkc, area_name = params$area_name)
```

and finally crunch out one report per patient (my example has more turtles than patients)

```r
#' Generate all reports - compile vignette turtledata for all areas
#' @export
generate_all_reports <- function() {
  data("turtledata", package = "turtleviewer")
  for (a in c(unique(turtledata$areas$area_name), "Other")) {
    message(glue::glue("Compiling report for {a}..."))
    fn <- here::here("vignettes", glue::glue("{wastdr::urlize(a)}.html"))
    # Skip reports (here: per area, for you: per patient) that already exist
    if (!fs::file_exists(fn)) {
      message(glue::glue("Rendering preview for {a} to {fn}..."))
      rmarkdown::render(
        "vignettes/turtleviewer.Rmd",
        params = list(area_name = a),
        output_file = fn
      )
      message(glue::glue("Compiled {fn}."))
    }
  }
  fs::dir_ls(here::here("vignettes"), glob = "*.html") %>%
    fs::file_copy(here::here("inst/app/www/"), overwrite = TRUE)
  message("Copied reports to www/.")
}
```

skipping reports that already exist.

dmenne commented 4 years ago

Carbone.js is open source. Only the cloud version is payware, and using it is not possible due to security concerns, even though it is EU-based. For health-related data, everything must stay behind local firewalls. Rmd is my daily tool, so that's what I planned as a first solution. The problem is that each report must be written separately, but maybe I'll find a unifying solution for the not-too-complex forms.

However, there are repeat pages with an unknown number of repeats; I had used KOBO before, which displays these well, but the API simply forgets them.

florianm commented 4 years ago

Great to hear about Carbone.js - I dropped off at the pricing page. For repeating groups, odata_submission_get(table = "Submissions.my_sub_group") gets you separate child tables with one record per repeat, plus the submission UUID as the key to join on (sketch below).
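A small sketch of that join, with a made-up repeat group name (the exact key column name in the child table depends on the ruODK version - check names()):

```r
# Parent table: one row per submission
parents <- ruODK::odata_submission_get(table = "Submissions", download = FALSE)

# Child table: one row per repeat in the (made-up) group "my_sub_group"
repeats <- ruODK::odata_submission_get(
  table = "Submissions.my_sub_group",
  download = FALSE
)

# Join repeats back to their parent submission via the submission UUID;
# the child key is assumed to be called submissions_id here
joined <- dplyr::left_join(repeats, parents, by = c("submissions_id" = "id"))
```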

With a parameterised Rmd, you can churn out a separate PDF per patient record if you use the patient ID as a parameter and filter the data inside the Rmd by that ID. You could also loop over all records in the child tables for a given patient ID inside the Rmd. Would that allow you to create your reports?

dmenne commented 4 years ago

Thanks for the ideas, just trying that now. Carbone would be better, because the details of the form can be handled by a non-programmer in Word.

One more problem, though not related to your package: to create a patient's report the first time, I need a trigger; in KOBO, I can issue a REST call with the id of the new report.

If that does not work, I have to resort to scanning every 5 minutes.

florianm commented 4 years ago

AFAIK there's no subscription to new submissions in ODK Central. Short of a database trigger on ODK Central's PostgreSQL db, this should be raised as a feature request on the ODK Central repo. There are R packages for cron jobs though, so polling every few minutes would be feasible.
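For example, cronR (Linux/macOS; taskscheduleR on Windows) can register an R script that polls for new submissions and renders any missing reports - a rough sketch, with the script path made up:

```r
library(cronR)

# Script that downloads submissions and renders reports for unseen records
cmd <- cronR::cron_rscript("/srv/reports/poll_and_render.R")

# Run it every 5 minutes
cronR::cron_add(
  command = cmd,
  frequency = "*/5 * * * *",
  id = "odk_poll",
  description = "Poll ODK Central and render new reports"
)
```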