ropensci / ruODK

ruODK: An R Client for the ODK Central API
https://docs.ropensci.org/ruODK/
GNU General Public License v3.0

ruODK stops downloading images attached to a form after a few minutes #114

Closed dpagendam closed 3 years ago

dpagendam commented 3 years ago

Problem

We have an ODK Central server up and running. One of the ODK forms contains a field to upload a photograph of a study site. ruODK works perfectly for us to pull our ODK Central data into R, with one exception: downloading the attached images. When using ruODK to pull all of the data for the form with the images, it starts downloading images successfully, but after a couple of hundred images it seems to time out and the R console returns the error:

.....
✔ File saved to "../www/images/1608611394304.jpg".
Request failed [404]. Retrying in 1 seconds...
Error: Problem with mutate() input trap_photo.
x Not Found (HTTP 404). Failed to get desired response from server https://myserver.com as user "myusername".

Reproducible example

Unfortunately, I can't share my server address, username and password to reproduce the error, but hopefully the code below provides some insight into how the data is being extracted. If I set "download = FALSE" in odata_submission_get then all the forms minus the images download fine.

library(ruODK)
library(stringr)
library(tidyverse)
library(sp)
library(rgdal)
library(SDraw)

setwd("~/")

#Initialize the set up for ruODK
tz <- "Asia/Riyadh"
ruODK::ru_setup(
  svc = "https://myserver.com/v1/projects/3/forms/MyProject.svc",
  un = "myusername",
  pw = "mypassword",
  tz = tz,
  verbose = TRUE, # great for demo or debugging,
  url="https://myserver.com",
  retries = 10,
  odkc_version = "1.0"
)

#Store arrays
forms <- form_list()
fid_list <- forms$fid

#Loop through the list of fid's and assign the table to their name
forms$name <- c("Form1")
for(i in 1:length(fid_list)) {
  assign(str_replace_all(forms$name[fid_list[i] == forms$fid], " ", "_"),
         eval(parse(text=paste0("odata_submission_get(fid=\"", forms$fid[fid_list[i] == forms$fid], "\", local_dir='../www/images', download = TRUE)")))
  )
}
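
The loop above builds each call as a string and evaluates it. A minimal sketch of the same idea without assign() and eval(parse()), collecting the results in a named list instead. `fetch_all_forms` and its `fetch` argument are hypothetical names introduced here so the loop can be exercised without a server; the defaults assume the ru_setup() call shown above.

```r
# Hypothetical helper: fetch submissions for every form into a named list.
# `fetch` defaults to ruODK::odata_submission_get but can be swapped for a stub.
fetch_all_forms <- function(forms,
                            local_dir = "../www/images",
                            download = TRUE,
                            fetch = ruODK::odata_submission_get) {
  results <- lapply(forms$fid, function(fid) {
    fetch(fid = fid, local_dir = local_dir, download = download)
  })
  # Name each element after its form, with spaces replaced by underscores.
  names(results) <- gsub(" ", "_", forms$name, fixed = TRUE)
  results
}

# Usage, assuming ru_setup() has been called as above:
# forms <- ruODK::form_list()
# submissions <- fetch_all_forms(forms)
# submissions[["Form1"]]
```

Keeping the tables in one list avoids creating variables whose names depend on the data, and a failing form is easier to isolate by its list index.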

Session Info

R version 3.6.1 Patched (2019-08-07 r76935)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     Mersenne-Twister
 Normal:  Inversion
 Sample:  Rounding

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] SDraw_2.1.13 rgdal_1.4-4 sp_1.4-4 forcats_0.4.0 dplyr_1.0.2 purrr_0.3.4 readr_1.4.0 tidyr_1.1.2 tibble_3.0.4
[10] ggplot2_3.3.2 tidyverse_1.3.0 stringr_1.4.0 ruODK_0.9.6 MASS_7.3-51.5

loaded via a namespace (and not attached):
 [1] nlme_3.1-140 fs_1.5.0 sf_0.9-6 lubridate_1.7.9.2 RColorBrewer_1.1-2 httr_1.4.2 tools_3.6.1
 [8] backports_1.1.10 R6_2.5.0 AlgDesign_1.2.0 rpart_4.1-15 KernSmooth_2.23-15 rgeos_0.5-1 Hmisc_4.2-0
[15] DBI_1.1.0 colorspace_2.0-0 nnet_7.3-12 withr_2.3.0 tidyselect_1.1.0 gridExtra_2.3 curl_4.3
[22] compiler_3.6.1 cli_2.2.0 rvest_0.3.5 htmlTable_1.13.1 xml2_1.3.2 spsurvey_4.1.4 keras_2.2.5.0
[29] scales_1.1.1 checkmate_1.9.4 classInt_0.4-3 tfruns_1.4 crossdes_1.1-1 digest_0.6.27 foreign_0.8-71
[36] base64enc_0.1-3 pkgconfig_2.0.3 htmltools_0.5.0.9003 dbplyr_1.4.2 htmlwidgets_1.5.3 rlang_0.4.9 readxl_1.3.1
[43] rstudioapi_0.13 generics_0.1.0 jsonlite_1.7.2 gtools_3.8.1 tensorflow_2.0.0 acepack_1.4.1 magrittr_2.0.1
[50] Formula_1.2-3 Matrix_1.2-17 Rcpp_1.0.5 munsell_0.5.0 fansi_0.4.1 reticulate_1.13 lifecycle_0.2.0
[57] stringi_1.5.3 whisker_0.4 snakecase_0.11.0 grid_3.6.1 parallel_3.6.1 crayon_1.3.4 deldir_0.1-23
[64] lattice_0.20-38 haven_2.2.0 splines_3.6.1 hms_0.5.3 zeallot_0.1.0 knitr_1.30 pillar_1.4.7
[71] clisymbols_1.2.0 reprex_0.3.0 glue_1.4.2 latticeExtra_0.6-28 data.table_1.12.8 modelr_0.1.5 vctrs_0.3.5
[78] cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1 xfun_0.19 janitor_2.0.1 broom_0.5.4 e1071_1.7-4
[85] class_7.3-15 survival_2.44-1.1 units_0.6-7 cluster_2.1.0 ellipsis_0.3.1

Thank you for providing this excellent package, and thanks in advance for any insights into what might be causing this issue.

florianm commented 3 years ago

Hi @dpagendam, thanks for the detailed error report, and great to see ruODK used here in Oz :-) Apologies for the wording in the issue template - it should mention not to share secrets. Tracked at #115.

By and large, ruODK seems to work without big surprises. Possibly related to this issue:

I've got one form with 40k+ submissions where some photos in a particular nested repeat sometimes don't download. These are the steps I'll try (feel free to try the same on your end and report your findings back here):

florianm commented 3 years ago

I've pushed a minor patch to let attachment_get skip downloading attachments with blank filenames.
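
As an illustration of the idea only (this is not the actual patch to attachment_get), a minimal sketch of filtering out blank filenames before attempting any download. `has_filename` is a hypothetical helper name, and the sample filenames other than the one from the error message above are made up.

```r
# Hypothetical helper, not ruODK's implementation:
# TRUE for filenames that are neither NA nor blank (empty or whitespace only).
has_filename <- function(filenames) {
  !is.na(filenames) & nzchar(trimws(filenames))
}

filenames <- c("1608611394304.jpg", "", NA, "  ")

# Only entries with a usable filename would be passed on to the download step.
to_download <- filenames[has_filename(filenames)]
```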

I'll have to test whether attachment downloads in nested tables work as expected. I can reproduce a situation where attachments in a repeated form group do not download at all. In contrast, the vignette on OData manages to download attachments to repeated form groups (nested tables) completely fine.

Edit: There are gremlins in ruODK's handling of attachments, working on a fix. At this point, this issue looks more likely to be a bug in ruODK rather than data loss in ODK Central.

florianm commented 3 years ago

@dpagendam I've pushed a bugfix to make sure ruODK downloads all attachments from both the main "Submissions" table and any nested subtables ("Submissions.GROUP_NAME"). I'm verifying the bugfix with a full data ETL run later today.

Could you re-install ruODK from the latest main branch (v0.9.7) and see whether you can re-create the download timeout issue? A second guess would be to increase swap and RAM for your server; maybe there's some congestion happening on the disk.

florianm commented 3 years ago

Reopening this issue. I'm getting spurious timeouts on one of my production forms on an ODK Central v0.6 instance. I'll modify the attachment download to tolerate timeouts and emit a warning message. If the timeouts come from the server rather than the attachment, re-running the download could resolve such timeouts.
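
A minimal sketch of what tolerating a failed download with a warning could look like in plain R. `download_or_warn` and `download_fn` are hypothetical names used for illustration; this is not ruODK's code, and it treats any error (timeout or otherwise) the same way.

```r
# Hypothetical wrapper: run a download function, but turn any error
# (such as a timeout or an HTTP 404) into a warning and return NA.
download_or_warn <- function(download_fn, filename) {
  tryCatch(
    download_fn(filename),
    error = function(e) {
      warning("Skipping ", filename, ": ", conditionMessage(e), call. = FALSE)
      NA_character_
    }
  )
}

# Usage with a stand-in download function that always fails:
# failing <- function(f) stop("Timeout was reached")
# download_or_warn(failing, "1597722025465.jpg")
```

Returning NA instead of stopping lets the remaining attachments download, and the warnings list which files to retry on a later run.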

I can see that some of my attachments do not exist on ODK Central:

curl --include https://odkcentral.dbca.wa.gov.au/v1/projects/1/forms/build_Site-Visit-Start-0-3_1559789550/submissions/uuid:fcf3d82a-3276-44d9-9b36-5f43ac460692/attachments/1597722025465.jpg -u EMAIL -p --output file.jpg
Enter host password for user 'EMAIL':
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current Dload  Upload   Total   Spent    Left  Speed
100    76  100    76    0     0     96      0 --:--:-- --:--:-- --:--:--    96
~> cat file.jpg
HTTP/2 404 
server: nginx
date: Mon, 08 Mar 2021 09:56:08 GMT
content-type: application/json; charset=utf-8
content-length: 76
x-powered-by: Express
etag: W/"4c-ShTrjQvXJDA49Wfxw7ES7C6cQFg"
strict-transport-security: max-age=63072000

{"message":"Could not find the resource you were looking for.","code":404.1}
# This is one of my famous 76B files

~> curl --include https://odkcentral.dbca.wa.gov.au/v1/projects/1/forms/build_Site-Visit-Start-0-3_1559789550/submissions/uuid:fcf3d82a-3276-44d9-9b36-5f43ac460692/attachments/ -u EMAIL -p
Enter host password for user 'EMAIL':
HTTP/2 200 
server: nginx
date: Mon, 08 Mar 2021 09:56:27 GMT
content-type: application/json; charset=utf-8
content-length: 45
x-powered-by: Express
etag: W/"2d-HVcqTtBpy7VYRhG7AXUxWyO/jck"
strict-transport-security: max-age=31536000
x-content-type-options: nosniff
x-ua-compatible: chrome=1
strict-transport-security: max-age=63072000

[{"name":"1597722025465.jpg","exists":false}]
# The riddle's solution: this file has a filename (a photo was taken in ODK Collect),
# but the file does not exist as far as ODK Central is concerned. Was this an upload error between ODK Collect and ODK Central?

The main branch of ruODK is now robust against missing attachment files without the overhead of the extra API call to test for the attachments' existence.
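
For a one-off check outside ruODK, the `exists` flag in the attachment listing shown above can be used to find such files. A minimal sketch, assuming the jsonlite package is installed; `missing_attachments` is a hypothetical helper, not part of ruODK, and the JSON is the response body copied from the listing above.

```r
library(jsonlite)

# Response body copied from the attachment listing above.
listing_json <- '[{"name":"1597722025465.jpg","exists":false}]'

# Hypothetical helper: names of attachments the server reports as missing.
missing_attachments <- function(json) {
  listing <- jsonlite::fromJSON(json)
  listing$name[!listing$exists]
}

missing <- missing_attachments(listing_json)
```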

dpagendam commented 3 years ago

Hi @florianm,

thanks for all your help with this! (and sorry to be so slow to respond). I downloaded the zip file from ODK Central and couldn't find any evidence of corrupted image files or files that weren't valid images. I have just reinstalled the latest version of ruODK from GitHub, and things seem to be downloading now without any issue, so I think the fixes that you have applied have resolved my problem. This is a really wonderful package and I am very grateful for the support!

Regards,

Dan

florianm commented 3 years ago

Aw man, great to hear! Thanks again for the bug report, it reminded me to fix attachment downloads from nested sub-tables. I'll close this for now; feel free to re-open the issue if you find ruODK misses any attachments.