ropensci / ruODK

ruODK: An R Client for the ODK Central API
https://docs.ropensci.org/ruODK/
GNU General Public License v3.0

odata_submission_get missing geodata (only first coordinate present) #95

Closed TimonWeitkamp closed 3 years ago

TimonWeitkamp commented 3 years ago

Problem

Following the example of ruODK spatial, I manage to download the data from the ODK Central server, with odata_submission_get(wkt=TRUE), and I can manage to make it an sf through st_as_sf(wkt="polygon column") with no errors.

I want to view the polygons through leaflet() or mapview(), but I get the following errors.

leaflet() %>% addTiles() %>% addPolygons(data = geo_sf_poly)
Error in if (length(nms) != n || any(nms == "")) stop("'options' must be a fully named list, or have no names (NULL)") :
  missing value where TRUE/FALSE needed

mapview::mapview(geo_sf_poly)
Error in CPL_get_bbox(obj, 2) : Not a matrix.

If I use the ruODK data (data("geo_wkt", package = "ruODK")), all works as expected, just like in the example.

So I then took a closer look at the polygon column, and I can see the values of only the first xyz coordinate of the polygon. For the data_wkt:

POLYGON ((33.6647630855441 -25.037900257032046 0))

For the sf:

list(c(33.6647630855441, -25.037900257032, 0))

If I download the CSV file manually from the server, and upload the file to QGIS through ODKTrace2wkt, there are no problems, I can see the polygons; so it is not an error on the data collection side. Somewhere along the download, geodata is left behind.
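A quick way to find the affected records before converting to sf is to count the vertices in each WKT string. The sketch below uses base R only; `wkt_vertex_count()` is a hypothetical helper (not part of ruODK or sf) and assumes single-ring polygons in the `POLYGON ((...))` form shown above. A closed ring needs at least four positions, so anything below that cannot be a valid polygon.

```r
# Hypothetical helper: count the vertices of a single-ring WKT polygon.
wkt_vertex_count <- function(wkt) {
  # Strip the leading "POLYGON ((" (optionally "POLYGON Z ((") and trailing "))"
  inner <- sub("^[[:space:]]*POLYGON[[:space:]]*(Z[[:space:]]*)?\\(\\(", "", wkt)
  inner <- sub("\\)\\).*$", "", inner)
  # Vertices are comma-separated; count the non-empty pieces
  vapply(
    strsplit(inner, ",", fixed = TRUE),
    function(p) sum(nzchar(trimws(p))),
    integer(1)
  )
}

polys <- c(
  "POLYGON ((33.6647630855441 -25.037900257032046 0))",
  "POLYGON ((0 0 0, 1 0 0, 1 1 0, 0 0 0))"
)
incomplete <- wkt_vertex_count(polys) < 4
incomplete
# [1]  TRUE FALSE
```

Rows flagged as `TRUE` are the ones that would trip up leaflet() and mapview().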

Reproducible example

I don't have a reproducible example, other than the two data points I also mentioned above

For the data_wkt:

POLYGON ((33.6647630855441 -25.037900257032046 0))

For the sf:

list(c(33.6647630855441, -25.037900257032, 0))

Session Info

```{r}
> utils::sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252 LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C
[5] LC_TIME=Dutch_Netherlands.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] sf_0.9-5 mapview_2.9.0 ggplot2_3.3.2 leaflet_2.0.3 dplyr_1.0.2 ruODK_0.9.1.9002

loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 pillar_1.4.6 compiler_4.0.2 base64enc_0.1-3 class_7.3-17 remotes_2.2.0 tools_4.0.2
[8] digest_0.6.25 gtable_0.3.0 satellite_1.0.2 lifecycle_0.2.0 tibble_3.0.3 lattice_0.20-41 pkgconfig_2.0.3
[15] png_0.1-7 rlang_0.4.7 DBI_1.1.0 rstudioapi_0.11 crosstalk_1.1.0.1 e1071_1.7-3 withr_2.2.0
[22] stringr_1.4.0 httr_1.4.2 raster_3.3-13 generics_0.0.2 vctrs_0.3.4 htmlwidgets_1.5.1 webshot_0.5.2
[29] stats4_4.0.2 classInt_0.4-3 grid_4.0.2 tidyselect_1.1.0 glue_1.4.2 R6_2.4.1 sp_1.4-2
[36] purrr_0.3.4 magrittr_1.5 scales_1.1.1 codetools_0.2-16 ellipsis_0.3.1 htmltools_0.5.0 units_0.6-7
[43] colorspace_1.4-1 KernSmooth_2.23-17 stringi_1.4.6 munsell_0.5.0 leafem_0.1.3 crayon_1.3.4
```

florianm commented 3 years ago

Hi Timon! Sorry, just returning from leave. Would you be able to upload your form to the ODK Central Sandbox and capture a record with it? Hoping we can reproduce the bug this way without compromising your own server/form.

Can you run odata_submission_get(parse=F) to see whether the coordinates are downloaded completely? I would have hoped that my unit tests prove that geoshape parsing works and won't lose data.

TimonWeitkamp commented 3 years ago

Hi, no worries :)

odata_submission_get(parse=F) downloads some polygons completely and others not (just like the example above). For example, ruODK result:

$value[[245]]$Polygon
[1] "POLYGON ((33.123963782563806 -18.933762134984136 739))"

QGIS result of same polygon: image

The strange thing is that for some records the complete polygons are downloaded with ruODK...

florianm commented 3 years ago

Hm, I'd need to access a failing record to diagnose further. This is a serious issue I'd like to fix, really appreciate the report and your help.

Can you provide a reproducible failing example I can interact with? Ideally your form on the ODK Central Sandbox with a record, short of asking for access to your current ODK Central instance which I feel would be intrusive.

florianm commented 3 years ago

Timon has shared the form with me, and I've uploaded it to the ruODK package test project on the ODK Sandbox. A first submission (polygon vertices added by manual tapping) seems to work fine: https://rpubs.com/florian_mayer/ruodk_issue95

@TimonWeitkamp could you accept the invite to the ODK Central Sandbox, set up ODK Collect with the QR code for app user "timon" you can get here and collect a few records? You should be able to run the code from https://rpubs.com/florian_mayer/ruodk_issue95 with your own ODK Central credentials (un, pw).

Feel free to email me privately an RData file with the downloaded unparsed submission(s).

TimonWeitkamp commented 3 years ago

I tested the survey in the sandbox as well (both walking with the phone and tapping manual points), and all of the polygons seem complete.

I will email you the RData file.

florianm commented 3 years ago

Thanks for the data files from your private project, and for submitting test data to your form on the ODK Central Sandbox.

The test data on the Sandbox seem fine: https://rpubs.com/florian_mayer/ruodk_issue95

In your privately shared data, I can see the issue:

# data unparsed from odata_submission_get(wkt=TRUE):
du <- readRDS("unparsed_data.rds")
# First record shows incomplete field "Polygon" (type geoshape)
du$value[[1]]$Polygon[1]
"POLYGON ((33.66333849728107 -25.04568258649638 0))"

du$value[[1]]$`__id`
[1] "uuid:141823e2-fc16-47b0-8b31-903e557f4a85"

# Same record in CSV/ZIP export shows five vertices in field Polygon: (linebreaks for readability)
-25.04568258649638 33.66333849728107 0.0 0.0;
-25.045986943913125 33.66352893412113 0.0 0.0;
-25.045775837801138 33.66398457437754 0.0 0.0;
-25.045500336184592 33.66380486637354 0.0 0.0;
-25.04568258649638 33.66333849728107 0.0 0.0

So the unparsed data is incomplete after ruODK::odata_submission_get(parse = FALSE). On ruODK's side, the data is a nested list, coming from the httr::content() of the response to the API call "v1/projects/{pid}/forms/{URLencode(fid, reserved = TRUE)}.svc/{table}". ruODK does not modify the data in any way.
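To make the endpoint concrete, here is a minimal sketch of how such an OData table URL could be assembled in base R. `odata_table_url()` is a hypothetical helper and the server, project and form values are made up; ruODK's internal implementation may differ. The default table name "Submissions" is also an assumption for illustration.

```r
# Hypothetical helper: assemble the OData table URL from its parts.
odata_table_url <- function(base_url, pid, fid, table = "Submissions") {
  paste0(
    sub("/+$", "", base_url),                      # drop trailing slashes
    "/v1/projects/", pid,
    "/forms/", utils::URLencode(fid, reserved = TRUE),  # escape spaces, "/" etc.
    ".svc/", table
  )
}

# Made-up example values, not a real server or form
odata_table_url("https://central.example.org/", 1, "my form/v2")
# [1] "https://central.example.org/v1/projects/1/forms/my%20form%2Fv2.svc/Submissions"
```

Requesting that URL with valid credentials and inspecting the JSON body directly is one way to confirm that the truncation is already present in the server response.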

That also means ruODK::odata_submission_rectangle()/split_geoshape() are unlikely to be the cause. I mention this as the geoshape/trace/point parsing involves a regex to discard trailing commas (ODK Central versions < 0.8).

Next questions to @TimonWeitkamp: What version of ODK Central are you using? I assume it's > 0.8? Can you reproduce the missing coordinates through odata_submission_get(wkt=FALSE) which will return GeoJSON?

TimonWeitkamp commented 3 years ago

I'm using version 1.0 indeed.

If I download the GeoJSON, I miss coordinates as well: image

florianm commented 3 years ago

Timon has provided me with access to his instance, which shows those errors in the form. Here's an excerpt showing that OData loses some geoshape data, whereas the CSV export and the RESTful submission_get do not.

Retrieving the data

For clarity, I show the code retrieving the same data in three separate ways.

# Option 1: via OData
data <- ruODK::odata_submission_get(
  download = FALSE, # we don't need attachments here
  table = ft$url[1],
  local_dir = loc,
  wkt = TRUE
)

data_raw_wkt <- ruODK::odata_submission_get(
  download = FALSE, # we don't need attachments here
  table = ft$url[1],
  local_dir = loc,
  wkt = TRUE,
  parse = FALSE
)

data_raw_gj <- ruODK::odata_submission_get(
  download = FALSE, # we don't need attachments here
  table = ft$url[1],
  local_dir = loc,
  wkt = FALSE,
  parse = FALSE
)

# Option 2: via ZIP export, set overwrite = TRUE to refresh download
data_csv_zip <- ruODK::submission_export(overwrite = FALSE)
data_csv_extracted <- unzip(data_csv_zip)
data_csv <- readr::read_csv(data_csv_extracted[[1]])

# Option 3: via REST
sl <- ruODK::submission_list()
sub_raw <- ruODK::submission_get(sl$instance_id)

Looking at the first record

The first submission already demonstrates the bug. We'll retrieve the instanceID to prove that the five different R objects contain the same record in their first row / list element.

R> data$id[[1]]
[1] "uuid:55673ddd-bc33-4919-9f42-f61370643e4b"
R> data_raw_wkt$value[[1]]$`__id`
[1] "uuid:55673ddd-bc33-4919-9f42-f61370643e4b"
R> data_raw_gj$value[[1]]$`__id`
[1] "uuid:55673ddd-bc33-4919-9f42-f61370643e4b"
R> data_csv$KEY[[1]]
[1] "uuid:55673ddd-bc33-4919-9f42-f61370643e4b"
R> sub_raw[[1]]$meta$instanceID[[1]]
[1] "uuid:55673ddd-bc33-4919-9f42-f61370643e4b"

Comparing the offending geoshape field

Now we'll look at the offending geoshape field named "Polygon".

# OData - missing data in both parsed and unparsed versions, both WKT and GeoJSON formats
R> data$polygon[[1]]
[1] "POLYGON ((33.72580546885729 -24.986248868859494 0))"
R> data_raw_wkt$value[[1]]$Polygon
[1] "POLYGON ((33.72580546885729 -24.986248868859494 0))"
R> data_raw_gj$value[[1]]$Polygon
$type
[1] "Polygon"

$coordinates
$coordinates[[1]]
$coordinates[[1]][[1]]
[1] 33.72581

$coordinates[[1]][[2]]
[1] -24.98625

$coordinates[[1]][[3]]
[1] 0

# CSV export - OK
R> data_csv$Polygon[[1]]
[1] "-24.986248868859494 33.72580546885729 0.0 0.0; -24.98602033783035 33.72611157596111 0.0 0.0; -24.985721301905958 33.72586816549301 0.0 0.0; -24.985910022833146 33.72559256851673 0.0 0.0; -24.986248868859494 33.72580546885729 0.0 0.0"

# RESTful submission_get: OK
R> sub_raw[[1]]$Polygon
[[1]]
[1] "-24.986248868859494 33.72580546885729 0.0 0.0; -24.98602033783035 33.72611157596111 0.0 0.0; -24.985721301905958 33.72586816549301 0.0 0.0; -24.985910022833146 33.72559256851673 0.0 0.0; -24.986248868859494 33.72580546885729 0.0 0.0"

The above output shows that the OData submission API returns only the first / last coordinate of the geoshape, while the other endpoints (CSV/ZIP export, RESTful submission_get) return the full record. The fact that the coordinates are already missing in the unparsed (raw) OData response shows that ruODK::odata_submission_rectangle and ruODK::handle_geoshape do not lose data themselves. One point goes in, one point comes out.

This seems to point towards the OData submission API endpoint, unless I have overlooked something. In contrast, the same form deployed to the ODK Central Sandbox with a handful of data collected by both Timon and me does not show that problem (yet).
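To find every affected record rather than inspecting them one by one, the OData and CSV results can be joined on the submission ID and their vertex counts compared. This is a rough sketch with toy data standing in for `data` and `data_csv`; the helper names `n_wkt()` and `n_odk()` are hypothetical, and the column names (`id`, `polygon`, `KEY`, `Polygon`) are assumed to match the objects above.

```r
# Vertices in a single-ring WKT polygon: number of commas plus one
n_wkt <- function(x) lengths(regmatches(x, gregexpr(",", x, fixed = TRUE))) + 1L

# Coordinate tuples in a raw ODK geoshape string ("lat lon alt acc; ...")
n_odk <- function(x) {
  vapply(
    strsplit(x, ";", fixed = TRUE),
    function(p) sum(nzchar(trimws(p))),
    integer(1)
  )
}

# Toy stand-ins for the parsed OData table and the CSV export
odata <- data.frame(
  id = c("uuid:a", "uuid:b"),
  polygon = c("POLYGON ((33.7 -24.9 0))",
              "POLYGON ((0 0 0,1 0 0,1 1 0,0 0 0))"),
  stringsAsFactors = FALSE
)
csv <- data.frame(
  KEY = c("uuid:b", "uuid:a"),
  Polygon = c("0 0 0 0;0 1 0 0;1 1 0 0;0 0 0 0",
              "-24.9 33.7 0 0; -24.8 33.8 0 0; -24.7 33.7 0 0; -24.9 33.7 0 0"),
  stringsAsFactors = FALSE
)

# Join on the submission ID and compare vertex counts per record
both <- merge(odata, csv, by.x = "id", by.y = "KEY")
both$n_odata <- n_wkt(both$polygon)
both$n_csv <- n_odk(both$Polygon)
truncated_ids <- both$id[both$n_odata < both$n_csv]
truncated_ids
# [1] "uuid:a"
```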

florianm commented 3 years ago

The discussion on the ODK Slack chat indicates that the problem is whitespace in the captured geoshapes.

Valid geoshapes contain ";" separated coordinate tuples.

"-24.986248868859494 33.72580546885729 0.0 0.0;-24.98602033783035 33.72611157596111 0.0 0.0;-24.985721301905958 33.72586816549301 0.0 0.0;-24.985910022833146 33.72559256851673 0.0 0.0;-24.986248868859494 33.72580546885729 0.0 0.0"

Invalid geoshapes have additional whitespace after the ";":

"-24.986248868859494 33.72580546885729 0.0 0.0; -24.98602033783035 33.72611157596111 0.0 0.0; -24.985721301905958 33.72586816549301 0.0 0.0; -24.985910022833146 33.72559256851673 0.0 0.0; -24.986248868859494 33.72580546885729 0.0 0.0"

ODK Central does not post-process geoshapes on CSV/ZIP export, but does post-process them on OData export. ODK Central's OData geoshape/trace parser is likely to cut off coordinates after the whitespace, explaining the "only one coordinate" geoshapes.
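As a possible workaround on the R side, the raw geoshape string from the CSV/ZIP export can be converted to WKT while tolerating whitespace around the ";" separators. The following is a minimal sketch, not ruODK's own parser: `odk_geoshape_to_wkt()` is a hypothetical helper that handles one string at a time and assumes the "lat lon alt acc" tuple order shown above.

```r
# Hypothetical helper: convert one raw ODK geoshape string to a WKT polygon,
# accepting both ";" and "; " as tuple separators.
odk_geoshape_to_wkt <- function(x) {
  tuples <- trimws(strsplit(x, ";", fixed = TRUE)[[1]])
  tuples <- tuples[nzchar(tuples)]  # ignore empty pieces, e.g. a trailing ";"
  vertices <- vapply(
    strsplit(tuples, "[[:space:]]+"),
    # ODK order is "lat lon alt acc"; WKT wants "x y z", i.e. "lon lat alt"
    function(v) paste(v[2], v[1], v[3]),
    character(1)
  )
  paste0("POLYGON ((", paste(vertices, collapse = ", "), "))")
}

valid   <- "-24.9 33.7 0.0 0.0;-24.8 33.8 0.0 0.0;-24.9 33.7 0.0 0.0"
invalid <- "-24.9 33.7 0.0 0.0; -24.8 33.8 0.0 0.0; -24.9 33.7 0.0 0.0"
odk_geoshape_to_wkt(invalid)
# [1] "POLYGON ((33.7 -24.9 0.0, 33.8 -24.8 0.0, 33.7 -24.9 0.0))"
identical(odk_geoshape_to_wkt(valid), odk_geoshape_to_wkt(invalid))
# [1] TRUE
```

The output uses the same `POLYGON ((x y z))` form that the OData endpoint returns, so it should be usable with st_as_sf(wkt = ...) in the same way.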

@TimonWeitkamp as discussed via email, the CSV/ZIP export will provide you with the data in full fidelity, but it repeats the media file download on every export. Could you provide details about the data collectors' devices and capture methods in the ODK Slack chat? Are you happy for me to close this issue, seeing that it's a bug between ODK Collect and Central?

TimonWeitkamp commented 3 years ago

I will provide the information in the Slack chat.

Thanks for the help, you can close this issue.