ropensci / ruODK

ruODK: An R Client for the ODK Central API
https://docs.ropensci.org/ruODK/
GNU General Public License v3.0
42 stars 13 forks source link

form_schema_ext: Possibly incorrect choices when translation is present #102

Closed dmenne closed 3 years ago

dmenne commented 3 years ago

I am using form_schema_ext by @mtyszler, from the code in the current ruODK repository.

My xlsx-file has the following choices

list_name name label::Deutsch (de) label::English (en)
janein ja Ja yes
janein nein Nein no
wertsachen natel Natel Mobile
wertsachen portemonnaie Portemonnaie Wallet
wertsachen hörgerät Hörgerät Hearing Aid
wertsachen brille Brille Glasses

This gives in choices_deutsch:

natel, portemonnaie, hörgerät, brille, Natel, Portemonnaie, Hörgerät, Brille

and for choices_english natel, portemonnaie, hörgerät, brille, Mobile, Wallet, Hearing Aid, Glasses

and for choices always NA. Is this as designed?

The format makes parsing ugly, because you have to split the entries in the middle and rejoin them. Or did I miss some auxillary function to untangle this?

I would have expected that this would give in column choices: natel, portemonnaie, hörgerät, brille

and Natel, Portemonnaie, Hörgerät, Brille, Mobile, Wallet, Hearing Aid, Glasses in choices_<language>.

mtyszler commented 3 years ago

Hi @dmenne . First of all, thank you for trying the new form_schema_ext() function. Let me reply to your questions and highlight potential modifications:

1. NA return values:

This gives in choices_deutsch:

natel, portemonnaie, hörgerät, brille, Natel, Portemonnaie, Hörgerät, Brille

and for choices_english natel, portemonnaie, hörgerät, brille, Mobile, Wallet, Hearing Aid, Glasses

and for choices always NA. Is this as designed?

choices in this case will indeed be NA by design. As indicated in the function reference, the return value includes:

My logic was the following: there is always at least 1 label language and one _choicelabel language, which is the default. Other languages come additionally. Often, is indeed the case that when multiple languages are included, the default language is excluded.

→ Here one could enforce a rule that when default label is fully NA and other languages are provided, the default label gets dropped (and similarly to choices)

2. Parsing:

The format makes parsing ugly, because you have to split the entries in the middle and rejoin them. Or did I miss some auxillary function to untangle this?

Here I'm not sure I get you. The choice_lang returns a list of lists. Based in the example provided in the function reference, I have:

fsx <- form_schema_ext()

fsx[fsx$name == "test_yn", "choices_english_(en)"]
# # A tibble: 1 x 1
#   `choices_english_(en)`
#   <list>                
# 1 <named list [2]>     

fsx[fsx$name == "test_yn", "choices_english_(en)"][[1]]
# [[1]]
# [[1]]$values
# [1] "0"  "1"  "99"

# [[1]]$labels
# [1] "Yes"   "No"    "Maybe"

# save values:
test_yn_VALUES <- fsx[fsx$name == "test_yn", "choices_english_(en)"][[1]][[1]]$values
test_yn_VALUES
# [1] "0"  "1"  "99"

# save labels:
test_yn_LABELS_ENG <- fsx[fsx$name == "test_yn", "choices_english_(en)"][[1]][[1]]$labels
test_yn_LABELS_ENG 
# [1] "Yes"   "No"    "Maybe"

So from above you have the values and the labels extracted. You could use that to parse your data from values into labels. Continuing the example from above:

data <- odata_submission_get(download = FALSE)

# show content of test_yn
data$test_yn
# [1] "1"  "99"

# convert to labels:
data$test_yn_labels <- factor(data$test_yn, levels = test_yn_VALUES, labels = test_yn_LABELS_ENG)
data$test_yn_labels
# [1] No    Maybe
# Levels: Yes No Maybe

→ Here one could simplify somehow. Maybe the list of lists can be a bit easier to access (please suggest how). I'm not so much in favor of having choice containing the values/names and the _choicelang the translations for the reason the that when only one language is provided, you would still need to have at least two columns. You do have a point in the sense that choice values are the same regardless of the language. My logic was to keep them together in a single object.

Path forward

I'm curious to learn what @florianm thinks about it. I'm happy to make changes that increase usability of the function.

florianm commented 3 years ago

@dmenne thanks for the issue report and @mtyszler thanks for the thoughts!

The blank choices in translated forms indeed are a design choice of ODK Central. I'm leaning towards preserving ODK Central behaviour wherever feasible but am open to suggestions or contributions to examples / vignettes / functionality which improve the user experience.

If the issue ultimately stems from ODK Central behaviour, it's always good to take the discussion to the core ODK team via the ODK forum. There is a possibility that this could lead to changes to API endpoints (e.g. content of default choices for translated forms) or new API endpoints. I have suggested the addition of translations to the JSON form schema: https://forum.getodk.org/t/include-form-field-labels-and-hints-in-form-schema/27737

MArcelo's examples would make a great addition to a "translations" vignette working through the common use cases.

mtyszler commented 3 years ago

Hi @florianm , tks for the feedback. Some very quick reactions from my side:

The blank choices in translated forms indeed are a design choice of ODK Central.

That is sort of the case. But to be precise, form_schema_ext traverses the XML form and enforces a blank label if only translated labels are available, and similarly it enforces blank _choicelabels if choices are available but only translated _choicelabels are available.

You can see this in the code:

# lines 152-157
 # initialize dataframe
  extension <- data.frame(
    path = character(0),
    label = character(0),
    stringsAsFactors = FALSE
  )

This will initialize a label column, which will be empty if only translations are provided. Similarly:

# lines 267-274
# check if 'choices' column already exist
        if (!("choices" %in% colnames(extension))) {

          # if not, create new column
          extension <- cbind(extension, data.frame(
            choices = rep(NA, nrow(extension))
          ))
        }

Similarly will initialize a choices column, which will contains empty labels if only translations are provided.

So for both cases one could enforce dropping the column if fully empty. The drawback of this, from a user point of view, is that you do not know in advance which columns to expect. On the other hand, you never which translations to expect.

Another point here is that @dmenne is also questioning the content format of choices or any choices_lang. As of now it is a list of lists, and we could potentially change to have a column choice_values with the (translation independente XML form _choice names) and a columnchoice_labels_langfor each available translation containing **only** the _choice labels_. Yet, all of thesechoice` columns will contain a list of its items. As I mentioned above, for me it just seemed useful to have values and labels in a single element.

MArcelo's examples would make a great addition to a "translations" vignette working through the common use cases.

I certainly agree that improving documentation is always good.

Let me know

dmenne commented 3 years ago

Sorry, mates, mea culpa. I had extracted the extra function to understand how it works and have used it incorrectly - I did not check more carefully, because there was no error.

Looks fine

florianm commented 3 years ago

Glad it's resolved! Dieter, Marcelo, would any change in documentation improve the user experience?

dmenne commented 3 years ago

No, I think it is fine. My error was that I did not take into account the magic of dplyr::left_join with nested lists, trying to replace it with merge. I strictly avoid using tidyverse in packages (not in one-shot report), because the maintenance of the ever-changing tidyverse in my CRAN package is not funny.

dmenne commented 3 years ago

@mtyszler : I would like to use parts of your function in my own package odkapi which implements the REST Api on a lower level than ruODK, replacing the last line (tidyverse) and the ruODK calls by my own calls.

I would make you a contributor to the package and cite your name in code. Is that ok?

mtyszler commented 3 years ago

Glad it's resolved! Dieter, Marcelo, would any change in documentation improve the user experience?

I think the base documentation is ok, in it is clear on what to expect. There's nothing coming out of the function that is not acknowledged by the function help.

However, adding something like this to the example section could be helpful:

# save values:
test_yn_VALUES <- fsx[fsx$name == "test_yn", "choices_english_(en)"][[1]][[1]]$values
test_yn_VALUES
# [1] "0"  "1"  "99"

# save labels:
test_yn_LABELS_ENG <- fsx[fsx$name == "test_yn", "choices_english_(en)"][[1]][[1]]$labels
test_yn_LABELS_ENG 
# [1] "Yes"   "No"    "Maybe"

In a vignette, adding something along the lines of the example below can be useful. I believe the main use of the choices columns from form_schema_ext is to convert choice values into choice labels for (summary) tables and graphs.

data <- odata_submission_get(download = FALSE)

# show content of test_yn
data$test_yn
# [1] "1"  "99"

# convert to labels:
data$test_yn_labels <- factor(data$test_yn, levels = test_yn_VALUES, labels = test_yn_LABELS_ENG)
data$test_yn_labels
# [1] No    Maybe
# Levels: Yes No Maybe
mtyszler commented 3 years ago

@mtyszler : I would like to use parts of your function in my own package odkapi which implements the REST Api on a lower level than ruODK, replacing the last line (tidyverse) and the ruODK calls by my own calls.

I would make you a contributor to the package and cite your name in code. Is that ok?

Sure! 👍

Citation and re-use is at essence of science and open-source.

florianm commented 3 years ago

Is odkapi online anywhere? Would love to take a look at it.

dmenne commented 3 years ago

Not yet on github (https://github.com/dmenne). I am still working on the documentation and doing the cleanup. There are tests against the apiary server, and against a saved mock of the apiary server when no connection is available using httptest, which works nicely but fails sometimes. And a set of live workflow test requiring a server, which have cost most of the work, because the apiary server has some issues, and I needed some time until I understood the phase transitions between draft and published.

I will send you a zip .

florianm commented 3 years ago

Nice, thanks! I haven't used httptest yet or experienced the issues you ran into with it, but did you try rOpenSci/vcr for HTTP mocking and testing?

dmenne commented 3 years ago

No issues with httptest, it works nicely with the exception of some things I do not expect that could work with capture, e.g. workflow tests with random names to avoid collisions.

The issues were with the combination of covr/testthat version 3, but as I saw on github this is "in work".

vcr: I had looked at it, it was nice. I do not exactly remember why I choose httptest, I believe it was because I found more apps using it. Could be an incorrect impression.