ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data
https://docs.ropensci.org/EML
Other
98 stars 33 forks source link

Can't infer numeric domain from col_classes for interval data #296

Closed karawoo closed 4 years ago

karawoo commented 4 years ago

When I try to define attributes with an interval measurement scale, inferring the domain with col_classes = "numeric" fails. I believe the following should be valid:

library("tidyverse")
library("EML")

test <- tribble(
  ~attributeName, ~attributeDefinition,     ~unit, ~numberType, ~measurementScale,
       "degrees",          "degrees C", "celsius",      "real",         "interval"
)
set_attributes(test, col_classes = "numeric")
#> Error: For the attribute degrees the measurementScale value inferred from col_classes
#>               does not agree with the measurementScale value existing
#>               in attributes. Check col_classes and the measurementScale
#>               column you provided.

numericDomain seems valid if I specify it explicitly:

test2 <- mutate(test, domain = "numericDomain")
test_table <- set_attributes(test2) # no errors

# Create fake dataset to validate
dataTable <- list(
  entityName = "foo.txt",
  entityDescription = "my data",
  physical = set_physical(tempfile()),
  attributeList = test_table
)

dataset <- list(
  title = "foo",
  creator = as_emld(person("Kara", "Woo")),
  pubDate = 2020,
  intellectualRights = "CC-0",
  contact = as_emld(person("Kara", "Woo")),
  dataTable = dataTable
)

eml <- list(
  dataset = dataset
)

write_eml(eml, "test.xml")
eml_validate("test.xml")
#> [1] TRUE
#> attr(,"errors")
#> character(0)

Created on 2020-02-16 by the reprex package (v0.3.0.9001) I thought these should behave the same, but am I missing something?

amoeba commented 4 years ago

Hey @karawoo, thanks for filing, esp. with your reprex.

Your first example code looks quite reasonable. After a little look into the code, I think I see what's going on. When col_classes is specified, set_attributes tries to infer a measurementScale from the provided value. However, a value of numeric may map to either ratio (e.g., e.g. degrees K) or interval (e.g., degrees C) but infer_domain_scale assumes numeric should be ratio:

https://github.com/ropensci/EML/blob/f5538a319cd7fe7cf946f851769b1e5a4f46482f/R/set_attributes.R#L364

This is the best default IMHO but breaks your quite-reasonable code. You can see it working fine when a col_classes value of ratio is provided instead of interval:

library(tidyverse)
library(EML)

test <- tribble(
  ~attributeName, ~attributeDefinition, ~unit,     ~numberType, ~measurementScale,
  "degrees",      "degrees C",          "celsius", "real",      "ratio"
)
set_attributes(test, col_classes = "numeric") # No errors

The error comes about when infer_domain_scale (1) maps your numeric col class value to ratio and then compares the inferred value (ratio) with your provided value in the measurementScale column of your attributes data.frame.

Note in my above permalink that @cboettig put a comment indicating to me he might have meant to come back to this at some point. I think set_attributes could be reworked a bit here but I ran into some questions with my attempted factor about what the desired behavior of this function would be so I thought comment and see what @cboettig thinks.

I think the thing to do is make infer_domain_scale continue to assume ratio for numeric columns but not error when measurementScale is explicitly set as interval by changing up the logic a bit.

cboettig commented 4 years ago

@amoeba yup, I think your take is spot on; the implementation is assuming ratio always, when it should be handling cases like Kara's example explicitly. (& wow did I write nice helpful comments: # ! ). Happy for a PR!

amoeba commented 4 years ago

PR'd in #297. I started out with a refactor of the logic in infer_domain_scale to support checking against multiple possible values (interval, ratio) but it ended up involving more code than I liked to fit it into how infer_domain_scale was plumbed so I opted for a lighter patch which just addresses this problem directly. @karawoo 's example is included in the test suite to confirm it's fixed. Let me know what you think @cboettig .