ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data
https://docs.ropensci.org/EML

misleading `set_attributes()` warning message re: custom units #253

Closed jagoldstein closed 5 years ago

jagoldstein commented 5 years ago

Note this warning output by EML::set_attributes():

> attributeList <- set_attributes(attributes = foo$attributes)
Warning message:
In set_attribute(attributes[i, ], factors = factors) :
  unit 'hectopascal' is not recognized, using custom unit.
          Please define a custom unit or replace with a
          recognized standard unit (see set_unitList() for details)

The warning instructs, "unit 'hectopascal' is not recognized, using custom unit. Please define a custom unit or..." but in this case the custom unit was already defined for 'hectopascal'.

This messaging should be adjusted so that it is less assertive. Something along the lines of "Please ensure that it has a custom unit definition." or "This should be defined as a custom unit if it is not already."

Here is my MRE:

library(dataone)
library(EML)
library(arcticdatautils)
library(datamgmt)

mn <- MNode("https://arcticdata.io/metacat/d1/mn/v2")

pkg <- get_package(mn, "resource_map_urn:uuid:cd72f471-b333-428c-af25-6d96495a5adf")
eml <- read_eml(getObject(mn, pkg$metadata))

attributes_table <- get_attributes(eml@dataset@otherEntity[[3]]@attributeList)$attributes

foo <- create_attributes_table(NULL, attributes_table)

# change `dimensionless` to `hectopascal` for "AirPressure"
# then Quit the app

attributeList <- set_attributes(attributes = foo$attributes)

Here is my session info:

> session_info()
Session info -------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.4 (2018-03-15)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.423)           
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2018-11-30                  

Packages -----------------------------------------------------------------------------------
 package         * version   date       source                                
 arcticdatautils * 0.6.4     2018-11-27 Github (NCEAS/arcticdatautils@3ad784a)
 assertthat        0.2.0     2017-04-11 CRAN (R 3.4.0)                        
 base64enc         0.1-3     2015-07-28 CRAN (R 3.4.0)                        
 bindr             0.1.1     2018-03-13 CRAN (R 3.4.4)                        
 bindrcpp          0.2.2     2018-03-29 CRAN (R 3.4.4)                        
 bitops          * 1.0-6     2013-08-17 CRAN (R 3.4.2)                        
 cellranger        1.1.0     2016-07-27 CRAN (R 3.4.1)                        
 colorspace        1.3-2     2016-12-14 CRAN (R 3.4.1)                        
 commonmark        1.5       2018-04-28 CRAN (R 3.4.4)                        
 compare           0.2-6     2015-08-25 cran (@0.2-6)                         
 crayon            1.3.4     2017-09-16 CRAN (R 3.4.4)                        
 curl              2.7       2018-08-16 Github (jeroen/curl@01e53c0)          
 datamgmt        * 0.1.0     2018-08-15 Github (NCEAS/datamgmt@8cd5692)       
 dataone         * 2.1.1     2018-06-28 cran (@2.1.1)                         
 datapack        * 1.3.1     2017-08-29 CRAN (R 3.4.1)                        
 devtools        * 1.12.0    2016-12-05 CRAN (R 3.4.0)                        
 digest            0.6.18    2018-10-10 cran (@0.6.18)                        
 dplyr           * 0.7.4     2017-09-28 cran (@0.7.4)                         
 EML             * 1.0.3     2018-09-17 Github (ropensci/EML@1b3d0a2)         
 ggplot2           3.0.0     2018-07-03 CRAN (R 3.4.4)                        
 git2r             0.21.0    2018-01-04 CRAN (R 3.4.4)                        
 glue              1.3.0     2018-07-17 CRAN (R 3.4.4)                        
 gsubfn            0.7       2018-03-16 CRAN (R 3.4.4)                        
 gtable            0.2.0     2016-02-26 CRAN (R 3.4.1)                        
 hash              2.2.6     2013-02-21 CRAN (R 3.4.0)                        
 htmltools         0.3.6     2017-04-28 CRAN (R 3.4.2)                        
 htmlwidgets       1.2       2018-04-19 CRAN (R 3.4.4)                        
 httpuv          * 1.4.5     2018-07-19 CRAN (R 3.4.4)                        
 httr            * 1.3.1     2017-08-20 cran (@1.3.1)                         
 jsonlite          1.5       2017-06-01 cran (@1.5)                           
 knitr             1.20      2018-02-20 CRAN (R 3.4.4)                        
 later             0.7.3     2018-06-08 CRAN (R 3.4.4)                        
 lattice           0.20-38   2018-11-04 CRAN (R 3.4.4)                        
 lazyeval          0.2.1     2017-10-29 CRAN (R 3.4.4)                        
 listviewer      * 2.0.0     2018-03-26 CRAN (R 3.4.4)                        
 lubridate       * 1.7.4     2018-04-11 cran (@1.7.4)                         
 magrittr        * 1.5       2014-11-22 CRAN (R 3.4.0)                        
 memoise           1.1.0     2017-04-21 CRAN (R 3.4.0)                        
 mime              0.5       2016-07-07 CRAN (R 3.4.0)                        
 munsell           0.5.0     2018-06-12 CRAN (R 3.4.4)                        
 parsedate         1.1.3     2017-03-02 CRAN (R 3.4.0)                        
 pillar            1.3.0     2018-07-14 CRAN (R 3.4.4)                        
 pkgconfig         2.0.2     2018-08-16 cran (@2.0.2)                         
 plyr              1.8.4     2016-06-08 CRAN (R 3.4.0)                        
 promises          1.0.1     2018-04-13 CRAN (R 3.4.4)                        
 proto             1.0.0     2016-10-29 cran (@1.0.0)                         
 purrr           * 0.2.5     2018-05-29 CRAN (R 3.4.4)                        
 R6                2.3.0     2018-10-04 cran (@2.3.0)                         
 Rcpp              1.0.0     2018-11-07 cran (@1.0.0)                         
 RCurl           * 1.95-4.11 2018-07-15 CRAN (R 3.4.4)                        
 readxl            1.1.0     2018-04-20 CRAN (R 3.4.4)                        
 redland           1.0.17-10 2018-07-20 CRAN (R 3.4.4)                        
 remotes         * 1.1.1     2017-12-20 CRAN (R 3.4.3)                        
 rgdal           * 1.3-3     2018-06-22 CRAN (R 3.4.4)                        
 rlang             0.3.0.1   2018-10-25 cran (@0.3.0.1)                       
 roxygen2          6.1.0     2018-07-27 CRAN (R 3.4.4)                        
 rstudioapi        0.7       2017-09-07 CRAN (R 3.4.3)                        
 scales            0.5.0     2017-08-24 CRAN (R 3.4.1)                        
 shiny           * 1.1.0     2018-05-17 CRAN (R 3.4.4)                        
 shinyjs           1.0       2018-01-08 CRAN (R 3.4.4)                        
 sp              * 1.3-1     2018-06-05 CRAN (R 3.4.4)                        
 stringi           1.2.4     2018-07-20 CRAN (R 3.4.4)                        
 stringr           1.3.1     2018-05-10 cran (@1.3.1)                         
 tibble            1.4.2     2018-01-22 CRAN (R 3.4.3)                        
 tidyr             0.8.2     2018-10-28 cran (@0.8.2)                         
 udunits2          0.13      2016-11-17 CRAN (R 3.4.3)                        
 units             0.6-0     2018-06-09 CRAN (R 3.4.4)                        
 uuid            * 0.1-2     2015-07-28 CRAN (R 3.4.0)                        
 withr             2.1.2     2018-03-15 CRAN (R 3.4.4)                        
 XML             * 3.98-1.16 2018-08-19 cran (@3.98-1.)                       
 xml2              1.2.0     2018-01-24 cran (@1.2.0)                         
 xtable            1.8-2     2016-02-05 CRAN (R 3.4.3)                        
 yaml              2.1.19    2018-05-01 cran (@2.1.19)

amoeba commented 5 years ago

I wasn't able to reproduce due to what appears to be a permissions issue, but I could reproduce it with this MRE:

library(EML)

eml <- read_eml(system.file("xsd/test/eml-datasetWithAttributelevelMethods.xml", package = "EML"))
attrs <- get_attributes(eml@dataset@dataTable[[1]]@attributeList)
attrs$attributes[4,"unit"] <- "hectopascal"
set_attributes(attrs$attributes)

After running the code and looking at the warning, I actually prefer the warning as-is. Curious to hear what @cboettig thinks.

cboettig commented 5 years ago

Thanks both!

@amoeba side note: it looks like you're still running the S4 EML. In 2.0 your example would be:

library(EML)

eml <- read_eml(system.file("tests/eml-2.1.1/eml-datasetWithAttributelevelMethods.xml", package = "emld"))
attrs <- get_attributes(eml$dataset$dataTable$attributeList)
attrs$attributes[4,"unit"] <- "hectopascal"
set_attributes(attrs$attributes)

Which produces the warning @jagoldstein mentions above. Of course, in this case we indeed have not defined hectopascal yet, so the warning is appropriate, but I think the more general question still stands: in the current design, set_attributes has no access to or knowledge of what customUnits, if any, have been set. Setting aside the tricky issue of whether set_attributes ought to be able to check the user's customUnit list first, I agree the warning could be more conditional, i.e.

Please be sure you also define a custom unit in your EML file, or replace with a recognized standard unit. See set_unitList() for details. 

instead of

Please define a custom unit or replace with a
          recognized standard unit (see set_unitList() for details)

(I would say "be sure you have defined", but using the past tense gives the suggestion that the set_attributes function may not work properly if you haven't already defined the unit, which is not the case.)

Also raises a more general question: I don't think failing to define custom units causes EML validation to fail? Bryce? Maybe it should?
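
For reference, a minimal sketch of the custom unit definition the warning asks for. It assumes the data frame layout that `set_unitList()` accepts in EML 2.0 (columns `id`, `unitType`, `parentSI`, `multiplierToSI`, `description`). The values chosen for hectopascal are assumptions (1 hPa = 100 Pa), and the `set_unitList()` call is guarded so the snippet also runs where EML is not installed:

```r
# Custom unit definition for hectopascal (values assumed: 1 hPa = 100 Pa).
custom_units <- data.frame(
  id = "hectopascal",
  unitType = "pressure",
  parentSI = "pascal",
  multiplierToSI = 100,
  description = "hectopascal, equal to 100 pascals",
  stringsAsFactors = FALSE
)

# Build the unitList only if the EML package is available.
if (requireNamespace("EML", quietly = TRUE)) {
  unitList <- EML::set_unitList(custom_units)
}
```

The resulting unitList still has to be added to the EML record's additionalMetadata by the user; set_attributes does not see it either way.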

jagoldstein commented 5 years ago

Great points @cboettig. I think "Please be sure a custom unit is defined" might be less confusing than "Please be sure you also define...", which could be interpreted as "do this because it's known that you haven't yet".

Alternatively, the message could include a disclaimer, "You may have already done so", but that seems too verbose.

amoeba commented 5 years ago

> I don't think failing to define custom units causes EML validation to fail? Bryce? Maybe it should?

IIRC this does cause a validation error. Right, @jagoldstein ?

jagoldstein commented 5 years ago

EML::eml_validate() does NOT fail due to custom units.

But there is a failure when arcticdatautils::publish_update() or arcticdatautils::publish_object() is used to publish an EML with custom units that are undefined.

Output example: Error in xml document. This EML instance is invalid because referenced id hectopascal does not exist in the given keys.</error>.

amoeba commented 5 years ago

Okay, yeah. So that matches my understanding. Missing custom units make for invalid EML but not necessarily schema-invalid EML, which is all the EML package checks. The EML package's validation routine should be checking more than just schema validity, which means checking custom attributes.

cboettig commented 5 years ago

@amoeba I think the eml_validate() function should check for EML validation as defined by the EML spec, e.g. https://github.com/NCEAS/eml/blob/BRANCH_EML_2_2/docs/eml-validation-refs.md, which is more than just checking schema validation, but which does not (afaik) involve any checks to confirm that a custom unit is defined in the file. (I have an open PR on emld which adds the additional checks described in the spec, see https://github.com/cboettig/emld/pull/26).

I'm all for additional quality checks as separate functions (in arcticdatautils, EML or both) but don't think we should include checks in eml_validate that are not dictated by the EML specification -- we can't have inconsistent notions of what it means to be valid!

@mbjones thoughts on this? Is it worth including checks that any custom unit is defined as part of the validity specification?
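
A separate quality check of this kind could look roughly like the following sketch. It is base R only. `undefined_custom_units()` is a hypothetical helper, not a function in EML or arcticdatautils, and the list of standard units is supplied by the caller rather than looked up from the package:

```r
# Hypothetical helper: return the units used in an attributes table that are
# neither standard units nor defined as custom units.
undefined_custom_units <- function(attribute_units, standard_units, custom_unit_ids) {
  # Drop missing units (e.g. non-numeric attributes) and duplicates.
  used <- unique(attribute_units[!is.na(attribute_units)])
  setdiff(used, c(standard_units, custom_unit_ids))
}

# Illustrative inputs (not taken from a real EML record).
attribute_units <- c("celsius", "hectopascal", NA, "meter")
standard_units <- c("celsius", "meter", "pascal")

# No custom units defined yet: hectopascal is reported.
undefined_custom_units(attribute_units, standard_units, custom_unit_ids = character(0))

# With hectopascal defined as a custom unit: nothing is reported.
undefined_custom_units(attribute_units, standard_units, custom_unit_ids = "hectopascal")
```

Keeping this outside eml_validate() leaves the notion of validity tied to the EML specification while still catching the missing definition before publishing.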

amoeba commented 5 years ago

Totally agree. I imagine @mbjones meant to include that in the spec since we enforce it through our own EML validator.

cboettig commented 5 years ago

@amoeba um, do you remember where we stand on this one?

Skimming the above, it sounds like eml_validate() is behaving as we want it to (insisting custom units be defined), so I think we can close this?

amoeba commented 5 years ago

Yes, unless we want to massage the wording a bit.

I'd lobby for the wording below, and can PR this if you 👍 (I wouldn't be miffed if the code stayed as-is):

> attributeList <- set_attributes(attributes = foo$attributes)
Warning message:
In set_attribute(attributes[i, ], factors = factors) :

Unit 'hectopascal' is not a recognized standard unit; treating as a custom unit. Please be sure you also define a custom unit in your EML record, or replace with a recognized standard unit. See set_unitList() for details.
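
As a rough illustration of how that wording could be emitted, here is a base R sketch. `warn_custom_unit()` is a hypothetical helper written for this example and is not the actual EML source:

```r
# Hypothetical helper that emits the proposed wording for a given unit.
warn_custom_unit <- function(unit) {
  warning(
    "Unit '", unit, "' is not a recognized standard unit; treating as a custom unit. ",
    "Please be sure you also define a custom unit in your EML record, ",
    "or replace with a recognized standard unit. See set_unitList() for details.",
    call. = FALSE
  )
}

# Capture the warning text instead of printing it.
msg <- tryCatch(
  warn_custom_unit("hectopascal"),
  warning = function(w) conditionMessage(w)
)
```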

cboettig commented 5 years ago

Cool, that makes sense. Happy to have a PR to fix the wording

amoeba commented 5 years ago

PR'd.