ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data
https://docs.ropensci.org/EML
Other
98 stars 33 forks source link

Lack of error/warning with conflicting attribute slots are set #213

Closed jeanetteclark closed 5 years ago

jeanetteclark commented 7 years ago

In using the package, we have noticed a sometimes frustrating lack of error generation when conflicting attributes are accidentally set.

Example: An attribute table is generated using this code:

attributes  <- data.frame(attributeName = c("Date",
                                            "Location",
                                            "Roll"),
                          attributeDefinition = c("Date in YYYY-MM-DD",
                                                  "Location Name",
                                                  "Year, orientation, and flight number of photographic set. The last two digits of the year, a letter representing the orientation of the camera, and a digit representing the flight number"),
                          measurementScale = c("dateTime", 
                                               "nominal",
                                               "nominal"),
                          domain = c("dateTimeDomain",
                                     "textDomain",
                                     "numericDomain"),
                          formatString = c("YYYY-MM-DD",NA,NA),
                          definition = c(NA,"Location Name","Year, orientation, and roll"),
                          unit = c(NA, NA, NA), 
                          numberType = c(NA, NA, NA),
                          missingValueCode = c(NA,NA,NA),
                          missingValueCodeExplanation = c(NA,NA,NA),
                          stringsAsFactors = FALSE)

Note that the third attribute is measurementScale "nominal", domain "textDomain".

When set_attributes is run on this table, it returns the following for the third attribute:

<attribute>
    <attributeName>Roll</attributeName>
    <attributeDefinition>Year, orientation, and flight number of photographic set. The last two digits of the year, a letter representing the orientation of the camera, and a digit representing the flight number</attributeDefinition>
  </attribute>

Which of course, results in the following validation error: "Element 'attribute': Missing child element(s). Expected is one of ( storageType, measurementScale )."

When attribute lists are very long, it can be very difficult to track down which attribute has the problem. Would it be possible to have the set_attributes function return a warning or error if there are conflicting slots, as opposed to it just not writing the slots and then returning the error as a validation problem much later down the line?

amoeba commented 7 years ago

This type of checking is probably extremely useful to users. I think part of the problem here is that the xml_validate function from the xml2 package doesn't print out the line/col of the error so it's hard to debug validation issues. It used to when this package used the XML facility for schema validation.

As a stop-gap for our users, I've implemented a helper function that can be used to isolate the problem attribute. Basically it creates a minimal EML document for each attribute with only that attribute and validates each document. An invalid document then means the attribute is invalid and the appropriate validation error message pops up. See https://github.com/NCEAS/arcticdatautils/blob/master/R/eml.R#L729

cboettig commented 7 years ago

@jeanetteclark Thanks for the bug report.

@amoeba True, though I think this highlights two other issues:

jeanetteclark commented 7 years ago

Thanks @cboettig, we do use excel to generate most of our attributes which are read into R as .csvs. This does help quite a bit with the bigger attribute tables, but I have found I still occasionally make mistakes and have to subsequently hunt them down.

amoeba commented 7 years ago

I wasn't aware of the checking set_attributes already did, so this is great. And I agree that checking before writing out to disk makes sense. I've PR'd a commit that adds a check to set_attributes for the combination of measurementScale and domain. You'll get errors like

Error: The attribute in row 1 has an invalid combination of measurementScale (dateTime) and domain (numericDomain). For a measurementScale of 'dateTime', domain must be 'dateTimeDomain'.

Is that useful? Are there more types of checks? I haven't looked so I'm sure there are.

cboettig commented 7 years ago

Thanks @amoeba , I think your #214 PR effectively addresses this, so will close.

There are probably other strategies that might be a less piecemeal approach to making sure the R object is always a valid EML subset (e.g. perhaps with S4 validation) though that would probably be tedious and not necessarily better. Meanwhile I think the approach of check_and_complete_attributes is very practical.