Closed jeanetteclark closed 5 years ago
This type of checking is probably extremely useful to users. I think part of the problem here is that the xml_validate
function from the xml2
package doesn't print out the line/col of the error so it's hard to debug validation issues. It used to when this package used the XML
facility for schema validation.
As a stop-gap for our users, I've implemented a helper function that can be used to isolate the problem attribute. Basically it creates a minimal EML document for each attribute with only that attribute and validates each document. An invalid document then means the attribute is invalid and the appropriate validation error message pops up. See https://github.com/NCEAS/arcticdatautils/blob/master/R/eml.R#L729
@jeanetteclark Thanks for the bug report.
@amoeba True, though I think this highlights two other issues:
We'd rather catch this issue sooner before we validate / write to xml; @maelle has already done some very nice work in making set_attributes
check for certain validation issues, e.g. #169, so I think we could continue that work to catch this kind of issue.
Perhaps there's something we can also do on the user interface side to help avoid this. For instance, the column-wise data entry of data.frame
makes it pretty hard to line up the rows, so it is harder to spot issues like the use of numericDomain
on this text data. Using a row-based entry (like dplyr::tribble
, or reading in metadata entered in originally on a spreadsheet) can help. Maybe there's something else.
Thanks @cboettig, we do use excel to generate most of our attributes which are read into R as .csvs. This does help quite a bit with the bigger attribute tables, but I have found I still occasionally make mistakes and have to subsequently hunt them down.
I wasn't aware of the checking set_attributes
already did, so this is great. And I agree that checking before writing out to disk makes sense. I've PR'd a commit that adds a check to set_attributes for the combination of measurementScale
and domain
. You'll get errors like
Error: The attribute in row 1 has an invalid combination of measurementScale (dateTime) and domain (numericDomain). For a measurementScale of 'dateTime', domain must be 'dateTimeDomain'.
Is that useful? Are there more types of checks? I haven't looked so I'm sure there are.
Thanks @amoeba , I think your #214 PR effectively addresses this, so will close.
There are probably other strategies that might be a less piecemeal approach to making sure the R object is always a valid EML subset (e.g. perhaps with S4 validation) though that would probably be tedious and not necessarily better. Meanwhile I think the approach of check_and_complete_attributes
is very practical.
In using the package, we have noticed a sometimes frustrating lack of error generation when conflicting attributes are accidentally set.
Example: An attribute table is generated using this code:
Note that the third attribute is measurementScale "nominal", domain "textDomain".
When set_attributes is run on this table, it returns the following for the third attribute:
Which of course, results in the following validation error: "Element 'attribute': Missing child element(s). Expected is one of ( storageType, measurementScale )."
When attribute lists are very long, it can be very difficult to track down which attribute has the problem. Would it be possible to have the set_attributes function return a warning or error if there are conflicting slots, as opposed to it just not writing the slots and then returning the error as a validation problem much later down the line?