Open atn38 opened 5 years ago
Hi An
You are correct that set_attributes
does not support multiple missing value codes, but you can still construct a valid EML document with multiple missing value codes using EML.
Here is a MRE:
me <- list(individualName = list(givenName = "Jeanette", surName = "Clark"))
attributes <- data.frame(attributeName = 'length_1',
attributeDefinition = 'def1',
measurementScale = 'ratio',
domain = 'numericDomain',
unit = 'meter',
numberType = 'real',
stringsAsFactors = FALSE)
att_list <- set_attributes(attributes)
att_list$attribute[[1]]$missingValueCode <- list(eml$missingValueCode(code = "A", codeExplanation = "exp 1"),
eml$missingValueCode(code = "B", codeExplanation = "exp 2"))
doc <- list(packageId = "id", system = "system",
dataset = list(title = "A Mimimal Valid EML Dataset",
creator = me,
contact = me,
dataTable = list(entityName = "data table", attributeList = att_list))
)
eml_validate(doc)
IMO adding support for this to set_attributes
would probably cause more problems than it would solve. A separate function, taking the attributeList, attributeName and a data.frame of missing value codes and definitions as arguments might be helpful. Carl or Bryce can maybe chime in on whether this would be appropriate for the EML package, or better suited for you to put in your own package.
Hope that helps!
Hi An, thanks for filing this issue!
Re:
IMO adding support for this to set_attributes would probably cause more problems than it would solve. A separate function, taking the attributeList, attributeName and a data.frame of missing value codes and definitions as arguments might be helpful. Carl or Bryce can maybe chime in on whether this would be appropriate for the EML package, or better suited for you to put in your own package.
I agree with @jeanetteclark here. set_attributes
has the limitation you point out and many others. Its limited ability is also part of its utility. There are a number of sub-elements underneath attribute
that are repeatable that the helper currently does not cover and, while adding support for one or two might be fine, I don't think we'd want to add them all.
Does @jeanetteclark 's "manual" solution work for you, or would writer a helper function for your group work? I could definitely help with that if needed.
@jeanetteclark I think this would be a great addition to the EML package. We need more of these helpers that make it easier to construct common patterns. Your MRE illustrates how much one needs to know about the EML schema to add the metadata to the EML document. Can you explain why you think this should be a separate function rather than an addition to set_attributes? Seems like this could be syntactically handled as:
attributes <- data.frame(attributeName = 'length_1',
attributeDefinition = 'def1',
measurementScale = 'ratio',
domain = 'numericDomain',
unit = 'meter',
numberType = 'real',
stringsAsFactors = FALSE,
missingValueCode = list(c(code = "A", codeExplanation = "exp 1"),
c(code = "B", codeExplanation = "exp 2")))
Would that work? Probably also nice if it accepted a non-list version as well in case someone only had to add one missing value code (so just missinngValueCode = c(code = "A", codeExplanation = "exp 1")
). Thoughts?
@mbjones the code you included above generates a somewhat garbled data.frame
. I agree with @amoeba that the value in set_attributes
is it's simplicity. data.frames
are easy to understand and easy to manipulate - I don't think we should muck them up with list columns or some other similar structure.
I like the idea of adding a helper function (set_missingValues
?)
Another option would be to add support to set_attributes
where the missing values codes and definitions are passed similar to the way that the factors
argument works. I don't prefer this though - adding missingValues
as an argument to set_attributes
to me would imply that if you want to set missing value codes you have to do it via that argument, when you really only need it if you have multiple missing value codes per attribute. This is in contrast to the factors
argument which is required for valid EML if you have an enumeratedDomain
. A similar set of functions to this case (high level helper and child element helper) exists in the package already - set_coverage
and set_taxonomicCoverage
Happy to hear arguments for the other side and write the function :)
Good point. And I forgot that as.data.frame converts the list to columns, which is indeed garbled (lists can be columns, but they have to be added separately I guess). I like your proposal to be consistent with how enumeratedDomain is handled with a separate factors
argument (although I wish that parameter were labeled domain
instead of factors). So, a new argument missingValueCodes
which takes a data.frame structured like factors
with columns attributeName
, code
, and codeExplanation
would be great and still pretty simple. So the new signature would be set_attributes(attributes, factors = NULL, col_classes = NULL, missingValueCodes = NULL)
. And a additional set_missingValues
would be a great helper too, but of course would only work once the attributes list was already created.
Yeah in thinking about it more - adding the argument to set_attributes
is probably the easiest way to go. It allows for easy addition of multiple missing value codes on multiple attributes, and as you said doesn't require having the attribute list already.
@amoeba: I'm working on a workflow/package to take a set of data.frames imported from PostgreSQL views (which will come from a metadata database schema designed for LTER sites), then use rEML under the hood to insert info from these dfs into appropriate slots, then validate and write EML docs. The workflow is meant to be reused with different datasets at different sites.
@jeanetteclark thank you for the MRE. Good to know that it's possible. I'd be interested in a general solution however, since I'm not writing custom EML-generation scripts for each dataset.
I like the idea of an additional argument to set_attributes()
that takes a three-column data.frame. Happy to help write it if I could be useful. We'd need a warning if there are already missingValueCode + Explanation
columns in the supplied attributes
data.frame.
+1 for @mbjones suggestion above (updating set_attributes to have a missingValueCodes argument and workflow paralleling that of the "factors" argument that currently populates codes/definitions for enumeratedDomain)
I submitted a PR with the addition of this feature for @cboettig and @amoeba to review - thanks for the suggestion @atn38 and @scelmendorf!
@jeanetteclark, thank you for the unbelievably quick feature add! Let me know if I got the behavior correct: (1) for a given attribute, the missingValues
argument takes precedence if there are code-def pairs in both attributes
and missingValues
. (2) If there is one code-def pair defined in attributes
but none in missingValues
, that code still makes it into EML, correct?
If above is true, then it'll be easier for the LTER metabase users to transition!
Edit: did some testing, (1) is true but (2) is not.
Hi @atn38 - yes you have the behavior correct.
In my testing, both of your listed cases are true. Have you set the columns in your attributes
data.frame as missingValueCode
and missingValueCodeExplanation
? I frequently forget what the verbiage is for these.
The code below shows both of your cases:
atts <- data.frame(attributeName = c('length_1', 'length_2'),
attributeDefinition = c('def1', 'def2'),
measurementScale = c('ratio', 'ratio'),
domain = c('numericDomain', 'numericDomain'),
unit = c('meter','meter'),
numberType = c('real','real'),
missingValueCode = c("NA", NA),
missingValueCodeExplanation = c("def",NA),
stringsAsFactors = FALSE)
missing_values <- data.frame(attributeName = c("length_1", "length_2" ,"length_2"), code = c("1", "2", "3"), definition = c("A", "B", "C"))
set_attributes(atts, missingValues = missing_values)
set_attributes(atts)
I'll submit a separate issue to improve the documentation and error checking for correct column names in the attributes
table.
I should get into the habit of MREs! Usually lots of moving pieces in workflow. And yes -- more documentation for set_attributes
would be very helpful for new users -- I learned via trial and error that it takes a bunch of columns that doc doesn't mention.
The second case I'm concerned with is more like this:
library(EML)
atts <- data.frame(attributeName = c('length_1', 'length_2', 'length_3'),
attributeDefinition = c('def1', 'def2', 'def3'),
measurementScale = c('ratio', 'ratio', 'ratio'),
domain = c('numericDomain', 'numericDomain', 'numericDomain'),
unit = c('meter','meter', 'kilometer'),
numberType = c('real','real', 'real'),
missingValueCode = c("NA", NA, 'code_here'),
missingValueCodeExplanation = c("def", NA, 'explanation_here'),
stringsAsFactors = FALSE)
missing_values <- data.frame(attributeName = c("length_1", "length_2" ,"length_2"), code = c("1", "2", "3"), definition = c("A", "B", "C"))
atts1 <- set_attributes(atts, missingValues = missing_values)
atts2 <- set_attributes(atts)
Interesting behavior here with atts1
, despite length_3
not appearing in missingValues
at all, set_attributes
still returns attribute 'length_3' has missing value codes set in both the 'attributes' and 'missingValues' data.frames. Using codes from 'missingValues' data.frame.
Also note how two missingValueCode
elements are returned for length_3
in atts1
, neither of which is what we want (we would want ("code_here", "explanation_here")
which is defined in attributes
):
atts1[["attribute"]][[3]][["missingValueCode"]]
[[1]]
[[1]]$code
[1] NA
[[1]]$codeExplanation
[1] NA
[[2]]
[[2]]$code
character(0)
[[2]]$codeExplanation
character(0)
Hopefully this makes sense!
Ah yes, thanks for that example. That definitely is not what we want! I'll look into it
okay @atn38 I made the function smarter to handle these cases. Would you mind installing the package from my fork to see if it works as you expect now?
devtools::install_github("jeanetteclark/EML")
Just installed in a fresh project with packrat enabled -- so I'm pretty
sure it's your fork -- any way to check definitively? Nothing seems to have
changed when above MRE is run. Still returns the attribute 'length_3' has missing value codes set in both the 'attributes' and 'missingValues' data.frames. Using codes from 'missingValues' data.frame
error and length_3
has two incorrect elements.
On Thu, Apr 18, 2019 at 12:34 PM Jeanette Clark notifications@github.com wrote:
okay @atn38 https://github.com/atn38 I made the function smarter to handle these cases. Would you mind installing the package from my fork to see if it works as you expect now?
devtools::install_github("jeanetteclark/EML")
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ropensci/EML/issues/268#issuecomment-484604892, or mute the thread https://github.com/notifications/unsubscribe-auth/AKAZD5RVRYHLN2ET5DIEIHTPRCWIXANCNFSM4HGJ54VQ .
@atn38 I just updated the documentation for the set_attributes
function. If you install again and look in the Details section of the help for set_attributes
you should see missingValueCode and missingValueCodeExplanation added to the "for all data" section
Works beautifully now -- thank you!
(I looked through prev. issues + set_attributes.R and found no reference to this. Apologies if feature's already implemented.)
It's important for scientific interpretation that metadata documents explanations for missing values. "data not collected due to field conditions" is different from "no specimens were found". EML spec supports repeated
missingValueCode
elements.If we want to implement this, how should the input to
set_attributes
look like? Most simple might be for themissingValueCode
andmissingValueCodeExplanation
columns in the data.frame supplied toset_attributes
to take two paired comma separated list of strings and parse them into sets ofmissingValueCode
EML elements.folks over at LTER would like to have this feature in a core metadata database design. How this gets done here might bear on how we do this in database schema and then in R workflow to generate EML, which uses this package under the hood.