ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data
https://docs.ropensci.org/EML
Other
98 stars 33 forks source link

support for multiple missing value codes in set_attributes #268

Open atn38 opened 5 years ago

atn38 commented 5 years ago

(I looked through prev. issues + set_attributes.R and found no reference to this. Apologies if feature's already implemented.)

It's important for scientific interpretation that metadata documents explanations for missing values. "data not collected due to field conditions" is different from "no specimens were found". EML spec supports repeated missingValueCode elements.

If we want to implement this, how should the input to set_attributes look like? Most simple might be for the missingValueCode and missingValueCodeExplanation columns in the data.frame supplied to set_attributes to take two paired comma separated list of strings and parse them into sets of missingValueCode EML elements.

folks over at LTER would like to have this feature in a core metadata database design. How this gets done here might bear on how we do this in database schema and then in R workflow to generate EML, which uses this package under the hood.

jeanetteclark commented 5 years ago

Hi An

You are correct that set_attributes does not support multiple missing value codes, but you can still construct a valid EML document with multiple missing value codes using EML.

Here is a MRE:

me <- list(individualName = list(givenName = "Jeanette", surName = "Clark"))

attributes <- data.frame(attributeName = 'length_1',
                         attributeDefinition = 'def1',
                         measurementScale = 'ratio',
                         domain = 'numericDomain',
                         unit = 'meter',
                         numberType = 'real',
                         stringsAsFactors = FALSE)

att_list <- set_attributes(attributes)

att_list$attribute[[1]]$missingValueCode <- list(eml$missingValueCode(code = "A", codeExplanation = "exp 1"),
                                                 eml$missingValueCode(code = "B", codeExplanation = "exp 2"))

doc <- list(packageId = "id", system = "system", 
               dataset = list(title = "A Mimimal Valid EML Dataset",
                              creator = me,
                              contact = me,
                              dataTable = list(entityName = "data table", attributeList = att_list))
)

eml_validate(doc)

IMO adding support for this to set_attributes would probably cause more problems than it would solve. A separate function, taking the attributeList, attributeName and a data.frame of missing value codes and definitions as arguments might be helpful. Carl or Bryce can maybe chime in on whether this would be appropriate for the EML package, or better suited for you to put in your own package.

Hope that helps!

amoeba commented 5 years ago

Hi An, thanks for filing this issue!

Re:

IMO adding support for this to set_attributes would probably cause more problems than it would solve. A separate function, taking the attributeList, attributeName and a data.frame of missing value codes and definitions as arguments might be helpful. Carl or Bryce can maybe chime in on whether this would be appropriate for the EML package, or better suited for you to put in your own package.

I agree with @jeanetteclark here. set_attributes has the limitation you point out and many others. Its limited ability is also part of its utility. There are a number of sub-elements underneath attribute that are repeatable that the helper currently does not cover and, while adding support for one or two might be fine, I don't think we'd want to add them all.

Does @jeanetteclark 's "manual" solution work for you, or would writer a helper function for your group work? I could definitely help with that if needed.

mbjones commented 5 years ago

@jeanetteclark I think this would be a great addition to the EML package. We need more of these helpers that make it easier to construct common patterns. Your MRE illustrates how much one needs to know about the EML schema to add the metadata to the EML document. Can you explain why you think this should be a separate function rather than an addition to set_attributes? Seems like this could be syntactically handled as:

attributes <- data.frame(attributeName = 'length_1',
                         attributeDefinition = 'def1',
                         measurementScale = 'ratio',
                         domain = 'numericDomain',
                         unit = 'meter',
                         numberType = 'real',
                         stringsAsFactors = FALSE,
                         missingValueCode = list(c(code = "A", codeExplanation = "exp 1"),
                                                 c(code = "B", codeExplanation = "exp 2")))

Would that work? Probably also nice if it accepted a non-list version as well in case someone only had to add one missing value code (so just missinngValueCode = c(code = "A", codeExplanation = "exp 1")). Thoughts?

jeanetteclark commented 5 years ago

@mbjones the code you included above generates a somewhat garbled data.frame. I agree with @amoeba that the value in set_attributes is it's simplicity. data.frames are easy to understand and easy to manipulate - I don't think we should muck them up with list columns or some other similar structure.

I like the idea of adding a helper function (set_missingValues?)

Another option would be to add support to set_attributes where the missing values codes and definitions are passed similar to the way that the factors argument works. I don't prefer this though - adding missingValues as an argument to set_attributes to me would imply that if you want to set missing value codes you have to do it via that argument, when you really only need it if you have multiple missing value codes per attribute. This is in contrast to the factors argument which is required for valid EML if you have an enumeratedDomain. A similar set of functions to this case (high level helper and child element helper) exists in the package already - set_coverage and set_taxonomicCoverage

Happy to hear arguments for the other side and write the function :)

mbjones commented 5 years ago

Good point. And I forgot that as.data.frame converts the list to columns, which is indeed garbled (lists can be columns, but they have to be added separately I guess). I like your proposal to be consistent with how enumeratedDomain is handled with a separate factors argument (although I wish that parameter were labeled domain instead of factors). So, a new argument missingValueCodes which takes a data.frame structured like factors with columns attributeName, code, and codeExplanation would be great and still pretty simple. So the new signature would be set_attributes(attributes, factors = NULL, col_classes = NULL, missingValueCodes = NULL). And a additional set_missingValues would be a great helper too, but of course would only work once the attributes list was already created.

jeanetteclark commented 5 years ago

Yeah in thinking about it more - adding the argument to set_attributes is probably the easiest way to go. It allows for easy addition of multiple missing value codes on multiple attributes, and as you said doesn't require having the attribute list already.

atn38 commented 5 years ago

@amoeba: I'm working on a workflow/package to take a set of data.frames imported from PostgreSQL views (which will come from a metadata database schema designed for LTER sites), then use rEML under the hood to insert info from these dfs into appropriate slots, then validate and write EML docs. The workflow is meant to be reused with different datasets at different sites.

@jeanetteclark thank you for the MRE. Good to know that it's possible. I'd be interested in a general solution however, since I'm not writing custom EML-generation scripts for each dataset.

I like the idea of an additional argument to set_attributes() that takes a three-column data.frame. Happy to help write it if I could be useful. We'd need a warning if there are already missingValueCode + Explanation columns in the supplied attributes data.frame.

scelmendorf commented 5 years ago

+1 for @mbjones suggestion above (updating set_attributes to have a missingValueCodes argument and workflow paralleling that of the "factors" argument that currently populates codes/definitions for enumeratedDomain)

jeanetteclark commented 5 years ago

I submitted a PR with the addition of this feature for @cboettig and @amoeba to review - thanks for the suggestion @atn38 and @scelmendorf!

atn38 commented 5 years ago

@jeanetteclark, thank you for the unbelievably quick feature add! Let me know if I got the behavior correct: (1) for a given attribute, the missingValues argument takes precedence if there are code-def pairs in both attributes and missingValues. (2) If there is one code-def pair defined in attributes but none in missingValues, that code still makes it into EML, correct?

If above is true, then it'll be easier for the LTER metabase users to transition!

Edit: did some testing, (1) is true but (2) is not.

jeanetteclark commented 5 years ago

Hi @atn38 - yes you have the behavior correct.

In my testing, both of your listed cases are true. Have you set the columns in your attributes data.frame as missingValueCode and missingValueCodeExplanation? I frequently forget what the verbiage is for these.

The code below shows both of your cases:

atts <- data.frame(attributeName = c('length_1', 'length_2'),
                   attributeDefinition = c('def1', 'def2'),
                   measurementScale = c('ratio', 'ratio'),
                   domain = c('numericDomain', 'numericDomain'),
                   unit = c('meter','meter'),
                   numberType = c('real','real'),
                   missingValueCode = c("NA", NA),
                   missingValueCodeExplanation = c("def",NA),
                   stringsAsFactors = FALSE)

missing_values <- data.frame(attributeName = c("length_1", "length_2" ,"length_2"), code = c("1", "2", "3"), definition = c("A", "B", "C"))

set_attributes(atts, missingValues = missing_values)
set_attributes(atts)

I'll submit a separate issue to improve the documentation and error checking for correct column names in the attributes table.

atn38 commented 5 years ago

I should get into the habit of MREs! Usually lots of moving pieces in workflow. And yes -- more documentation for set_attributes would be very helpful for new users -- I learned via trial and error that it takes a bunch of columns that doc doesn't mention.

The second case I'm concerned with is more like this:

library(EML)

atts <- data.frame(attributeName = c('length_1', 'length_2', 'length_3'),
                   attributeDefinition = c('def1', 'def2', 'def3'),
                   measurementScale = c('ratio', 'ratio', 'ratio'),
                   domain = c('numericDomain', 'numericDomain', 'numericDomain'),
                   unit = c('meter','meter', 'kilometer'),
                   numberType = c('real','real', 'real'),
                   missingValueCode = c("NA", NA, 'code_here'),
                   missingValueCodeExplanation = c("def", NA, 'explanation_here'),
                   stringsAsFactors = FALSE)

missing_values <- data.frame(attributeName = c("length_1", "length_2" ,"length_2"), code = c("1", "2", "3"), definition = c("A", "B", "C"))

atts1 <- set_attributes(atts, missingValues = missing_values)
atts2 <- set_attributes(atts)

Interesting behavior here with atts1, despite length_3 not appearing in missingValues at all, set_attributes still returns attribute 'length_3' has missing value codes set in both the 'attributes' and 'missingValues' data.frames. Using codes from 'missingValues' data.frame.

Also note how two missingValueCode elements are returned for length_3 in atts1, neither of which is what we want (we would want ("code_here", "explanation_here") which is defined in attributes):

atts1[["attribute"]][[3]][["missingValueCode"]]
[[1]]
[[1]]$code
[1] NA

[[1]]$codeExplanation
[1] NA

[[2]]
[[2]]$code
character(0)

[[2]]$codeExplanation
character(0)

Hopefully this makes sense!

jeanetteclark commented 5 years ago

Ah yes, thanks for that example. That definitely is not what we want! I'll look into it

jeanetteclark commented 5 years ago

okay @atn38 I made the function smarter to handle these cases. Would you mind installing the package from my fork to see if it works as you expect now?

devtools::install_github("jeanetteclark/EML")
atn38 commented 5 years ago

Just installed in a fresh project with packrat enabled -- so I'm pretty sure it's your fork -- any way to check definitively? Nothing seems to have changed when above MRE is run. Still returns the attribute 'length_3' has missing value codes set in both the 'attributes' and 'missingValues' data.frames. Using codes from 'missingValues' data.frame error and length_3 has two incorrect elements.

On Thu, Apr 18, 2019 at 12:34 PM Jeanette Clark notifications@github.com wrote:

okay @atn38 https://github.com/atn38 I made the function smarter to handle these cases. Would you mind installing the package from my fork to see if it works as you expect now?

devtools::install_github("jeanetteclark/EML")

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ropensci/EML/issues/268#issuecomment-484604892, or mute the thread https://github.com/notifications/unsubscribe-auth/AKAZD5RVRYHLN2ET5DIEIHTPRCWIXANCNFSM4HGJ54VQ .

jeanetteclark commented 5 years ago

@atn38 I just updated the documentation for the set_attributes function. If you install again and look in the Details section of the help for set_attributes you should see missingValueCode and missingValueCodeExplanation added to the "for all data" section

atn38 commented 5 years ago

Works beautifully now -- thank you!