ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogeneous data
https://docs.ropensci.org/EML

Best practice for count variables? #206

Closed maelle closed 5 years ago

maelle commented 7 years ago

Again I should maybe ask this question elsewhere, but let's say it could be part of the documentation. :smile_cat:

If dealing with counts, e.g. a variable whose definition is "Number of windows in the house", does one need to input a unit? The unit could be windowsPerHouse, but does that make sense when the definition already clearly indicates what was counted?

I found this example in an EML document:

<attribute>
    <attributeName> count</attributeName>
    <attributeLabel> count</attributeLabel>
    <attributeDefinition>numer of the indicated species founf in the sample</attributeDefinition>
    <unit> </unit>
    <dataType> Integers</dataType>
    <attributeDomain>
        <numericDomain>
            <minimum>70.0</minimum>
            <maximum>99.0</maximum>
        </numericDomain>
    </attributeDomain>
    <missingValueCode> </missingValueCode>
    <precision> </precision>
</attribute>

However I'm not sure what this document actually corresponds to. Could I use numericDomain without unit if I add a dataType equal to integers? And most importantly is this best practice?

Also I'm sorry if this was already written somewhere in the documentation.

cc @carlesmila

maelle commented 7 years ago

I also don't know whether there is an example with e.g. relative humidity; in an EML document of mine I've written "dimensionless" as the unit.

amoeba commented 7 years ago

I think the unit you're looking for is number, defined in STMML and included in the standard units dictionary in EML as:

  <unit id="number" name="number" unitType="dimensionless">
    <description>a number</description>
  </unit>
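
For illustration only, here is a minimal sketch of how a count attribute might reference that unit. It assumes the EML 2.1 attribute layout (measurementScale / ratio / unit / standardUnit); the attribute name windowCount is hypothetical, and the definition is borrowed from the question above. It is not taken from any real dataset.

```xml
<attribute>
  <!-- hypothetical attribute name, chosen for this sketch -->
  <attributeName>windowCount</attributeName>
  <attributeDefinition>Number of windows in the house</attributeDefinition>
  <storageType>integer</storageType>
  <measurementScale>
    <ratio>
      <unit>
        <!-- "number" is the dimensionless standard unit quoted above -->
        <standardUnit>number</standardUnit>
      </unit>
      <numericDomain>
        <!-- counts are non-negative integers -->
        <numberType>whole</numberType>
        <bounds>
          <minimum exclusive="false">0</minimum>
        </bounds>
      </numericDomain>
    </ratio>
  </measurementScale>
</attribute>
```

In this layout the definition says what was counted, while the unit only records that the value is a plain number.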

mbjones commented 7 years ago

@maelle Counts and ratios of quantities continue to be problematic. @amoeba is right that the number unit has been our traditional response, but I also wrote up a summary of other approaches and issues in a related ticket NCEAS/eml#265

maelle commented 7 years ago

Thanks both, I'll read all that 👌

maelle commented 7 years ago

Should the current traditional response regarding counts be mentioned in the documentation of this package, e.g. in the vignette about units, since it'll probably be a common question? And also something about ratios? Or does all of this belong in the documentation of the EML standard rather than in this package's documentation?

If these things have no place in the package docs I guess I should close this issue. Thanks a lot for your insight in any case, it has been very useful!

amoeba commented 7 years ago

Attributes are one of the hardest parts of EML and I don't think, at least not for this case, the EML documentation is quite enough. Ideally, it would be! Given that, I'd put in a vote for having the vignette(s) about attributes/units cover these difficult cases so the user can stumble onto the right choice without too much digging. A short explanation of the issue followed by a "See the official EML standard documentation for more information" doesn't seem like a bad idea to me.

What about adding a ratio example and a count example right below the energy example in https://github.com/ropensci/EML/blob/master/vignettes/working-with-units.Rmd ? I could make that tweak.

We have a fully-worked build-an-EML-doc-from-scratch example that I still need to clean up and turn into a vignette.

cboettig commented 7 years ago

Agreed, I think our package documentation may as well address this, and using fully worked examples in vignettes to do so would be great. Ideally we can work this kind of thing into low-level documentation at some point (the creation of the R package should really be pulling out all the documentation strings from the XSD files when it generates the classes, but I haven't worked that up).

@amoeba Re cleaning up some fully-worked vignettes, that would be great, but I wonder if it's worth holding tight a bit longer for the S4 methods to be finished (#197). Then, for example, stuff like:

library(EML)

doc <- new("eml")
doc@dataset@title <- c(new("title", .Data = "test"))
doc@dataset@creator <- c(new("creator"))
doc@dataset@creator[[1]]@individualName <- c(new("individualName", surName = new("surName", .Data = "Maël")))
doc@dataset@contact <- c(new("contact"))
doc@dataset@contact[[1]]@individualName <- c(new("individualName", surName = new("surName", .Data = "Maël")))

Would instead look more like

eml <- eml()
dataset <- dataset(eml, title = "test")
mael <- creator(individualName = individualName(surName = "Maël"))
creator(dataset) <- mael 
contact(dataset) <- mael

amoeba commented 7 years ago

Deal.