ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data
https://docs.ropensci.org/EML

Including protocol causes validation issues #350

Open RobLBaker opened 1 month ago

RobLBaker commented 1 month ago

I hope this is the correct place to put this. I'm trying to add the protocol element to EML and it's causing EML::eml_validate() to produce errors. Going by the schema models in the documentation (see image below), I can't for the life of me see where the error is. Perhaps it would help if someone could point me to an example of valid EML with a protocol, but I haven't found any yet (i.e. 95% chance this is me missing something simple).

My questions are:

  1. Does the eml_validate() function and/or the schema behind it have an error regarding protocol?
  2. Does the eml.ecoinformatics.org/schema/ documentation have an error regarding protocol?
  3. If not, what am I doing wrong and how do I fix it?

Minimally valid EML with just the protocol element (so the problem isn't the protocol element itself):

<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.2" packageId="EXAMPLE title" xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd" system="unknown">
  <protocol>
    <title>test protocol title</title>
    <creator>
      <individualName>
        <surName>test</surName>
      </individualName>
    </creator>
    <distribution>
      <online>
        <url>https://doi.org/10.57830/xxxxxxx</url>
      </online>
    </distribution>
  </protocol>
</eml:eml>

Minimally valid EML with just a dataset element (so the dataset component is not the problem):

<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.2" packageId="Example: title" xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd" system="unknown">
    <dataset>
        <alternateIdentifier>doi: https://doi.org/10.57830/xxxxxxx</alternateIdentifier>
        <title>EXAMPLE: title</title>
        <creator>
            <individualName>
                <surName>EXAMPLE</surName>
            </individualName>
        </creator>
        <pubDate>2022-11-11</pubDate>
        <abstract>
            <para>The abstract goes here</para>
        </abstract>
        <intellectualRights>
            <para>This data package is released to the "public domain" under Creative Commons CC0 1.0 "No Rights Reserved" (see: https://creativecommons.org/publicdomain/zero/1.0/). It is considered professional etiquette to provide attribution of the original work if this data package is shared in whole or by individual components. A generic citation is provided for this data package on the website https://portal.edirepository.org (herein "website") in the summary metadata page. Communication (and collaboration) with the creators of this data package is recommended to prevent duplicate research or publication. This data package (and its components) is made available "as is" and with no warranty of accuracy or fitness for use. The creators of this data package and the website shall not be liable for any damages resulting from misinterpretation or misuse of the data package or its components. Periodic updates of this data package may be available from the website. Thank you.
            </para>
        </intellectualRights>
        <maintenance>
            <description>complete</description>
        </maintenance>
        <contact>
            <individualName>
                <surName>EXAMPLE</surName>
            </individualName>
        </contact>    
        <dataTable>
            <entityName>Example Intercept Observations</entityName>
            <entityDescription>just some example data</entityDescription>
            <physical>
                <objectName>Example_Data_Cleaned.csv</objectName>
                <size unit="bytes">191995</size>
                <authentication method="MD5">d2f8fe468e393c41c6dccf30bab1a91a</authentication>
                <dataFormat>
                    <textFormat>
                        <numHeaderLines>1</numHeaderLines>
                        <recordDelimiter>\n</recordDelimiter>
                        <attributeOrientation>column</attributeOrientation>
                        <simpleDelimited>
                            <fieldDelimiter>,</fieldDelimiter>
                        </simpleDelimited>
                    </textFormat>
                </dataFormat>
            </physical>
            <attributeList>
                <attribute>
                    <attributeName>scientificName</attributeName>
                    <attributeDefinition>The full scientific name for the observed species according to the Guide to the Vascular Plants of Florida, Second Edition published by Richard P. Wunderlin and Bruce F. Hansen in 2003. University Press of Florida.</attributeDefinition>
                    <storageType>string</storageType>
                    <measurementScale>
                        <nominal>
                            <nonNumericDomain>
                                <textDomain>
                                    <definition>The full scientific name for the observed species according to the Guide to the Vascular Plants of Florida, Second Edition published by Richard P. Wunderlin and Bruce F. Hansen in 2003. University Press of Florida.</definition>
                                </textDomain>
                            </nonNumericDomain>
                        </nominal>
                    </measurementScale>
                </attribute>
            </attributeList>
            <numberOfRecords>500</numberOfRecords>
        </dataTable>
    </dataset>
</eml:eml>

But if I put dataset and protocol together in a single document the EML is invalid:

<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.2" packageId="EXAMPLE title" xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd" system="unknown">
  <dataset>
    <alternateIdentifier>doi: https://doi.org/10.57830/xxxxxxx</alternateIdentifier>
    <title>EXAMPLE: title</title>
    <creator>
      <individualName>
        <surName>EXAMPLE</surName>
      </individualName>
    </creator>
    <pubDate>2022-11-11</pubDate>
    <abstract>
      <para>The abstract goes here</para>
    </abstract>
    <intellectualRights>
      <para>This data package is released to the "public domain" under Creative Commons CC0 1.0 "No Rights Reserved" (see: https://creativecommons.org/publicdomain/zero/1.0/). It is considered professional etiquette to provide attribution of the original work if this data package is shared in whole or by individual components. A generic citation is provided for this data package on the website https://portal.edirepository.org (herein "website") in the summary metadata page. Communication (and collaboration) with the creators of this data package is recommended to prevent duplicate research or publication. This data package (and its components) is made available "as is" and with no warranty of accuracy or fitness for use. The creators of this data package and the website shall not be liable for any damages resulting from misinterpretation or misuse of the data package or its components. Periodic updates of this data package may be available from the website. Thank you.
      </para>
    </intellectualRights>
    <maintenance>
      <description>complete</description>
    </maintenance>
    <contact>
      <individualName>
        <surName>EXAMPLE</surName>
      </individualName>
    </contact>    
    <dataTable>
      <entityName>Example Intercept Observations</entityName>
      <entityDescription>just some example data</entityDescription>
      <physical>
        <objectName>Example_Data_Cleaned.csv</objectName>
        <size unit="bytes">191995</size>
        <authentication method="MD5">d2f8fe468e393c41c6dccf30bab1a91a</authentication>
        <dataFormat>
          <textFormat>
            <numHeaderLines>1</numHeaderLines>
            <recordDelimiter>\n</recordDelimiter>
            <attributeOrientation>column</attributeOrientation>
            <simpleDelimited>
              <fieldDelimiter>,</fieldDelimiter>
            </simpleDelimited>
          </textFormat>
        </dataFormat>
      </physical>
      <attributeList>
        <attribute>
          <attributeName>scientificName</attributeName>
          <attributeDefinition>The full scientific name for the observed species according to the Guide to the Vascular Plants of Florida, Second Edition published by Richard P. Wunderlin and Bruce F. Hansen in 2003. University Press of Florida.</attributeDefinition>
          <storageType>string</storageType>
          <measurementScale>
            <nominal>
              <nonNumericDomain>
                <textDomain>
                  <definition>The full scientific name for the observed species according to the Guide to the Vascular Plants of Florida, Second Edition published by Richard P. Wunderlin and Bruce F. Hansen in 2003. University Press of Florida.</definition>
                </textDomain>
              </nonNumericDomain>
            </nominal>
          </measurementScale>
        </attribute>
      </attributeList>
      <numberOfRecords>500</numberOfRecords>
    </dataTable>
  </dataset>
  <protocol>
    <title>test protocol title</title>
    <creator>
      <individualName>
        <surName>test</surName>
      </individualName>
    </creator>
    <distribution>
      <online>
        <url>https://doi.org/10.57830/xxxxxxx</url>
      </online>
    </distribution>
  </protocol>
</eml:eml>

Specifically, I get the error:

EML::eml_validate(metadata)
[1] FALSE
attr(,"errors")
[1] "Element 'protocol': This element is not expected. Expected is one of ( annotations, additionalMetadata )."

From what I can tell, neither annotations nor additionalMetadata is a required element. Furthermore, when I do have additionalMetadata elements, I get this same error. If I remove the protocol element and have just the dataset and additionalMetadata elements, the EML is valid (so there is no problem with the additionalMetadata; I'm just not including those examples here for the sake of brevity).

image

RobLBaker commented 1 month ago

Ok, I think maybe I solved my problem. Does the diagram mean that an EML document can only describe a dataset OR a citation OR software OR a protocol, but not both a dataset and a protocol?

image

Assuming that is correct, is there a specific location where a published protocol describing data collection for a dataset should reside? Would it be (potentially one of many) referencePublication elements? Or a citation within a methodStep in the methods?

cboettig commented 1 month ago

> an EML document can only describe a dataset OR a citation OR software OR a protocol, but not both a dataset and a protocol?

That's my understanding. Like you suggest, I think there are multiple valid ways, but the methodStep seems natural for the kind of relationship I think you have here. @mbjones may be able to provide much better insight on the recommended way to link them.

mbjones commented 1 month ago

Yes, the model enforces that an EML document can describe only one resource at a time (that resource is what the EML document and its packageId describe). But if a protocol applies to a dataset, it can be described in the dataset/methods section, which is repeatable and supports multiple methodStep children and a hierarchy of substeps. See: https://eml.ecoinformatics.org/schema/eml-dataset_xsd.html#DatasetType_methods

For example:

/eml/dataset
├── methods
│   └── methodStep
├── methods
│   ├── methodStep
│   └── methodStep
│       └── subStep
└── methods
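
As a rough sketch of that layout (not validated here), the protocol from the failing document could be moved into a methodStep under dataset/methods. This assumes, going by the EML 2.2.0 schema documentation, that methodStep requires a description and accepts protocol children after it, and that methods sits after contact and before dataTable in the dataset sequence. The description text is a placeholder, not part of the original example.

```xml
<!-- Fragment only: assumed to go inside <dataset>, after <contact> and before <dataTable> -->
<methods>
  <methodStep>
    <!-- description is assumed to be required for a methodStep; this text is a placeholder -->
    <description>
      <para>Data were collected following the published protocol described below.</para>
    </description>
    <!-- the protocol block from the failing document, moved here unchanged -->
    <protocol>
      <title>test protocol title</title>
      <creator>
        <individualName>
          <surName>test</surName>
        </individualName>
      </creator>
      <distribution>
        <online>
          <url>https://doi.org/10.57830/xxxxxxx</url>
        </online>
      </distribution>
    </protocol>
  </methodStep>
</methods>
```

With this arrangement the top-level protocol element would be removed, leaving dataset as the only resource under eml:eml. Re-running EML::eml_validate() on the resulting document is the way to confirm it.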