globalbioticinteractions / carvalheiro2023

GloBI configuration to help index Luisa Carvalheiro, José Augusto Salim, Filipi Soares, Debora Drucker. 2023. WorldFAIR pilot data from: VisitationData_Luisa_Carvalheiro.

Create EML examples for elton export #1

Open zedomel opened 10 months ago

jhpoelen commented 10 months ago

from meeting 2023-11-27 https://docs.google.com/document/d/1MKUFLdGscFODvkW8NfzrP8LhJbDks3L0FXLyae6yZ34/edit

zedomel commented 10 months ago

@jhpoelen and Filipi

I included a file in the repository eml-example.xml with metadata for this dataset in EML standard. One interesting thing in EML is the possibility to describe dataTables which are very similar to the globi.json:tableSchema.

Do you think that, instead of including a JSON file in globi.json format, we could work with the EML JSON-LD format (eml-jsonld.json)? The idea is to use EML JSON-LD to perform the same setup as the GloBI JSON format. It could keep the same name, globi.json, but use a different format (EML JSON-LD).

Using EML JSON-LD, we can provide dataset metadata and a definition for elton to read the dataset in the same file, using a well-known standard (FAIR, right?).

I'm curious to hear your thoughts.

thanks.

zedomel commented 10 months ago

In EML, the coverage terms can be automatically generated by elton.

It could be similar to what IPT does when creating/uploading datasets to GBIF (automatically generating EML coverage records from the data).
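A minimal sketch of how such coverage values could be derived from interaction records. The records, coordinates, and taxon names below are hypothetical placeholders, not values taken from the carvalheiro2023 data, and `summarize_coverage` is an illustrative helper rather than an existing elton function. The output keys borrow EML element names (`taxonomicCoverage`, `geographicCoverage`, `temporalCoverage`).

```python
from datetime import date

# Hypothetical records: names, coordinates, and dates are placeholders only.
records = [
    {"sourceTaxonName": "Bombus terrestris", "targetTaxonName": "Hedera helix",
     "decimalLatitude": 51.46, "decimalLongitude": -2.63,
     "eventDate": date(2004, 5, 10)},
    {"sourceTaxonName": "Episyrphus balteatus", "targetTaxonName": "Hedera helix",
     "decimalLatitude": 51.47, "decimalLongitude": -2.62,
     "eventDate": date(2004, 9, 27)},
]


def summarize_coverage(rows):
    """Derive taxonomic, geographic, and temporal ranges from the rows."""
    taxa = sorted({row[key] for row in rows
                   for key in ("sourceTaxonName", "targetTaxonName")})
    lats = [row["decimalLatitude"] for row in rows]
    lons = [row["decimalLongitude"] for row in rows]
    dates = [row["eventDate"] for row in rows]
    return {
        "taxonomicCoverage": taxa,
        "geographicCoverage": {
            "westBoundingCoordinate": min(lons),
            "eastBoundingCoordinate": max(lons),
            "northBoundingCoordinate": max(lats),
            "southBoundingCoordinate": min(lats),
        },
        "temporalCoverage": {
            "beginDate": min(dates).isoformat(),
            "endDate": max(dates).isoformat(),
        },
    }


coverage = summarize_coverage(records)
```

Values computed this way could double as fixtures for unit tests once real examples from the dataset are available.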

What do you think?

tks

jhpoelen commented 10 months ago

@zedomel enriching metadata (e.g., taxonomic range, geographic range, and temporal range) from the data sounds like a great idea! Can you provide specific examples of these values based on the carvalheiro2023 dataset? We can use these examples in unit tests to develop the functionality.

jhpoelen commented 10 months ago

re: eml.json - thanks for making that EML example! I've left some comments on the commit and also created https://github.com/globalbioticinteractions/globalbioticinteractions/issues/942 as a way to explore the idea of using EMLs for GloBI indexing configurations.

Filipi-Soares commented 10 months ago

Very nice example, @zedomel. Much easier to understand the whole data structure now. Regarding the data annotation in the spreadsheet, how should we do it? I see that in the example you shared, targetTaxonName and sourceTaxonName are the attribute IDs, and the attributeName is ScientificName. Should we annotate the spreadsheets with targetTaxonName / sourceTaxonName or just ScientificName? My opinion is that we should adopt targetTaxonName / sourceTaxonName; otherwise we lose a lot in terms of semantics. I searched the EML documentation, but I couldn't find these terms. Are they there? If they are not, we can add them to the list of potential extensions.

        <attribute id="targetTaxonName">
            <attributeName>Scientific Name</attributeName>
            <attributeDefinition>http://rs.tdwg.org/dwc/terms/scientificName</attributeDefinition>
            <storageType>string</storageType>
        </attribute>
        <attribute id="sourceTaxonName">
            <attributeName>Scientific Name</attributeName>
            <attributeDefinition>http://rs.tdwg.org/dwc/terms/scientificName</attributeDefinition>
            <storageType>string</storageType>
        </attribute>

jhpoelen commented 10 months ago

I just added an example of using eml.xml to define table schemas for interaction data.

See https://github.com/globalbioticinteractions/globalbioticinteractions/issues/942

Note that I've disabled the globi.json in the carvalheiro2023 repository by renaming it to globi.json.disabled. This means that the GloBI table config is now driven by the data in eml.xml.
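To make the idea concrete, here is a rough sketch of reading column definitions from EML `attribute` elements, using the two attributes shown earlier in this thread. It is an illustration only: `table_schema_from_eml` is a hypothetical helper, the output keys (`name`, `titles`, `datatype`) are loosely modeled on a CSVW-style tableSchema, and none of this is claimed to be how Elton actually reads eml.xml.

```python
import xml.etree.ElementTree as ET

# The two attributes from the EML example discussed in this thread,
# wrapped in an attributeList so the snippet parses on its own.
EML_SNIPPET = """
<attributeList>
  <attribute id="targetTaxonName">
    <attributeName>Scientific Name</attributeName>
    <attributeDefinition>http://rs.tdwg.org/dwc/terms/scientificName</attributeDefinition>
    <storageType>string</storageType>
  </attribute>
  <attribute id="sourceTaxonName">
    <attributeName>Scientific Name</attributeName>
    <attributeDefinition>http://rs.tdwg.org/dwc/terms/scientificName</attributeDefinition>
    <storageType>string</storageType>
  </attribute>
</attributeList>
"""


def table_schema_from_eml(xml_text):
    """Turn EML attribute elements into a list of column definitions."""
    root = ET.fromstring(xml_text)
    columns = []
    for attribute in root.iter("attribute"):
        columns.append({
            "name": attribute.get("id"),  # GloBI term used as the column name
            "titles": attribute.findtext("attributeName"),
            "datatype": attribute.findtext("storageType"),
        })
    return {"columns": columns}


schema = table_schema_from_eml(EML_SNIPPET)
```

Note that both columns share the same `attributeName`, so in this sketch only the attribute `id` distinguishes the source taxon from the target taxon.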

Please review.

jhpoelen commented 10 months ago

@Filipi-Soares re:

My opinion is that we should adopt targetTaxonName / sourceTaxonName , otherwise we lose a lot in terms of semantics. I searched EML documentation, but I couldn't find these terms. Are they there? In case they are not, we add them to the list of potential extensions.

I do like the idea of aligning the terms at some point. I also realize that different communities (e.g., GloBI, REBIPP) use different terms, and it may take a little time to get the communities to speak the same language. Actually, it may take a very long time, if it happens at all. Before considering translation vs. normalization of terms, would it be an idea to first get our prototype / workflows working, and then consider ways to optimize / simplify?

Filipi-Soares commented 10 months ago

@jhpoelen I agree with you. These terminology alignments can be done later.

zedomel commented 10 months ago

Hi @Filipi-Soares, sourceTaxonName, targetTaxonName, and all the other attribute IDs are not from EML; they are from the GloBI dictionary. One option is to create IRI versions of the terms in the GloBI dictionary (e.g. https://globalbiotainteractions.org/terms/targetTaxonName).
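As a small illustration of that option, the sketch below mints an IRI for each dictionary term by appending the term name to a base IRI. The base is copied from the example in the comment above and is an assumption, not a registered or resolvable namespace; `term_iri` is a hypothetical helper.

```python
from urllib.parse import quote

# Assumed base IRI, taken from the example in the comment above.
TERMS_BASE = "https://globalbiotainteractions.org/terms/"


def term_iri(term):
    """Build an IRI for a dictionary term, percent-encoding unsafe characters."""
    return TERMS_BASE + quote(term, safe="")


term_iris = {term: term_iri(term)
             for term in ("sourceTaxonName", "targetTaxonName")}
```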

zedomel commented 10 months ago

I just added an example of using eml.xml to define table schemas for interaction data.

See globalbioticinteractions/globalbioticinteractions#942

Note that I've disabled the globi.json in the carvalheiro2023 repository by renaming it to globi.json.disabled. This means that the GloBI table config is now driven by the data in eml.xml.

Please review.

That is fantastic!!!!! ;-D

Filipi-Soares commented 9 months ago

@jhpoelen Do you have any example of EML records in RDF? I've been thinking about converting the EML metadata records we have to RDF, to give an alternative serialization to the final users. I checked the namespaces of EML, but they are a bit confusing. Could you help?

jhpoelen commented 9 months ago

I'd have to dig around to find examples of EML records in RDF.

To help me better understand what you have in mind, I was wondering:

What kind of queries are you thinking of having the RDF answer? Which use case are you thinking about? Who would use the eml.xml, eml.json, and eml.rdf and how would this work towards the WorldFAIR or other goals?

Filipi-Soares commented 9 months ago

Hey @jhpoelen, good questions. I'll do my best to address them.

What kind of queries are you thinking of having the RDF answer? -- At present, integrating a platform capable of running SPARQL queries into our project seems beyond our scope. Nevertheless, we could explore this as a potential development in the future. It's worth noting that FAIR metadata catalogs, like the FAIR Data Point (Santos, 2023), predominantly use RDF metadata records. I think that formatting our metadata records in RDF now would lay a foundation for future interoperability with similar platforms. Here are some citations that might be useful for this discussion:

"Principle I1 requests that a formal, accessible, shared, and broadly applicable language for knowledge representation be used to embed machine-actionable semantics (e.g., RDF/OWL, RuleML, CycL) but it gives no recommendation on how to select the best option in any particular use case". (Magagna et al. 2020).

However, we should also consider that we are using FIPs as a "FAIR Metadata Catalog" and it doesn't necessarily request RDF metadata.

"because the FIPWizard captures and outputs Community-specific FIPs as JSON, we have written custom pipelines to convert the FIP Wizard format to nanopublications [17] that can then be permanently published on the decentralized, federated nanopublication server network [18]" (Magagna et al. 2020).

Thus, it is up to us to decide how to implement it, but I still believe that having RDF metadata would be helpful in the sense of interoperability. I noticed that EML does not declare specific URI to each metadata element in the schema (at least I didn't find it). In this case, for a more precise RDF serialization, we could do a mapping of the metadata records to some metadata schema that was developed for use in semantic web applications, such as DCAT. In the end, we would have an EML metadata record, which we already have for all datasets, and also an alternative metadata record with DCAT descriptors in RDF.
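A sketch of what such a mapping could look like, written as a small crosswalk from EML element names to DCAT / Dublin Core properties and emitted as a JSON-LD-style dictionary. The crosswalk itself is an assumption made for illustration (it is not an agreed or published mapping), the dataset IRI is a placeholder, and `eml_to_dcat` is a hypothetical helper. Fields without a mapping are reported rather than silently dropped.

```python
import json

# Assumed crosswalk from EML element names to DCAT / Dublin Core terms.
EML_TO_DCAT = {
    "title": "dcterms:title",
    "abstract": "dcterms:description",
    "creator": "dcterms:creator",
    "intellectualRights": "dcterms:rights",
    "keyword": "dcat:keyword",
}

CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dcterms": "http://purl.org/dc/terms/",
}


def eml_to_dcat(eml_record, dataset_iri):
    """Map a flat EML-like record to DCAT terms; return (document, unmapped)."""
    doc = {"@context": CONTEXT, "@id": dataset_iri, "@type": "dcat:Dataset"}
    unmapped = []
    for field, value in eml_record.items():
        prop = EML_TO_DCAT.get(field)
        if prop is None:
            unmapped.append(field)  # keep track of what the mapping loses
        else:
            doc[prop] = value
    return doc, unmapped


# Field values below are taken from the metadata discussed in this thread;
# the dataset IRI is a placeholder.
record = {
    "title": "Plant-flower visitor network from Avon Gorge, UK",
    "creator": "Luisa Gigante Carvalheiro",
    "intellectualRights": "Creative Commons Attribution 4.0 International",
    "keyword": ["plant-pollinator interactions", "flower visitation"],
    "methods": "all taxa were identified by specialist taxonomists",
}
dcat_doc, unmapped_fields = eml_to_dcat(record, "https://example.org/dataset/carvalheiro2023")
dcat_json = json.dumps(dcat_doc, indent=2, ensure_ascii=False)
```

Listing the unmapped fields makes it visible which EML elements (here `methods`) have no counterpart in the generic schema and would only survive in the EML record.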

Which use case are you thinking about? -- All the datasets we are working with, but only the metadata records of these datasets (dataset title, creator name, etc.).

Who would use the eml.xml, eml.json, and eml.rdf and how would this work towards the WorldFAIR or other goals? -- As I said, I think we could have both implementations for the metadata: one in eml.xml or eml.json, and another in RDF (the latter as a mapping, if we conclude that it is not possible to implement EML in RDF).

PS.: Although it's not yet clear if EML can be effectively used in RDF format, I still recommend considering a mapping to broader metadata schemas like DCAT. EML is certainly relevant in biodiversity and ecology contexts, and its use there makes perfect sense. However, given that our metadata records incorporate generic elements, adopting a more universally applicable standard such as DCAT could significantly enhance interoperability – a critical component of the FAIR principles. This approach, in my opinion, would potentially offer greater interoperability benefits than exclusively using EML.

References

Magagna, B., Schultes, E. A., Pergl, R., Hettne, K. M., Kuhn, T., & Suchánek, M. (2020, September 21). Reusable FAIR Implementation Profiles as Accelerators of FAIR Convergence. https://doi.org/10.31219/osf.io/2p85g

Santos, L. O. B. da S., Burger, K., Kaliyaperumal, R., & Wilkinson, M. D. (2023). FAIR Data Point: A FAIR-Oriented Approach for Metadata Publication. Data Intelligence, 5(1), 163–183. https://doi.org/10.1162/dint_a_00160

deboradrucker commented 9 months ago

It would be great to have at least one RDF example in our report, along with a description of the steps taken to create it.

Although we don't have specific questions to ask now, as @Filipi-Soares mentioned, it would be one more step toward semantic interoperability and could be useful for future work.

Filipi-Soares commented 9 months ago

Here is an example of the metadata record in RDF:


@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix sdo: <https://schema.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<https://docs.google.com/spreadsheets/d/1cJ0qX9ppqHoSyqFykwYJef-DFOzoutthBXjwKRY81T8/edit#gid=359918449> rdf:type sdo:Dataset ;
    dcterms:title "Plant-flower visitor network from Avon Gorge, UK" ;
    sdo:creator [
        rdf:type foaf:Person ;
        foaf:name "Luisa Gigante Carvalheiro" ;
        foaf:mbox <mailto:lgcarvalheiro@gmail.com> ;
        foaf:based_near [
            rdf:type foaf:Location ;
            foaf:city "Goiania" ;
            foaf:country "Brazil"^^<http://www.w3.org/2001/XMLSchema#string> ;
        ] ;
        foaf:affiliation [
            rdf:type foaf:Organization ;
            foaf:name "Universidade Federal de Goiás" ;
        ] ;
    ] ;
    sdo:publisher [
        rdf:type foaf:Organization ;
        foaf:name "Universidade Federal de Goiás" ;
    ] ;
    sdo:license <https://creativecommons.org/licenses/by/4.0/> ;
    sdo:description """This dataset gathers information on interactions between plants
    and their flower visitors collected throughout 2004 (11 surveys covering local flowering season) in the Avon Gorge (England), an iconic field site well known for its rare plant populations. The study area (1480 m2) included a broad
    range of flowering plants, and overall the dataset shows information for 260 species (81 plant species, 179 insect species and morphospecies).""" ;
    dcterms:rights "Creative Commons Attribution 4.0 International" ;
    sdo:keywords "plant-pollinator interactions, flower visitation"^^<http://www.w3.org/2001/XMLSchema#string> ;
    dcterms:spatial "Avon Gorge, Bristol, England"^^<http://www.w3.org/2001/XMLSchema#string> ;
    dcat:startDate "2004-05-10"^^<http://www.w3.org/2001/XMLSchema#date> ;
    dcat:endDate "2004-09-27"^^<http://www.w3.org/2001/XMLSchema#date> ;
    dcterms:description "all taxa were identified by specialist taxonomists"^^<http://www.w3.org/2001/XMLSchema#string> ;
    dwc:taxon [
        rdf:type dwc:Taxon ;
        dwc:scientificName "Hymenoptera"^^<http://www.w3.org/2001/XMLSchema#string> ;
    ] ;
    dwc:taxon [
        rdf:type dwc:Taxon ;
        dwc:scientificName "Diptera"^^<http://www.w3.org/2001/XMLSchema#string> ;
    ] ;
    dwc:taxon [
        rdf:type dwc:Taxon ;
        dwc:scientificName "Coleoptera"^^<http://www.w3.org/2001/XMLSchema#string> ;
    ] ;
    dwc:taxon [
        rdf:type dwc:Taxon ;
        dwc:scientificName "Heteroptera"^^<http://www.w3.org/2001/XMLSchema#string> ;
    ] ;
    dwc:taxon [
        rdf:type dwc:Taxon ;
        dwc:scientificName "Lepidoptera"^^<http://www.w3.org/2001/XMLSchema#string> ;
    ] ;
    dwc:taxon [
        rdf:type dwc:Taxon ;
        dwc:scientificName "Thysanoptera"^^<http://www.w3.org/2001/XMLSchema#string> ;
    ] ;
    dwc:MeasurementOrFact """A total of 11 survey visits were carried out from 10 May to 27 September 2004, this covering the main period of insect activity. Flower and insect surveys took place approximately every 14 days under dry conditions. In each flower abundance survey, a stratified random design was used to select 1 m2 quadrats in the study area. The area was divided into nine sub-areas based on habitat type and accessibility. Each sub-area was divided into 1 m2 quadrats and 2·5% (37) of these were randomly selected per sampling occasion. In each quadrat, the number of floral units of each plant species was recorded, defined as the distance that a small bee (c.1 cm length) would fly, rather than walk (Saville 1993). For example, in the Asteraceae, a flower unit is the entire inflorescence while in the Rosaceae, a flower unit is a single flower. Thus, the floral unit is defined from the bee’s perspective rather than by flower anatomy. Rare flowers which were missed using this method were included in the food web data as rare species with an abundance of two flower units (which was the lowest number of units observed in the plot for any species).
In the insect surveys, an observation point was chosen for each flowering plant species by randomly selecting one of the quadrats where the species was present. All the flowering units that could be surveyed by a single observer (approximately a semi-circle with 1-m radius) were observed for 20 min. On consecutive sampling occasions, plant species were rotated through three time slots, the morning (09.00–12.00 h), early afternoon (12.00–15.00 h) and late afternoon (15.00–18.00 h), to allow each species to be observed equally over time. At least two floral units were observed per plant species per sample. All flower–visitor interactions were recorded, and all visitors observed were collected for identification. To estimate the overall abundance of each plant species, the average number of flower units per 1 m 2 quadrat was multiplied by the total area of the study site. To estimate the interaction frequency for each visitor–plant species pair, we divided the total number of visits recorded by the number of flower units observed (per 20 min) and then multiplied by the total number of floral units in the study plot. By collecting the insects, we did not allow for repeated visits by the same individual; hence, some visitation frequencies may be underestimated. However, collecting specimens is essential for identification of most visitor species. Hymenoptera, Diptera, and Coleoptera were identified by taxonomists either to species or to morphospecies. Lepidoptera were identified to species by the authors and Heteroptera and parasitoids were morphotyped by the authors."""^^<http://www.w3.org/2001/XMLSchema#string> ;
    sdo:funder [
        rdf:type sdo:Organization ;
        sdo:name "Fundação para a Ciência e Tecnologia (FCT, Portugal)"^^<http://www.w3.org/2001/XMLSchema#string> ;
        sdo:identifier "Grant Number"^^<http://www.w3.org/2001/XMLSchema#string> ;
    ] ;
    dcterms:references "Carvalheiro, LG; Barbosa, E.R.M. & Memmott, J. 2008. Pollinator networks, alien species and the conservation of rare plants: Trinia glauca as a case study. Journal of Applied Ecology, 45,1419-1427. DOI: https://doi.org/10.1111/j.1365-2664.2008.01518.x"^^<http://www.w3.org/2001/XMLSchema#string> .

jhpoelen commented 9 months ago

@deboradrucker @Filipi-Soares great to have some examples of different ways to express dataset provenance and descriptors.

While rdf is often brought up as the go-to when semantic interoperability is mentioned, I do sometimes wonder what folks actually end up using when creating links to, or otherwise re-using, existing datasets. . .

jhpoelen commented 9 months ago

Note that I am using rdf/nquads to help document provenance. I find nquads useful because they are streamable. The rdf/trig format is a little more human-readable, though . . . but much harder to process with run-of-the-mill tools like "grep", "sed", etc.
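To illustrate the streaming point: each N-Quads statement sits on its own line, so a consumer can filter statements one line at a time without parsing the whole document, much like grep does. The statements below are made-up placeholders using `urn:example:` identifiers and a made-up timestamp, not actual GloBI provenance records, and `grep_statements` is a hypothetical helper.

```python
import io

# Made-up N-Quads statements; one complete statement per line.
NQUADS = """\
<urn:example:dataset> <http://purl.org/dc/terms/title> "Plant-flower visitor network from Avon Gorge, UK" <urn:example:graph> .
<urn:example:dataset> <http://www.w3.org/ns/prov#wasGeneratedBy> <urn:example:activity> <urn:example:graph> .
<urn:example:activity> <http://www.w3.org/ns/prov#startedAtTime> "2023-11-27T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:example:graph> .
"""


def grep_statements(lines, needle):
    """Yield statements containing needle, reading one line at a time."""
    for line in lines:
        if needle in line:
            yield line.rstrip("\n")


# io.StringIO stands in for a file or network stream of any size.
matches = list(grep_statements(io.StringIO(NQUADS), "prov#wasGeneratedBy"))
```

The same filter would not work reliably on Turtle or TriG, where a single statement can span several lines and depends on prefix declarations elsewhere in the file.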

Filipi-Soares commented 9 months ago

Hello @jhpoelen Happy new year!! Could you please share an example of how you are using rdf/nquads? I'm thinking about generating the metadata records this week for all datasets. What do you think?

jhpoelen commented 9 months ago

@Filipi-Soares hi! Good to hear from you. I like your idea of creating metadata for the WorldFAIR datasets this week, and I'd like to understand more about your approach. Can you provide some examples? Also, how do you imagine the metadata would be re-used in the data review reports?

PS I've noticed that @cboettig wrote an R package to translate EML into JSON-LD: https://github.com/ropensci/emld . How does this initiative compare to the one you had in mind?

Thanks for being patient as I am trying to better understand your vision.

jhpoelen commented 9 months ago

and I am hoping to integrate the metadata encodings we come up with in WorldFAIR into the GloBI search index / data products. This way, the work lives beyond the WorldFAIR work packages. . .

Filipi-Soares commented 9 months ago

Hey @jhpoelen :) So I was thinking of something like this: https://github.com/globalbioticinteractions/carvalheiro2023/issues/1#issuecomment-1855661190 Back to that conversation we have had before, using generic metadata schemas to create the metadata records may be an interesting strategy since they are domain-independent. However, I'd like to see how you make the metadata records for GloBI. Could you please share an example?

jhpoelen commented 9 months ago

@Filipi-Soares thanks for replying and for refreshing my memory on your idea to translate EML to JSON-LD.

You can find an example of a similar exercise that @cboettig did in their rOpenSci emld package -

the eml file - https://github.com/ropensci/emld/blob/master/inst/extdata/hf205.xml

was (automatically) translated into - https://github.com/ropensci/emld/blob/master/inst/extdata/hf205.json

Note how the namespaces are still the same (e.g., the EML namespace). A next step in a translation could be to convert EML to other ontologies or worlds.

Curious to hear what you think about re-using @cboettig's approach instead of coming up with something new. I am sure that a lot of thought was put into the emld package and this may be a nice opening to share your thoughts on improving them.

Filipi-Soares commented 9 months ago

@jhpoelen, I appreciate your reference to @cboettig's work in the rOpenSci emld package. It's enlightening to see how the EML file was translated into JSON-LD format, maintaining the original EML namespaces. This leads me to ponder the specific implementation of namespaces in this context. For example, would the intellectualRights metadata property translate to the URI eml://ecoinformatics.org/eml-2.1.0/intellectualRights? My concern here is about the resolvability of such URIs for linked data purposes.
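The concern can be sketched as follows, assuming the JSON-LD context declares a single `@vocab` equal to the EML namespace (an assumption made for illustration; the context actually shipped with emld should be checked). Under that assumption a bare term expands by concatenation, and the resulting IRI uses the `eml` scheme, which ordinary web clients cannot dereference.

```python
from urllib.parse import urlparse

# Assumed JSON-LD context: one @vocab that prefixes every bare EML term.
CONTEXT = {"@vocab": "eml://ecoinformatics.org/eml-2.1.0/"}


def expand_term(term, context):
    """Expand a bare term as JSON-LD @vocab does: vocab IRI + term."""
    return context["@vocab"] + term


def is_web_resolvable(iri):
    """True only for IRIs a web client could attempt to fetch (http/https)."""
    return urlparse(iri).scheme in ("http", "https")


rights_iri = expand_term("intellectualRights", CONTEXT)
rights_resolvable = is_web_resolvable(rights_iri)
dcterms_resolvable = is_web_resolvable("http://purl.org/dc/terms/rights")
```

Here `rights_iri` is `eml://ecoinformatics.org/eml-2.1.0/intellectualRights`, which identifies the term unambiguously but fails the scheme check, whereas the Dublin Core IRI passes it. The check only looks at the scheme; it says nothing about whether a given http(s) IRI still resolves.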

Furthermore, converting EML to an ontology raises a significant issue. Ideally, this task would require the involvement of the EML committee, particularly for registering ontology namespaces. Creating these namespaces independently could inadvertently imply taking ownership of them, which is not our intention. This is why I've been considering using generic or domain-agnostic vocabularies for generating metadata records, even though it might entail considerable effort.

While I acknowledge the potential benefits of adopting @cboettig's approach, these concerns about namespaces and ontology conversion are crucial to address. I'm eager to delve deeper into this and explore how we can enhance the existing methodologies.

jhpoelen commented 8 months ago

@Filipi-Soares thanks for sharing your thoughts on helping to mobilize knowledge captured in EML documents.

I can see the benefits of making namespace URIs clickable and having them resolve to an informative HTML landing page. And, in my mind, any link that resolves today should be expected to stop resolving some time in the (near) future, given natural phenomena such as link rot.

I can see some use cases for transforming the EML data . . .

  1. create BibTeX / RIS snippets for easy citation of datasets and their associated data reviews
  2. use EML's table definitions as a way to document the table definitions of some resource
  3. embed EML dataset descriptions in dataset reviews.

A first pass of implementing 2. is available and in use through Elton. https://github.com/globalbioticinteractions/globalbioticinteractions/issues/942 .

I've also started working on ways to move forward on items 1. (see also https://github.com/globalbioticinteractions/globalbioticinteractions/issues/798 ) and 3. (https://github.com/globalbioticinteractions/globalbioticinteractions/issues/954).
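For use case 1, a rough sketch of turning dataset-level metadata into a BibTeX entry. The field values come from the repository description at the top of this page; the citation key, the choice of `@misc`, and the `eml_to_bibtex` helper are assumptions for illustration and do not describe the implementation tracked in the linked issues.

```python
def eml_to_bibtex(key, creators, year, title, url):
    """Format dataset metadata as a BibTeX @misc entry."""
    authors = " and ".join(creators)  # BibTeX separates authors with "and"
    lines = [
        f"@misc{{{key},",
        f"  author = {{{authors}}},",
        f"  title = {{{title}}},",
        f"  year = {{{year}}},",
        f"  url = {{{url}}}",
        "}",
    ]
    return "\n".join(lines)


bibtex = eml_to_bibtex(
    key="carvalheiro2023",
    creators=["Luisa Carvalheiro", "José Augusto Salim",
              "Filipi Soares", "Debora Drucker"],
    year=2023,
    title="WorldFAIR pilot data from: VisitationData_Luisa_Carvalheiro",
    url="https://github.com/globalbioticinteractions/carvalheiro2023",
)
```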

Hopefully 1-3 may help provide some guidance on how an EML->JSON-LD conversion could facilitate re-use of the metadata captured in EML files.

So, yes, having a way to make EML more readable sounds like a good idea, and I'd very much like to discuss use cases 1-3 and learn about other use cases that you'd like to explore, especially in the context of the WorldFAIR project.