Additional Custom Metadata Fields

plazi / arcadia-project

Additional Custom Metadata Fields #170

Open mguidoti opened 3 years ago

mguidoti commented 3 years ago

@slint,

I'm having some issues on our email server, so @myrmoteras asked me to post this as a Github issue.

I basically sent you an email late last week asking for these additional custom metadata fields so I can push a small digital specimens dataset:

term	resolving url
scientificName	http://rs.tdwg.org/dwc/terms/scientificName
scientificNameID	http://rs.tdwg.org/dwc/terms/scientificNameID
taxonID	http://rs.tdwg.org/dwc/terms/taxonID
namePublishedInID	http://rs.tdwg.org/dwc/terms/namePublishedInID
catalogNumber	http://rs.tdwg.org/dwc/terms/catalogNumber
kingdom	http://rs.tdwg.org/dwc/terms/kingdom
phylum	http://rs.tdwg.org/dwc/terms/phylum
class	http://rs.tdwg.org/dwc/terms/class
order	http://rs.tdwg.org/dwc/terms/order
family	http://rs.tdwg.org/dwc/terms/family
genus	http://rs.tdwg.org/dwc/terms/genus
specificEpithet	http://rs.tdwg.org/dwc/terms/specificEpithet
scientificNameAuthor	http://rs.tdwg.org/dwc/terms/scientificNameAuthorship
scientificNameAuthorYear	http://rs.tdwg.org/dwc/terms/namePublishedInYear
basisOfRecord	http://rs.tdwg.org/dwc/terms/basisOfRecord
physicalSetting	http://rs.tdwg.org/ac/terms/physicalSetting
lifeStage	http://rs.tdwg.org/dwc/terms/lifeStage
sex	http://rs.tdwg.org/dwc/terms/sex
individualCount	http://rs.tdwg.org/dwc/terms/individualCount
institutionCode	http://rs.tdwg.org/dwc/terms/institutionCode
collectionCode	http://rs.tdwg.org/dwc/terms/collectionCode
otherCatalogNumbers	http://rs.tdwg.org/dwc/terms/otherCatalogNumbers
typeStatus	http://rs.tdwg.org/dwc/terms/typeStatus
identifiedBy	http://rs.tdwg.org/dwc/iri/identifiedBy
dateIdentified	http://rs.tdwg.org/dwc/terms/dateIdentified
country	http://rs.tdwg.org/dwc/terms/country
stateProvince	http://rs.tdwg.org/dwc/terms/stateProvince
county	http://rs.tdwg.org/dwc/terms/county
locality	http://rs.tdwg.org/dwc/terms/locality
decimalLatitude	http://rs.tdwg.org/dwc/terms/decimalLatitude
decimalLongitude	http://rs.tdwg.org/dwc/terms/decimalLongitude
verbatimElevation	http://rs.tdwg.org/dwc/terms/verbatimElevation
eventDate	http://rs.tdwg.org/dwc/terms/eventDate
recordedBy	http://rs.tdwg.org/dwc/terms/recordedBy
preparations	http://rs.tdwg.org/dwc/terms/preparations
associatedSpecimenReference	http://rs.tdwg.org/ac/terms/associatedSpecimenReference
captureDevice	http://rs.tdwg.org/ac/terms/captureDevice
resourceCreationTechnique	http://rs.tdwg.org/ac/terms/resourceCreationTechnique
subjectOrientation	http://rs.tdwg.org/ac/terms/subjectOrientation
subjectPart	http://rs.tdwg.org/ac/terms/subjectPart
creator	http://purl.org/dc/elements/1.1/creator
rightsHolder	http://purl.org/dc/terms/rightsHolder

I think you told me in the past that this list is exactly what you need to make these additions, right?

Oh, and please, note that some of these you already added...

Cheers!

slint commented 3 years ago

I think you told me in the past that this list is exactly what you need to make these additions, right?

Generally yes, there are just a couple of things to clarify

Namespaces

We have to assign a "short" identifier for each namespace. I propose the following:

dwc: http://rs.tdwg.org/dwc/terms/
ac: http://rs.tdwg.org/ac/terms/
dc: http://purl.org/dc/terms/

`dc.creator`

Regarding dc:creator, it's better to use the http://purl.org/dc/terms/ namespace (vs the /elements/1.1 namespace), which is also recommended by the DublinCore docs. Thus there's only one new namespace added above.

Types of the fields

We currently assign a type to each custom field. These allow different kinds of searches:

text: full-text-like searches (e.g. the ones we have for the description field in Zenodo). Values in these fields are also lightly processed, i.e. the value "This software is open-source" would be split and searchable by the terms "this", "software", "open", "source".
keyword: exact match searches, even in mixed-case values (e.g. SARS-CoV-2 vs. sars-cov-2).
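To make the difference between the two types concrete, here is a minimal Python sketch. It is only an illustration of the idea, not Zenodo's actual search analyzer: the tokenizer is a simple regex, the helper names are made up, and it assumes that a keyword match is case-sensitive.

```python
import re


def text_terms(value):
    """Split a value into lowercase word terms, roughly like a full-text analyzer."""
    return re.findall(r"[a-z0-9]+", value.lower())


def matches_text(query, value):
    """A 'text' field matches when every query term appears among the value's terms."""
    terms = set(text_terms(value))
    return all(term in terms for term in text_terms(query))


def matches_keyword(query, value):
    """A 'keyword' field matches only on the exact stored value (assumed case-sensitive)."""
    return query == value


print(text_terms("This software is open-source"))
print(matches_text("open source", "This software is open-source"))
print(matches_keyword("sars-cov-2", "SARS-CoV-2"))
```

With this sketch, a text field finds "open source" inside the longer sentence, while a keyword field only matches when the whole value is identical.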

Human-friendly labels

Another secondary thing is to have a "human-friendly" label for each term for the UI, i.e. "subjectPart" -> "Subject part".
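As a fallback when no curated label is available, a label could be derived from the camelCase term name itself. The following is a rough sketch; `humanize` is a hypothetical helper, not part of any existing configuration, and it assumes that all-caps parts such as "ID" should be kept as they are.

```python
import re


def humanize(term):
    """Turn a camelCase term such as 'subjectPart' into a label such as 'Subject part'."""
    # Insert a space wherever a lowercase letter or digit is followed by an uppercase letter.
    words = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", term).split()
    if not words:
        return ""
    # Lowercase ordinary words, but keep acronyms such as "ID" untouched.
    words = [word if word.isupper() else word.lower() for word in words]
    words[0] = words[0][:1].upper() + words[0][1:]
    return " ".join(words)


print(humanize("subjectPart"))
print(humanize("scientificNameID"))
```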

Once we have the above confirmed we can add them to the current configuration of custom keywords and deploy to Sandbox and Production with a day's delay.

mguidoti commented 3 years ago

Hi @slint, thanks for your reply!

Ok, if I understand you correctly, you need me to amend the provided table with the types of fields and human-friendly labels. Is that correct?

Regarding the namespace observation and the dc:creator case, I've nothing to add other than 'sorry!'. All good for us.

Now, I sent this by email but I think it got lost with our server issues from last week, so I'm sending here again:

Additionally, there is one field specifically, identifiedBy, where he wants to include both name and ORCID, of one or multiple people. This is the single field that would require an array as far as I can see from the sample data. How would you recommend handling this?

Thanks in advance,

retog commented 3 years ago

I think we should add all terms to vocab.plazi.org

Another secondary thing is to have a "human-friendly" label for each term for the UI, i.e. "subjectPart" -> "Subject part".

Just wanted to point out, that the human friendly names are present in the ontology:

Here's an extract of the text/turtle version:

<http://rs.tdwg.org/ac/terms/subjectPart> rdfs:isDefinedBy <http://rs.tdwg.org/ac/terms/>;
    dcterms:isPartOf <http://rs.tdwg.org/ac/terms/>;
    dcterms:created "2013-10-28"^^xsd:date;
    dcterms:modified "2020-01-27"^^xsd:date;
    rdfs:label "Subject Part"@en;
    skos:prefLabel "Subject Part"@en;

mguidoti commented 3 years ago

@slint not sure if you saw my reply to your comment, but I think I still need some guidance!

Thanks in advance!

slint commented 3 years ago

I am sorry, I quickly read through and missed some points.

Ok, if I understand you correctly, you need me to amend the provided table with the types of fields and human-friendly labels. Is that correct?

That's optional for the time being since it's only for displaying them on the Zenodo record page in the sidebar. I think based on Reto's recommendation for the labels, it could be easily done after we add them to the accepted terms (so no rush on this one).

Regarding the namespace observation and the dc:creator case, I've nothing to add other than 'sorry!'. All good for us.

I had no idea either, just read through DublinCore and found out randomly :)

Now that I think of it, but maybe this is a longer discussion, aren't the creators and contributor fields that we already have on the form covering these values? We already actually serialize these in our DublinCore export format with the dc:creator tag (see example). Or is it meant to be used to declare a different type of creators? In that case the next point maybe sheds some light.

Additionally, there is one field specifically, identifiedBy, where he wants to include both name and ORCID, of one or multiple people. This is the single field that would require an array as far as I can see from the sample data. How would you recommend handling this?

So, if for example, you have "John Smith (ORCiD: 1234)" and "Jane Doe (ORCiD: 5678)" you could submit something like:

{
    ...,
    "custom": {
        "dwc:identifiedBy": [
            "John Smith",
            "1234",
            "Jane Doe",
            "5678"
        ]
    }
}

This would allow matching a search query on this custom keyword by any of the values, the exact name and/or the ORCiD (which might be good enough for now).
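A small Python sketch of how such a flat list could be produced from name/ORCiD pairs, and how a lookup by any single value would then behave. The helper names are hypothetical and the ORCiD values are the same dummy values used above.

```python
def flatten_people(people):
    """Flatten (name, orcid) pairs into one list of values; a missing ORCiD is skipped."""
    values = []
    for name, orcid in people:
        values.append(name)
        if orcid:
            values.append(orcid)
    return values


def record_matches(custom, field, query):
    """Exact-match lookup of a single value in a multi-valued custom field."""
    return query in custom.get(field, [])


custom = {
    "dwc:identifiedBy": flatten_people([("John Smith", "1234"), ("Jane Doe", "5678")])
}
print(custom["dwc:identifiedBy"])
print(record_matches(custom, "dwc:identifiedBy", "5678"))
```

Note that the flat list does not record which ORCiD belongs to which name; it only guarantees that each value can be found on its own.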

Related to the dc:creator issue I mentioned above, this e.g. looks like a more specific term for specifying a creator. If that is the use case, then maybe we don't need to add the dc:creator term at all, since it already maps to an existing property.

mguidoti commented 3 years ago

Hi @slint,

I think you're right about the creator... but I'm checking because there is also a recordedBy which could mean the creator of the digital record/photo... But I think we can move forward in the meanwhile, no? I guess it's on your plate?

Thanks a lot, and, let me know if you need anything!

slint commented 3 years ago

All terms have been added to Sandbox and Production, so the following metadata example should be possible now:

curl -X POST "https://sandbox.zenodo.org/api/deposit/depositions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $ACCESS_TOKEN" \
    --data @- << EOF
{
    "metadata": {
        ...
        "custom": {
            "ac:subjectOrientation": ["dorsal"],
            "dwc:identifiedBy": [
                "John Smith",
                "1234",
                "Jane Doe",
                "5678"
            ],
            "dc:rightsHolder": ["John Smith"]
        }
    }
}
EOF
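For scripted uploads, the same request could be prepared from Python. This is a sketch using only the standard library; `build_deposit_request` is a hypothetical helper, the token fallback is a placeholder, and the other required metadata fields (elided as `...` in the curl example) are left out here as well. The request is only built, not sent.

```python
import json
import os
import urllib.request

SANDBOX_URL = "https://sandbox.zenodo.org/api/deposit/depositions"


def build_deposit_request(metadata, token):
    """Prepare (but do not send) the POST request for a new deposition."""
    body = json.dumps({"metadata": metadata}).encode("utf-8")
    return urllib.request.Request(
        SANDBOX_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


metadata = {
    # The remaining required metadata fields would go here.
    "custom": {
        "ac:subjectOrientation": ["dorsal"],
        "dwc:identifiedBy": ["John Smith", "1234", "Jane Doe", "5678"],
        "dc:rightsHolder": ["John Smith"],
    }
}
request = build_deposit_request(metadata, os.environ.get("ACCESS_TOKEN", "changeme"))
print(request.get_method(), request.full_url)
# Sending would be: urllib.request.urlopen(request)
```

Building the body with `json.dumps` has the side benefit that the payload is always valid JSON.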

@mguidoti let's discuss maybe in today's call briefly what are the next steps.

mguidoti commented 3 years ago

@slint,

The only two fields from the .csv missing in the table above are identifiedByID and recordedByID. Both of them would be added like you mentioned above:

"dwc:identifiedBy": [
    "John Smith",
    "1234",
    "Jane Doe",
    "5678"
],

Could you confirm, map what you needed, and let me know when I can push it?

The idea as we discussed in the previous meeting is to push to sandbox for evaluation first.

Thanks!

slint commented 3 years ago

@mguidoti But identifiedByID and recordedByID are not part of the DWC terms. The existing fields also don't specify whether they expect a person's name or an identifier (although the definition does say something about the format):

Definition	A list (concatenated and separated) of names of people, groups, or organizations who assigned the Taxon to the subject.
Notes	Recommended best practice is to separate the values in a list with space vertical bar space (` | `).
Examples	`James L. Patton`, `Theodore Pappenfuss | Robert Macey`

IMHO one value per name and identifier is a way we can make this work for search purposes as well, and in case there are in the future specific fields for adding an identifier only, we can revisit and update the metadata accordingly.

On our side both Sandbox and production systems have the custom keywords configuration deployed, so we can start testing the first uploads and see if the result looks good. We can have a call as well to check together and move things forward.

slint commented 3 years ago

Actually, I just came across https://github.com/tdwg/dwc/issues/102, which argues about potentially adding the recordedByID in DWC (and there's another issue for identifiedByID)... What's funny is that GBIF already has a DWC extension schema, which includes these new terms. Should we include them from the GBIF schema extension? Since this is just metadata we can always modify them later, in case they finally add the terms officially in DWC. WDYT @mguidoti @retog ?

I'll also share here later a suggested mapping from the .csv file you shared, to the custom metadata with some full examples.

On a side note, I'm also not a big fan of the bar-separated (|) names, since they make it impossible to search for specific values. One would have to search for the entire string, i.e. Smith, John | Doe, Jane, in that exact order, instead of just searching for the two terms Smith, John and Doe, Jane.
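If the source data arrives in the bar-separated form, it could be split into separate values before upload. A minimal sketch, where `split_names` is a hypothetical helper:

```python
def split_names(value, separator="|"):
    """Split a bar-separated list of names into individual, trimmed values."""
    return [part.strip() for part in value.split(separator) if part.strip()]


print(split_names("Smith, John | Doe, Jane"))
```

Each resulting value can then be stored as its own entry in the custom field and searched for individually.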

slint commented 3 years ago

@mguidoti here's the first example of a Zenodo request JSON metadata mapping, using the first line from the .csv file. Some quick points to be figured out or discussed:

What should be the title and description? We could use something like the dwc:scientificName value "Eremohaplomydas gobabebensis Boschert and Dikow, 2021"
What should be the license? CC-BY(-4.0)?
What's the publication_date? I used the dwc:eventDate, but we could also leave it out and just consider the date of publishing on Zenodo
For related_identifiers, for the "master"/"label" records we want to link to the different images that compose the different orientations of the specimens. On the pictures, we want to link to the "master"/"label" record
The upload_type for the "master" record is physicalobject. For the pictures we want upload_type: image, and image_type: photo? Or image_type: figure?
The method field could be a concatenation of "ac:captureDevice" and "ac:resourceCreationTechnique"

{
    'metadata': {
        # I used "dwc:eventDate", but it could be left empty (and
        # automatically take the current date of publishing)
        'publication_date': '2018-11-21',
        'upload_type': 'physicalobject',

        # TODO: What should be the title and description?
        'title': '???',
        'description': '???',

        'related_identifiers': [
            # TODO: If this is the "label"/"master" record, it should link to the different photos of the specimen
            {
                'identifier': '10.5281/zenodo.XYZ',
                'relation': 'hasPart',
                'resource_type': 'image-photo',  # TODO: Or should it be "image-figure"? 
            },
            {
                'identifier': '10.5281/zenodo.XYZ',
                'relation': 'hasPart',
                'resource_type': 'image-photo',  # TODO: Or should it be "image-figure"? 
            },

        ],
        'license': 'cc-by',  # TODO: Or other?

        'communities': [{'identifier': 'biosyslit'}],

        # I concatenated "ac:captureDevice", and "ac:resourceCreationTechnique", but can also be left blank
        'method': 'GIGAmacro Magnify2, full-frame DSLR, 65 mm f2.8 macro-lens, twin-flash. focus stacking, 3:1, scale=5 mm',

        'creators': [{
            'name': 'T. Dikow', 
            'orcid': 'https://orcid.org/0000-0003-4816-2909',
            'affiliation': 'USNM',  # Is this correct?
        }],
        'locations': [
            {
                'lat': '-23.56333', 'lon': '15.03278',
                'place': 'Namib-Naukluft National Park, Gobabeb, dunes W of Kuiseb riverbed',
            },
        ],
        'custom': {
            'dwc:scientificName': ['Eremohaplomydas gobabebensis Boschert and Dikow, 2021'],
            'dwc:scientificNameID': ['http://zoobank.org/745D49C1-62B8-4884-9F7F-2B82523373D3'],
            'dwc:catalogNumber': ['USNMENT01518012'],
            'dwc:kingdom': ['Animalia'],
            'dwc:phylum': ['Arthropoda'],
            'dwc:class': ['Insecta'],
            'dwc:order': ['Diptera'],
            'dwc:family': ['Mydidae'],
            'dwc:genus': ['Eremohaplomydas'],
            'dwc:specificEpithet': ['gobabebensis'],
            'dwc:scientificNameAuthor': ['Boschert and Dikow'],
            'dwc:scientificNameAuthorYear': ['2021'],
            'dwc:basisOfRecord': ['PreservedSpecimen'],
            'dwc:lifeStage': ['Adult'],
            'dwc:sex': ['male'],
            'dwc:individualCount': ['1'],
            'dwc:institutionCode': ['USNM'],
            'dwc:collectionCode': ['Entomology'],
            'dwc:typeStatus': ['Paratype'],

            # After discussions, we'll be using GBIF's DWC extension for ORCIDs
            'dwc:identifiedBy': ['Boschert, C.', 'Dikow, T.'],
            'gbif-dwc:identifiedByID': ['https://orcid.org/0000-0003-4816-2909'],
            # Note that if we had multiple ORCIDs we should be storing them as separate values:
            # 'gbif-dwc:identifiedByID': ['https://orcid.org/0000-0003-4816-2909', 'https://orcid.org/0000-0003-1234-5678'],

            'dwc:dateIdentified': ['2019'],
            'dwc:country': ['Namibia'],
            'dwc:stateProvince': ['Erongo'],
            'dwc:locality': ['Namib-Naukluft National Park, Gobabeb, dunes W of Kuiseb riverbed'],
            'dwc:decimalLatitude': ['-23.56333'],
            'dwc:decimalLongitude': ['15.03278'],
            'dwc:verbatimElevation': ['401 m'],
            'dwc:eventDate': ['2018-11-21'],

            # Using GBIF's DWC extension for ORCIDs
            'dwc:recordedBy': ['Dikow, T.'],
            'gbif-dwc:recordedByID': ['https://orcid.org/0000-0003-4816-2909'],

            'dwc:preparations': ['Pinned'],
            'ac:captureDevice': ['GIGAmacro Magnify2, full-frame DSLR, 65 mm f2.8 macro-lens, twin-flash'],
            'ac:resourceCreationTechnique': ['focus stacking, 3:1, scale=5 mm'],
            'ac:subjectOrientation': ['dorsal'],
            'ac:subjectPart': ['whole organism habitus'],
            'dc:rightsHolder': ['Smithsonian Institution - public domain'],
        }
    }
}
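The `custom` part of such a mapping could be generated from each .csv row. The sketch below is only an illustration under assumptions: the column headers are assumed to be the bare term names, `TERM_PREFIX` is a hypothetical and deliberately incomplete lookup table, and the sample row contains just a few of the values from the example above.

```python
import csv
import io

# Hypothetical lookup from bare term name to namespace prefix (only a few terms shown).
TERM_PREFIX = {
    "scientificName": "dwc",
    "catalogNumber": "dwc",
    "subjectOrientation": "ac",
    "rightsHolder": "dc",
}


def row_to_custom(row):
    """Map one CSV row to the 'custom' dict, skipping empty and unknown columns."""
    custom = {}
    for column, value in row.items():
        prefix = TERM_PREFIX.get(column)
        value = (value or "").strip()
        if prefix and value:
            custom[f"{prefix}:{column}"] = [value]
    return custom


sample_csv = io.StringIO(
    "scientificName,catalogNumber,subjectOrientation,rightsHolder,notes\n"
    '"Eremohaplomydas gobabebensis Boschert and Dikow, 2021",USNMENT01518012,dorsal,,ignored\n'
)
row = next(csv.DictReader(sample_csv))
custom = row_to_custom(row)
print(custom)
```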

mguidoti commented 3 years ago

Ok, so, replying to the points you raised:

regarding recordedByID and identifiedByID: I think the recordedBy is the creator of the digital specimen and image/photo records, don't you think? If so, we shouldn't be worried about the recordedByID as I stated before, because I can simply add the ORCID as part of the creator data. I'm totally ok with the solution you proposed (of using the DWC extension schema temporarily) for the identifiedByID. As a matter of fact, I think this is entirely up to you to decide.

title: yes, the dwc:scientificName, definitely. And what about the photos? What should I use as title @slint?

description: I know it's a required field but I don't have many ideas here. Trying to keep in mind the Fabricius dataset, where metadata is scarce, I think we could settle for adding only the specimen code again. For photos, again, not sure. Another idea is to compile all fields and values, as provided by the partner, and add them as the description. But this might change one day (say, a reinterpretation of the hand-written locality), so, I'm not sure... What do you think?

license: yep, I think CC-BY(4.0), unless @myrmoteras has something else to say.

I think the publication_date is the date of publishing on Zenodo, because we are publishing the digital version of the specimen and the dwc:eventDate is the date of the collection of the specimen, to the best of my knowledge.

Yes, totally agree for the related_identifiers. Just don't forget that the master will now be associated with a txt file including only the specimen code (the only unique, reliable piece of info, that shouldn't change, and the minimum piece of info required for a digital specimen according to recent discussions within the community). The label picture will have a record on its own (upload_type: image, image_type: photo), and it will be associated as a related_identifier to the master record, just like the other images. What do you think, @slint?

I would say that the image_type is photo, not figure, as I understand figure as something more abstract than photo - and well, we have photos here. But that's my understanding. @myrmoteras?

I guess it's ok to replicate the info from ac:captureDevice and ac:resourceCreationTechnique in method.

I guess we're close..!

slint commented 3 years ago

regarding recordedByID and identifiedByID: I think the recordedBy is the creator of the digital specimen and image/photo records, don't you think? If so, we shouldn't be worried about the recordedByID as I stated before, because I can simply add the ORCID as part of the creator data. I'm totally ok with the solution you proposed (of using the DWC extension schema temporarily) for the identifiedByID. As a matter of fact, I think this is entirely up to you to decide.

If the data can be captured in the creators.orcid field then we're good, no need for adding the non-official fields. I wouldn't jump too early into adopting a new-ish DWC schema extension, especially since there's already discussion about adding the fields to DWC, and we would then have to revisit and update the metadata.

title: yes, the dwc:scientificName, definitely. And what about the photos? What should I use as title @slint?

Would something like Eremohaplomydas gobabebensis Boschert and Dikow, 2021 (whole organism habitus, dorsal), which is basically {dwc:scientificName} ({ac:subjectPart}, {ac:subjectOrientation}) make sense? This makes the title specific enough (based on how much the fields change for each entry), but still short and descriptive.
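As a sketch, the template could be filled from the custom metadata like this. `photo_title` and `first` are hypothetical helpers, and the sketch assumes each custom field holds a list of values of which the first one is used.

```python
def first(custom, key):
    """Return the first value of a custom field, or an empty string if it is missing."""
    values = custom.get(key) or [""]
    return values[0]


def photo_title(custom):
    """Build '{dwc:scientificName} ({ac:subjectPart}, {ac:subjectOrientation})'."""
    name = first(custom, "dwc:scientificName")
    details = [first(custom, key) for key in ("ac:subjectPart", "ac:subjectOrientation")]
    details = [detail for detail in details if detail]
    return f"{name} ({', '.join(details)})" if details else name


custom = {
    "dwc:scientificName": ["Eremohaplomydas gobabebensis Boschert and Dikow, 2021"],
    "ac:subjectPart": ["whole organism habitus"],
    "ac:subjectOrientation": ["dorsal"],
}
print(photo_title(custom))
```

When the part or orientation is missing, the sketch simply drops it from the parentheses rather than leaving an empty slot.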

I'm not sure if there's already some domain-specific convention for that though, so @mguidoti, @myrmoteras (or someone else) might have some insight.

description: I know it's a required field but I don't have many ideas here. Trying to keep in mind the Fabricius dataset, where metadata is scarce, I think we could settle for adding only the specimen code again. For photos, again, not sure. Another idea is to compile all fields and values, as provided by the partner, and add them as the description. But this might change one day (say, a reinterpretation of the hand-written locality), so, I'm not sure... What do you think?

Some logical concatenation of already existing information might be enough.

I think the publication_date is the date of publishing on Zenodo, because we are publishing the digital version of the specimen and the dwc:eventDate is the date of the collection of the specimen, to the best of my knowledge.

Sounds good, we leave it empty and automatically get the publishing date :+1:

Yes, totally agree for the related_identifiers. Just don't forget that the master will now be associated with a txt file including only the specimen code (the only unique, reliable piece of info, that shouldn't change, and the minimum piece of info required for a digital specimen according to recent discussions within the community). The label picture will have a record on its own (upload_type: image, image_type: photo), and it will be associated as a related_identifier to the master record, just like the other images. What do you think, @slint?

I agree, just wasn't sure about the hierarchy/organization of the different objects.
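To spell out the organization as I understand it, here is a sketch of the links in both directions. The helper names and the DOIs are placeholders, and using "isPartOf" for the link from an image back to the master record is my assumption (only "hasPart" appears in the example above).

```python
def master_links(image_dois):
    """related_identifiers for the master record: one 'hasPart' link per image record."""
    return [
        {"identifier": doi, "relation": "hasPart", "resource_type": "image-photo"}
        for doi in image_dois
    ]


def image_links(master_doi):
    """related_identifiers for an image record: a link back to the master record."""
    # "isPartOf" is assumed here as the inverse of "hasPart".
    return [{"identifier": master_doi, "relation": "isPartOf"}]


# Placeholder DOIs, for illustration only.
print(master_links(["10.5281/zenodo.0000001", "10.5281/zenodo.0000002"]))
print(image_links("10.5281/zenodo.0000000"))
```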

I guess we're close..!

Looks like it! Let's get this shipped :raised_hands: :rocket: :ant:

myrmoteras commented 3 years ago

@slint

Actually, I just came across tdwg/dwc#102, which argues about potentially adding the recordedByID in DWC (and there's another issue for identifiedByID)... What's funny is that GBIF already has a DWC extension schema, which includes these new terms. Should we include them from the GBIF schema extension?

I would recommend following what GBIF does, because their wide use makes them the trendsetter and thus contributes to their schema becoming a "standard". Also, there is a large effort in the biodiv community to assign ORCIDs to persons, which will increase their use. Our ORCIDs might be scarce, but the few will play a decisive role, especially in the publishing world (Pensoft, EJT)

slint commented 3 years ago

I would recommend following what GBIF does, because their wide use makes them the trendsetter and thus contributes to their schema becoming a "standard". Also, there is a large effort in the biodiv community to assign ORCIDs to persons, which will increase their use. Our ORCIDs might be scarce, but the few will play a decisive role, especially in the publishing world (Pensoft, EJT)

In that case, I've added gbif-dwc:recordedByID and gbif-dwc:identifiedByID to our configuration so we can use them. I've updated the original example, with the following changes:

        'custom': {
            ...,
            'dwc:identifiedBy': ['Boschert, C.', 'Dikow, T.'],
            'gbif-dwc:identifiedByID': ['https://orcid.org/0000-0003-4816-2909'],
            # Note that if we had multiple ORCIDs we should be storing them as separate values:
            # 'gbif-dwc:identifiedByID': ['https://orcid.org/0000-0003-4816-2909', 'https://orcid.org/0000-0003-1234-5678'],

            'dwc:recordedBy': ['Dikow, T.'],
            'gbif-dwc:recordedByID': ['https://orcid.org/0000-0003-4816-2909'],
            ...
        },
    ...

mguidoti commented 3 years ago

Great timing!

I just finished a dashboard for @myrmoteras and will be finally looking at this today.

Cheers!