isamplesorg / metadata

Collation of metadata examples and notes for the project
https://isamplesorg.github.io/metadata/

Use DataCite Subjects model for keywords #121

Closed smrgeoinfo closed 1 year ago

smrgeoinfo commented 1 year ago

The DataCite JSON schema implementation of subjects would work very nicely to accommodate extension vocabularies or categorization using other vocabularies. We could change 'keywords' to 'subjects' and use the DataCite model verbatim, or apply the same structure under keywords/keyword. Example:

    "subjects": [
        {
            "subject": "Glacier environment",
            "subjectScheme": "iSamples Sampled feature",
            "schemeUri": "https://w3id.org/isample/vocabulary/sampledfeature/0.9/sampledfeaturevocabulary",
            "valueUri": "https://w3id.org/isample/vocabulary/sampledfeature/0.9/glacierenvironment"
        },
        {
            "subject": "Transcription profiling"
        },
        {
            "subject": "Anthropogenic material",
            "schemeUri": "https://w3id.org/isample/vocabulary/material/0.9/materialsvocabulary"
        },
        {
            "subject": "Analytical preparation",
            "valueUri": "https://w3id.org/isample/vocabulary/specimentype/0.9/analyticalpreparation"
        },
        {
            "subject": "Aalenian Age",
            "subjectScheme": "Geologic Time Scale",
            "schemeUri": "https://w3id.org/gso/geologictimescale/ontology",
            "valueUri": "https://w3id.org/gso/geologictime/AalenianAge"
        }
    ]

All of these entries are valid against the DataCite JSON schema v4.3.

datadavev commented 1 year ago

Internally, I'd like the index to use concept URIs for vocabulary terms. That means any subject string must be mappable to a valueUri that points to a definition of the concept for which the subject is a label.

For any core schema, the subject can be used to find the concept (i.e. the definition) since we have the vocabulary in hand.

I think the subject could be used to find any concept that is an extension in use within an iSamples index, though some sort of cache warming may be needed. The problem is that a subject string is not mappable until we encounter a statement providing the correspondence. What do we then do with records encountered earlier that carry the same string, indexed while the mapping was still unknown? It is very expensive to impose any sort of iterative re-indexing when dealing with this number of records.

Subject strings that are not mappable to a concept will likely be ignored as a vocabulary term, though could perhaps be bundled into the full text search for the record.

The subject string alone is also problematic in that the same string may be used as a label for different concepts, thus creating ambiguity.
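The mapping problem described above can be sketched as a label lookup that reports ambiguity instead of guessing. This is only an illustration: the `VOCABULARY` table, the two `example.org` URIs for "Core", and the function names are assumptions, not part of the iSamples code base.

```python
from collections import defaultdict

# Hypothetical (label, concept URI) pairs. The first three URIs come from the
# example in this issue; the two "Core" entries are invented to show ambiguity.
VOCABULARY = [
    ("Glacier environment",
     "https://w3id.org/isample/vocabulary/sampledfeature/0.9/glacierenvironment"),
    ("Analytical preparation",
     "https://w3id.org/isample/vocabulary/specimentype/0.9/analyticalpreparation"),
    ("Aalenian Age", "https://w3id.org/gso/geologictime/AalenianAge"),
    ("Core", "https://example.org/vocab/specimentype/core"),
    ("Core", "https://example.org/vocab/sampledfeature/core"),
]


def build_label_index(pairs):
    """Map each normalized label to the set of concept URIs that use it."""
    index = defaultdict(set)
    for label, uri in pairs:
        index[label.strip().lower()].add(uri)
    return index


def resolve_subject(subject, index):
    """Return (status, uris) for one DataCite-style subject entry.

    status is "explicit" when the record supplies a valueUri, "mapped" when
    the label matches exactly one concept, "ambiguous" when it matches
    several, and "unmapped" when nothing is known about the label.
    """
    if subject.get("valueUri"):
        return "explicit", [subject["valueUri"]]
    key = subject.get("subject", "").strip().lower()
    candidates = sorted(index.get(key, ()))
    if len(candidates) == 1:
        return "mapped", candidates
    if candidates:
        return "ambiguous", candidates
    return "unmapped", []


LABEL_INDEX = build_label_index(VOCABULARY)
```

With this table, `resolve_subject({"subject": "Core"}, LABEL_INDEX)` reports `"ambiguous"` with both candidate URIs, and `{"subject": "Transcription profiling"}` comes back `"unmapped"`, which is the case that would fall through to full text search. The sketch does not address the re-indexing cost of labels that only become mappable later.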

smrgeoinfo commented 1 year ago

> Internally, I'd like the index to use concept URIs for vocabulary terms. That means any subject string must be mappable to a valueUri that points to a definition of the concept for which the subject is a label.

I agree. This opens the door to lots of interesting possibilities, like language localization, use of alt labels specific to some community interface, as well as semantic search using transitive closure and 'semantic proximity'.

> Subject strings that are not mappable to a concept will likely be ignored as a vocabulary term, though could perhaps be bundled into the full text search for the record.

Yes. We treat any existing keyword (not mapped to a URI) and any future 'subject' word (with no valueURI) as free text and put it in the full text index field.

Subjects with a valueURI from a scheme that we recognize as an extension could be added to the facet hierarchy. If the valueURI is not from a known scheme (ideally the schemeURI would be provided...), i.e. we don't know how it hangs off any of the iSamples categories, then it gets indexed as free text but presented in search results as a labeled link (where the target is the valueURI).
If we index valueURIs, users could also search by URI when they know one that might give them useful results.
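A minimal sketch of the routing rules above, assuming a scheme is "recognized" when the valueURI starts with a known prefix. The prefix list, the output keys, and the function name are hypothetical; the real indexer may decide this differently.

```python
# Assumption: a scheme counts as recognized when the valueUri starts with one
# of these prefixes. Only the iSamples vocabulary base is listed here.
KNOWN_SCHEME_PREFIXES = (
    "https://w3id.org/isample/vocabulary/",
)


def route_subject(subject):
    """Decide how one subject entry is indexed.

    Returns a dict with three lists:
      facet_uris - concept URIs to place in the facet hierarchy
      full_text  - strings to add to the full text index field
      links      - (label, uri) pairs to show as labeled links in results
    """
    label = subject.get("subject", "")
    value_uri = subject.get("valueUri")
    routed = {"facet_uris": [], "full_text": [], "links": []}

    if not value_uri:
        # No URI at all: the subject is plain free text.
        routed["full_text"].append(label)
    elif value_uri.startswith(KNOWN_SCHEME_PREFIXES):
        # URI from a recognized scheme: use it as a facet value.
        routed["facet_uris"].append(value_uri)
    else:
        # URI from an unknown scheme: index label and URI as text so both are
        # searchable, and keep the pair for display as a labeled link.
        routed["full_text"].extend([label, value_uri])
        routed["links"].append((label, value_uri))
    return routed
```

Under this prefix list the "Aalenian Age" entry from the example would be treated as an unknown scheme and shown as a labeled link; adding its base URI to `KNOWN_SCHEME_PREFIXES` would move it into the facet hierarchy.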

datadavev commented 1 year ago

The Solr index provides a mechanism for synonym matching at query time, which makes it much simpler to handle records that mix free text terms with more well-defined terms, including URIs that reference a vocabulary term.

smrgeoinfo commented 1 year ago

Does that address the use case where I want to search for samples categorized as geosciml:granite or any of its child concepts?
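One way to frame this use case is query expansion: collect the concept and all of its descendants (the transitive closure over narrower relations), then search for any of those URIs. The sketch below assumes a hypothetical `NARROWER` map with made-up `ex:` identifiers and a hypothetical index field name; it is not the actual GeoSciML hierarchy or the iSamples Solr schema.

```python
from collections import deque

# Hypothetical hierarchy fragment: concept -> its direct child concepts.
NARROWER = {
    "ex:igneous_rock": ["ex:granitoid"],
    "ex:granitoid": ["ex:granite", "ex:granodiorite"],
    "ex:granite": ["ex:monzogranite", "ex:syenogranite"],
}


def expand_concept(uri, narrower):
    """Return the concept plus all of its descendants (breadth-first)."""
    seen = {uri}
    queue = deque([uri])
    while queue:
        current = queue.popleft()
        for child in narrower.get(current, ()):
            if child not in seen:  # guards against cycles in the hierarchy
                seen.add(child)
                queue.append(child)
    return seen


def build_filter_query(field, uris):
    """Build a 'field:("a" OR "b")' query string over the given URIs."""
    quoted = " OR ".join(f'"{u}"' for u in sorted(uris))
    return f"{field}:({quoted})"


granite_query = build_filter_query(
    "material_uri",  # hypothetical field name
    expand_concept("ex:granite", NARROWER),
)
```

Expanding at query time keeps the index unchanged when the hierarchy changes, at the cost of longer queries for concepts with many descendants. Storing each record's ancestor URIs at index time is the usual alternative.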

smrgeoinfo commented 1 year ago

Working on an update to the LinkML YAML and JSON schema. I think the informal_classification property is redundant with keywords and suggest deprecating it. Any existing values (if there are any...) can be converted to Keywords/keyword values without loss of information.

smrgeoinfo commented 1 year ago

Addressed in PR #136; see https://github.com/isamplesorg/metadata/commit/938fbeda25a5bf816a990345dbebb28ceef4797a#diff-793708b3d263d7022df56554026a4df738d75ff25c345551e38151850e9eac23 (merged into development).