schemaorg / suggestions-questions-brainstorming

Suggestions, questions, and brainstorming

Considering adding a new property for dataset descriptions. #249

Open buhem opened 4 years ago

buhem commented 4 years ago

Datasets generally relate to a particular field of science, and often to a sub-field as well. Users typically look for data that matches their own speciality. To make searching and querying easier, it would be wise to have a property that captures this need. For example, in Linguistics, if I provide a dataset for a specific language that I describe, the data were collected to analyse a particular phenomenon. Such data are therefore useful mainly to people working in Linguistics who are interested in studying that phenomenon. Hence, I suggest adding a new property to Dataset to cover this need. It would help researchers and users find what they need easily, without spending a long time consulting irrelevant datasets.

A possible solution could be:

Example (JSON-LD serialisation):

{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "subjectOf": {
    "@type": "FieldOfScience",
    "name": "Linguistics",
    "about": {
      "@type": "FieldOfScience",
      "name": "Phonetics",
      "about": {
        "@type": "FieldOfScience",
        "name": "Acoustics",
        "about": {
          "@type": "FieldOfScience",
          "name": "Bilabial stops"
        }
      }
    }
  }
}
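For a consumer, the nesting in this example amounts to an ordered path from the broadest field to the narrowest. A minimal Python sketch of reading it that way, assuming the nested `about` layout proposed here (`FieldOfScience` is part of the proposal, not an existing schema.org type, and `field_path` is a hypothetical helper):

```python
import json

def field_path(node):
    """Walk nested 'about' objects and return the field names, broadest first."""
    names = []
    while isinstance(node, dict):
        if "name" in node:
            names.append(node["name"])
        node = node.get("about")
    return names

# Same structure as the proposal, embedded so the sketch is self-contained.
doc = json.loads("""
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "subjectOf": {
    "@type": "FieldOfScience", "name": "Linguistics",
    "about": {"@type": "FieldOfScience", "name": "Phonetics",
      "about": {"@type": "FieldOfScience", "name": "Acoustics",
        "about": {"@type": "FieldOfScience", "name": "Bilabial stops"}}}
  }
}
""")

print(field_path(doc["subjectOf"]))
```

The path form makes it easy to match on any level of the hierarchy, for example every dataset whose path contains "Phonetics".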

dr-shorthair commented 4 years ago

How about 'Field of Research' instead.

HughP commented 4 years ago

I believe this has come up before, and since datasets are often used outside of the field of study that generated them, such additional descriptions have been seen as unnecessarily restrictive. Someone could always describe a dataset with defined terms matched to an ontology (for linguistics one could match to the GOLD 2010 ontology, the OLAC ontology for language archives, or even Library of Congress subject terms). Another strategy would be to use the “about” key from “creative work”, since datasets are creative works.

Additionally, there is a debate within linguistics as to whether these (the many things that linguists consider data sets) are datasets. The dc/dcterms/dcmitype vocabularies suggest/define datasets as tabular data, whereas many things that linguists use are collections of texts or of audio files. That is not to say that linguists don’t use tabular data as well; we do. It’s just not usually considered the raw evidence for an academic or industry-relevant argument. “The data” ≠ dataset.


buhem commented 4 years ago

I am open to any suggestion. "Field of research" is fine too.

buhem commented 4 years ago

I don't think this kind of consideration has been tackled before, at least not in the issues. I looked for an open issue discussing this problem and found nothing.

There is no restriction; on the contrary, this will help people choose data more efficiently. If I am working in a particular field and I want to do interdisciplinary research, I can focus my queries on the fields of science I want to cross. If I have no idea where to look, I just have to leave this part of the query empty.

I know those ontologies, but they are dedicated to describing features of a specific language. My purpose is to specify the area of study to which a dataset relates. In this way, if I am looking for Morphology, I will not get results from Life Sciences or Mathematics, where this word is also used, but only from Linguistics, for instance.
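To make the Morphology case concrete, here is a rough Python sketch of such a query. Everything in it is illustrative: the `fieldOfScience` key stands in for the proposed property, the three dataset records are invented, and `matches` is a hypothetical helper.

```python
def matches(dataset, keyword, field=None):
    """True if the dataset mentions `keyword`, optionally restricted to one field."""
    has_keyword = keyword in dataset.get("keywords", [])
    if field is None:
        # An empty field query places no restriction on the results.
        return has_keyword
    return has_keyword and field in dataset.get("fieldOfScience", [])

# Invented records: the same keyword used in three different fields.
datasets = [
    {"name": "Berber verb paradigms", "keywords": ["Morphology"],
     "fieldOfScience": ["Linguistics"]},
    {"name": "Leaf shape measurements", "keywords": ["Morphology"],
     "fieldOfScience": ["Life Sciences"]},
    {"name": "Morphological operators on images", "keywords": ["Morphology"],
     "fieldOfScience": ["Mathematics"]},
]

hits = [d["name"] for d in datasets if matches(d, "Morphology", field="Linguistics")]
print(hits)
```

Leaving `field` unset returns all three records, which is the "no restriction" case described above.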

Schema.org doesn't define Dataset as you do. Here is how they describe Dataset:

These "datasets", unlike typical use of Schema.org, can be in arbitrary formats. For example, they may include data that is stored in collections of spreadsheet files, or as digital images, or in dedicated scientific, geospatial and engineering file formats.

philbarker commented 4 years ago

I would like to throw in fieldOfStudy as a suggested name, because it has wider applicability than datasets; in particular it would be useful for LearningResource and other education-related resources. Definition: "a field, discipline or educational subject for which the resource is relevant."

danbri commented 4 years ago

Thanks folks for the discussion. I can see value in exploring this further, but it would be more compelling if there were some commitment to implement from anyone consuming/using this markup.

I like the idea of a fieldOfStudy property being applicable across many CreativeWork types, but we need also to be mindful of the cost of adding more properties - it can be confusing to publishers (especially in the absence of any application declaring what they use/need) who will have to decide when to use /about, when to use /keywords etc. Presumably we would allow both simple textual labels and externally enumerated codelists (via DefinedTerm).
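A consumer would then need to accept both shapes. A rough Python sketch, assuming a hypothetical `fieldOfStudy` value that mixes plain strings and DefinedTerm objects (the `labels` helper and the example.org term set are made up for illustration):

```python
def labels(values):
    """Return plain-text labels from a mix of strings and DefinedTerm objects."""
    if not isinstance(values, list):
        # JSON-LD allows a single value where a list is also permitted.
        values = [values]
    out = []
    for value in values:
        if isinstance(value, str):
            out.append(value)
        elif isinstance(value, dict) and value.get("@type") == "DefinedTerm":
            # Prefer the human-readable name and fall back to the term code.
            label = value.get("name") or value.get("termCode")
            if label:
                out.append(label)
    return out

# Hypothetical property value: one text label, one codelist entry.
field_of_study = [
    "Linguistics",
    {"@type": "DefinedTerm", "name": "Phonetics",
     "inDefinedTermSet": "https://example.org/fields"},
]

print(labels(field_of_study))
```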

Finally on the matter of schema.org/Dataset scope, it is correct that we have an inclusive approach: many kinds of "data" are shared as datasets, and schema.org/Dataset is intended to cover those - regardless of format, paradigm etc. It could be geospatial dataset shapefiles, a spreadsheet, or a zip file full of interviews and photos...

I'm going to move this to the suggestions-questions-brainstorming repository but do please keep talking

HughP commented 4 years ago

@buhem

Now that I am at a computer and reading this whole thread I am better able to understand your proposal. I'm glad to see another linguist interested in schema.org. Maybe we can meet up in person and compare use cases (I have been working through the CVs of linguists and working to describe their works in schema.org terms.) I worked on a temporary contract at CNRS-LLACAN a few months ago. Maybe we can collaborate on something.

If I understand correctly, it seems you want to create some sort of recursive hierarchical category structure for terms (X > Y > x > y), and you also want a field like 'field of study'. As @philbarker points out, fieldOfStudy could have wider applications.

Help me better understand the use case here. What does the person describing a resource (and/or the consumer) get from this method of description that is not currently possible using https://schema.org/keywords or https://schema.org/about or even https://schema.org/DefinedTerm ? What sort of recommendation could be given to consumers for when they encounter the same terminology in differently articulated hierarchies? How should they treat this differently from a long list of https://schema.org/keywords :

Example 1.

{
"@context": "https://schema.org",
"@type": "Dataset",
"keywords": ["Linguistics", "Phonetics", "Acoustics", "Bilabial stops" ]
}

Example 2.

{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "keywords": [
    {
      "@type": "DefinedTerm",
      "termCode": "primary_text",
      "name": "Primary Text",
      "url": "http://www.language-archives.org/REC/type.html#primary_text",
      "inDefinedTermSet": "OLAC Linguistic Data Type Vocabulary"
    },
    {
      "@type": "DefinedTerm",
      "termCode": "language_description",
      "name": "Language Description",
      "url": "http://www.language-archives.org/REC/type.html#language_description",
      "inDefinedTermSet": "OLAC Linguistic Data Type Vocabulary"
    },
    {
      "@type": "DefinedTerm",
      "name": "Compressed",
      "url": "http://purl.org/linguistics/gold/Compressed",
      "sameAs": "http://www.linguistics-ontology.org/gold/2010/Compressed",
      "inDefinedTermSet": "GOLD2010"
    }
  ]
}

I only put one Gold2010 term in the description because the concepts are hierarchical, so I assume that if one invokes one concept (the more specific concept), the others also apply.
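That assumption only holds if the consumer actually expands a term to its broader concepts. A minimal Python sketch of that expansion, using a toy broader-term map built from the Linguistics > Phonetics > Acoustics > Bilabial stops chain earlier in this thread rather than the real GOLD hierarchy (`BROADER` and `with_ancestors` are hypothetical names):

```python
# Toy broader-term map taken from the example chain in this thread;
# a real consumer would load these relations from the ontology itself.
BROADER = {
    "Bilabial stops": "Acoustics",
    "Acoustics": "Phonetics",
    "Phonetics": "Linguistics",
}

def with_ancestors(term, broader=BROADER):
    """Return the term followed by every broader term up to the root."""
    chain = [term]
    seen = {term}
    while chain[-1] in broader:
        parent = broader[chain[-1]]
        if parent in seen:
            # Guard against cycles in malformed data.
            break
        chain.append(parent)
        seen.add(parent)
    return chain

print(with_ancestors("Bilabial stops"))
```

A consumer that does not do this expansion would only ever match on the single, most specific term.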

However, for linguistic collections of the evidentiary record specifically, I recommend a structure like this:

{
  "@context": "https://schema.org",
  "@type": "Collection",
  "keywords": [
    {
      "@type": "DefinedTerm",
      "termCode": "primary_text",
      "name": "Primary Text",
      "url": "http://www.language-archives.org/REC/type.html#primary_text",
      "inDefinedTermSet": "OLAC Linguistic Data Type Vocabulary"
    },
    {
      "@type": "DefinedTerm",
      "termCode": "language_description",
      "name": "Language Description",
      "url": "http://www.language-archives.org/REC/type.html#language_description",
      "inDefinedTermSet": "OLAC Linguistic Data Type Vocabulary"
    },
    {
      "@type": "DefinedTerm",
      "name": "Compressed",
      "url": "http://purl.org/linguistics/gold/Compressed",
      "sameAs": "http://www.linguistics-ontology.org/gold/2010/Compressed",
      "inDefinedTermSet": "GOLD2010"
    }
  ],
  "about": ["Linguistics", "Phonetics", "Acoustics", "Bilabial stops", "Berber"],
  "audience": ["Linguists", "Phoneticians", "Data Scientists"],
  "abstract": "A [fonds](https://en.wikipedia.org/wiki/Fonds) of 250 recorded words with bilabial stops. Segmented and split with annotations. Recorded in Algeria between 2018 and 2019.",
  "hasPart": {
    "@id": "SOME URL or DOI ID to the part.",
    "name": "main-audiofile.wav"
  },
  "contributor": {
    "@type": "Person",
    "name": "Some language speaker in Berber land"
  },
  "creator": {
    "@type": "Person",
    "name": "EL IDRISSI Mohamed",
    "affiliation": {
      "@type": "Organization",
      "@id": "http://www.inalco.fr//#organization",
      "name": {
        "@value": "Institut national des langues et civilisations orientales",
        "@language": "fr-FR"
      },
      "alternateName": "INALCO",
      "url": "http://www.inalco.fr"
    }
  }
}

Someone with more experience in writing JSON and schema.org JSON-LD should review this for accuracy.

Specific to the application in Linguistics: we often create creative works as the output of our endeavors. In documentary linguistics, one sub-field of linguistics, outputs are multi-file based. That is, we use an elicitation tool (like a standard word list) to guide an elicitation session. We record that elicitation. Then we use ELAN to annotate the resulting audio file and create a time-aligned XML file. Then we take that ELAN file and process it into a visual print media/text output, such as a PDF of Interlinear Glossed Text. These objects together form a (small) collection (the elicitation tool, the audio file, the XML annotation, and the derivative output). Then from that word list audio file we might cut out all the audio segments which might include a voiced alveolar stop. We might then use PRAAT to measure the duration of the stops and export these durations to a tabular format, such as .CSV.

One way, and IMO the best way, to describe the set of files generated is to describe them as a Collection. This then matches the DCMIType Vocabulary: Collection, Dataset, Event, Image, InteractiveResource, MovingImage, PhysicalObject, Service, Software, Sound, StillImage, Text. The DCMIType is dereferenceable in that it has stable URLs. Using DCMIType also aligns the metadata with OLAC, which points to the DCMIType as its baseline (it is my understanding that Pangloss is doing this, because it feeds its content to OLAC).

xxxx.wav
xxxx.eaf
xxxx.textgrid
xxxx.pdf
xxxx.tex
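As a rough illustration, a set of files like this could be wrapped in a Collection description with a few lines of Python. This is only a sketch under assumptions: typing each part as MediaObject is one possible choice, the collection name is invented, and `collection_from_files` is a hypothetical helper.

```python
import json

def collection_from_files(name, filenames):
    """Build a schema.org Collection whose parts are the given files."""
    return {
        "@context": "https://schema.org",
        "@type": "Collection",
        "name": name,
        # Each file becomes one part; MediaObject is an assumed, generic type.
        "hasPart": [{"@type": "MediaObject", "name": f} for f in filenames],
    }

files = ["xxxx.wav", "xxxx.eaf", "xxxx.textgrid", "xxxx.pdf", "xxxx.tex"]
collection = collection_from_files("Elicitation session", files)
print(json.dumps(collection, indent=2))
```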

Other schemas in linguistics include https://github.com/digitallinguistics/spec. arxiv.org has an interesting post on how, from time to time, they must expand the number of "fieldsOfStudy" that they use to categorize their pre-print papers. Given that, I don't see how "field of study" can be limited to a fixed list.

Another thought in relation to "fieldOfStudy" is that it might be possible to extend the range of https://schema.org/industry so that it applies not just to job postings but also to creative works.