rstudio / pins-r

Pin, discover, and share resources
https://pins.rstudio.com

JSON-LD #177

Closed javierluraschi closed 3 years ago

javierluraschi commented 4 years ago

Consider supporting JSON-LD instead of data.txt format.

javierluraschi commented 4 years ago

TL;DR: we still want to use data.txt for the top-level data catalog, mostly to list paths to resources; however, for metadata, such as column descriptions, we should try to integrate JSON-LD.

Investigation

There is a lot of value in JSON-LD. Even for simple things, like an FAQ dataset, I wish we could do something in R like the following (and maybe it is possible already):

faq_data <- list(list(q = "What year is it?", a = "2020"))
schemaorg::write(faq_data, schema = "faqpage", "myfaq.json")
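Whether or not a schemaorg package offers this, roughly the same output can be sketched today with jsonlite. This is a minimal sketch, assuming the schema.org FAQPage layout (Question entries under mainEntity, each with an acceptedAnswer); faq_to_jsonld is a hypothetical helper, not an existing function:

```r
library(jsonlite)

# Hypothetical helper (not part of any existing package): turn a list of
# question/answer pairs into a schema.org FAQPage structure.
faq_to_jsonld <- function(faq) {
  list(
    "@context" = "http://schema.org/",
    "@type" = "FAQPage",
    mainEntity = lapply(faq, function(item) {
      list(
        "@type" = "Question",
        name = item$q,
        acceptedAnswer = list("@type" = "Answer", text = item$a)
      )
    })
  )
}

faq_data <- list(list(q = "What year is it?", a = "2020"))

# auto_unbox = TRUE writes length-one vectors as JSON scalars, not arrays.
faq_json <- toJSON(faq_to_jsonld(faq_data), auto_unbox = TRUE, pretty = TRUE)
writeLines(faq_json, "myfaq.json")
```

The result is an ordinary R list until the last step, so it could be handed to any writer, not only a local file.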

Would love to extend that to pins to store the JSON-LD in a particular location, say S3 or, as in this example, a GitHub board:

schemaorg::create(faq_data, schema = "faqpage") %>%
  pins::pin(name = "myfaq", board = "github")

So far so good and definitely excited to get something like this integrated into pins, at some point.

The trickier piece is the concept of a DataCatalog, which Google seems to suggest we use.

Apparently, a DataCatalog also has a JSON-LD schema, but things get complicated really fast. First, it is hard to find examples. I did find one:

http://dcat.be/

But then, from the schema.org Wikipedia page, it looks like there are three formats one can use (Microdata, RDFa, or JSON-LD), and it recommends adding a bunch of additional fields. So, for instance, a data catalog for iris and mtcars in data.txt would look like:

- name: iris
  path: iris.csv
- name: mtcars
  path: mtcars.csv

While, I think, in schema.org it would look like:

<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@type": "DataCatalog",
  "name": "Catalog",
  "dataset":
  [
    {
       "@type": "Dataset",
       "name": "iris",
       "url": "https://domainname/iris.csv"
    },
    {
       "@type": "Dataset",
       "name": "mtcars",
       "url": "https://domainname/mtcars.csv"
    }
  ]
}
</script>

I think? The code above might honestly be wrong and, in general, it makes me think that DataCatalog is an overly complicated approach to listing datasets. It also does not seem to be used that often for discoverability, since the JSON-LD is usually embedded into a webpage.
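To make the comparison concrete, here is a rough sketch of deriving a DataCatalog from the data.txt entries with the yaml and jsonlite packages. It assumes the schema.org property names dataset (on DataCatalog) and url (on Dataset); base_url is a placeholder taken from the example host, and none of this is existing pins functionality:

```r
library(yaml)
library(jsonlite)

# The data.txt catalog from the example above, inlined as YAML text.
data_txt <- "
- name: iris
  path: iris.csv
- name: mtcars
  path: mtcars.csv
"

# Assumed base URL; the placeholder host comes from the JSON-LD example.
base_url <- "https://domainname"

# A YAML sequence of maps is parsed into a list of named lists.
entries <- yaml.load(data_txt)

catalog <- list(
  "@context" = "http://schema.org/",
  "@type" = "DataCatalog",
  name = "Catalog",
  dataset = lapply(entries, function(entry) {
    list(
      "@type" = "Dataset",
      name = entry$name,
      url = paste0(base_url, "/", entry$path)
    )
  })
)

catalog_json <- toJSON(catalog, auto_unbox = TRUE, pretty = TRUE)
cat(catalog_json, "\n")
```

Since the mapping is mechanical, data.txt could stay the source of truth and the JSON-LD could be generated only where discoverability matters.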