iop-alliance / OpenKnowHow

A metadata specification to enable the collection of distributed, standardised metadata of open source hardware designs
GNU General Public License v3.0

RDF IRI(-generation) & Data Location/Storage #4

Open hoijui opened 9 months ago

hoijui commented 9 months ago

Current Data Aggregation Process

  1. Projects are found on different platforms by different means.
  2. Their meta-data is extracted, either by:
    1. copying the okh.toml file directly out of their storage/repo, or
    2. assembling an okh.toml file by using the hosting platform's API.
  3. The TOML data is converted to RDF (see the sketch below).
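
To make step 3 concrete, here is a minimal Python sketch of the TOML-to-RDF conversion using rdflib; the namespace URL, the okh:Module class and the way fields are mapped are illustrative assumptions, not the actual OKH ontology or tooling.

```python
# Minimal sketch of step 3 (okh.toml -> RDF).
# The namespace and the field mapping are placeholders, NOT the real OKH ontology.
import tomllib  # Python >= 3.11
from rdflib import RDF, Graph, Literal, Namespace, URIRef

OKH = Namespace("https://example.org/okh#")  # placeholder namespace

def toml_to_rdf(toml_path: str, subject_iri: str) -> Graph:
    """Convert an okh.toml manifest into an RDF graph.

    `subject_iri` is the IRI the project's data is supposed to be
    reachable under - the crux of this whole issue.
    """
    with open(toml_path, "rb") as manifest_file:
        manifest = tomllib.load(manifest_file)

    graph = Graph()
    graph.bind("okh", OKH)
    subject = URIRef(subject_iri)

    graph.add((subject, RDF.type, OKH.Module))
    # Naively map all top-level string fields to properties.
    for key, value in manifest.items():
        if isinstance(value, str):
            graph.add((subject, OKH[key], Literal(value)))

    return graph

# Example usage:
# g = toml_to_rdf("okh.toml", "https://my-project.example/rdf/latest")
# print(g.serialize(format="turtle"))
```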

The problem

The idea of Linked Data, and very much ours for OKH too, is to support a distributed data system. Furthermore, all RDF data - more specifically, each subject - is uniquely identifiable by its IRI. An IRI is essentially the Unicode-capable counterpart of a URL. \ This pushes two requirements onto us:

  1. It is strongly recommended - and we should ensure this to be the case - that a subject's data is actually available (dereferenceable) under its IRI.
  2. If we generate the RDF on our server (be it centralized or decentralized) and make it available to the public, we would necessarily have to do so under a domain that we control - and that the original project does not. That domain thus becomes an essential part of the RDF's IRI. This means the data would not be distributed anymore. It also means that each data collector would host each project's RDF under its own URL and use that URL as the IRI, so we would end up with the same project/data available under different IRIs - which are supposed to be unique identifiers - i.e. with duplicates (illustrated in the sketch below). \ -> very bad!
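
To make the duplication problem tangible, here is a small rdflib example in Python; both collector domains and the namespace are made up for illustration:

```python
# The same project, aggregated by two collectors, ends up as two
# unrelated subjects, because each collector minted the IRI under
# its own (made-up) domain.
from rdflib import Graph

collector_a = """
@prefix okh: <https://example.org/okh#> .
<https://collector-a.example/rdf/my-project>
    a okh:Module ;
    okh:name "My Project" .
"""

collector_b = """
@prefix okh: <https://example.org/okh#> .
<https://collector-b.example/rdf/my-project>
    a okh:Module ;
    okh:name "My Project" .
"""

merged = Graph()
merged.parse(data=collector_a, format="turtle")
merged.parse(data=collector_b, format="turtle")

# Two distinct subjects for one and the same project -> duplicates.
print(len(set(merged.subjects())))  # 2
```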

We could choose to do one of two things:

  1. use a domain under the control of the original project (e.g. its GitHub Pages URL or a perma-URL it registered for this purpose), but actually host the RDF under our own domain, violating the first requirement above, or
  2. host it on our domain and also use that actual hosting location as its IRI, which satisfies the first requirement above but violates the second (a rough check for the first requirement is sketched below).
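
As a rough illustration of what the first requirement means in practice, here is a hedged Python sketch of a dereferencing check (a plain HTTP GET with content negotiation; a real Linked Data check would also have to deal with redirects, hash IRIs, etc.):

```python
# Does the subject IRI actually dereference to RDF data?
# Option 1 above would fail this check (the project's domain does not
# serve the RDF); option 2 passes it, but at the cost of our own domain
# becoming part of the identifier.
import urllib.error
import urllib.request

RDF_CONTENT_TYPES = ("text/turtle", "application/rdf+xml", "application/ld+json")

def dereferences_to_rdf(iri: str) -> bool:
    request = urllib.request.Request(
        iri,
        headers={"Accept": "text/turtle, application/rdf+xml;q=0.9, application/ld+json;q=0.8"},
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.headers.get_content_type() in RDF_CONTENT_TYPES
    except (urllib.error.URLError, TimeoutError):
        return False
```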

In theory, there is a third option: \ Each project generates its RDF by itself in CI, and then hosts it permanently (at least each release version of it plus the latest development one); a minimal sketch of such a publishing step follows below. That, though, is very unlikely to happen, unstable, difficult to maintain and update, ... and only possible for git-hosted (or other SCM-hosted) projects. \ -> not really an option.
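
For completeness, a minimal sketch of what the project-side publishing step of this third option could look like; the output layout (one Turtle file per release plus latest.ttl) is an assumption for illustration:

```python
# Hypothetical project-side CI step: write the freshly generated RDF graph
# to one file per release version, plus an always-overwritten "latest".
from pathlib import Path
from rdflib import Graph

def publish_release_rdf(graph: Graph, version: str, out_dir: str = "public/rdf") -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    graph.serialize(destination=out / f"{version}.ttl", format="turtle")
    graph.serialize(destination=out / "latest.ttl", format="turtle")

# publish_release_rdf(my_graph, "1.2.0")
```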

hoijui commented 9 months ago

DING, DING, DING, DING, ...

:O Now, writing the above, I got an idea! There is actually a fourth option: we could use an approach similar to what W3ID does to host the data. There would be one (or optionally a few, for redundancy) git repos that contain/host all the RDF data. Multiple parties that aggregate the data have push access to it and regularly push to it in an automated fashion when crawling/generating the data; a rough sketch of such a push step follows below. This means both data gatherers and individual projects could push data. This allows for somewhat distributed, but at the very least decentralized/federated, power over the RDF data, and as a huge beneficial side effect it would allow the data-gathering load to be distributed efficiently.
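
A rough Python sketch of what such an automated push from an aggregator could look like; the repository URL and the directory layout are made up for illustration, and a real implementation would have to handle conflicts between aggregators and empty commits:

```python
# Aggregator-side: write the generated RDF into a clone of the shared
# data repository and push it.
import subprocess
from pathlib import Path

DATA_REPO = "git@github.com:example-org/okh-rdf-data.git"  # placeholder URL

def push_rdf(project_id: str, turtle: str, workdir: str = "/tmp/okh-rdf-data") -> None:
    repo = Path(workdir)
    if not repo.exists():
        subprocess.run(["git", "clone", DATA_REPO, str(repo)], check=True)
    else:
        subprocess.run(["git", "-C", str(repo), "pull", "--rebase"], check=True)

    rdf_file = repo / "projects" / f"{project_id}.ttl"
    rdf_file.parent.mkdir(parents=True, exist_ok=True)
    rdf_file.write_text(turtle, encoding="utf-8")

    subprocess.run(["git", "-C", str(repo), "add", str(rdf_file)], check=True)
    # NOTE: `git commit` fails if nothing changed; a real implementation
    # would check for that first.
    subprocess.run(
        ["git", "-C", str(repo), "commit", "-m", f"Update RDF for {project_id}"],
        check=True,
    )
    subprocess.run(["git", "-C", str(repo), "push"], check=True)
```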