iop-alliance / OpenKnowHow

A metadata specification to enable the collection of distributed, standardised metadata of open source hardware designs
GNU General Public License v3.0

RDF IRI(-generation) & Data Location/Storage #4

Open hoijui opened 9 months ago

hoijui commented 9 months ago

Current Data Aggregation Process

  1. Projects are found on different platforms by different means.
  2. Their meta-data is extracted, either by:
    1. copying the okh.toml file directly out of their storage/repo, or
    2. assembling an okh.toml file by using the hosting platform's API.
  3. The TOML data is converted to RDF (see the sketch below).
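
To make step 3 concrete, here is a minimal Python sketch of the TOML-to-RDF conversion using rdflib; the namespace URL, the okh:Module class and the way fields are mapped are illustrative assumptions, not the actual OKH ontology or tooling.

```python
# Minimal sketch of step 3 (okh.toml -> RDF).
# The namespace and the field mapping are placeholders, NOT the real OKH ontology.
import tomllib  # Python >= 3.11
from rdflib import RDF, Graph, Literal, Namespace, URIRef

OKH = Namespace("https://example.org/okh#")  # placeholder namespace

def toml_to_rdf(toml_path: str, subject_iri: str) -> Graph:
    """Convert an okh.toml manifest into an RDF graph.

    `subject_iri` is the IRI the project's data is supposed to be
    reachable under - the crux of this whole issue.
    """
    with open(toml_path, "rb") as manifest_file:
        manifest = tomllib.load(manifest_file)

    graph = Graph()
    graph.bind("okh", OKH)
    subject = URIRef(subject_iri)

    graph.add((subject, RDF.type, OKH.Module))
    # Naively map all top-level string fields to properties.
    for key, value in manifest.items():
        if isinstance(value, str):
            graph.add((subject, OKH[key], Literal(value)))

    return graph

# Example usage:
# g = toml_to_rdf("okh.toml", "https://my-project.example/rdf/latest")
# print(g.serialize(format="turtle"))
```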

The problem

The idea of Linked Data, and very much ours for OKH too, is to support a distributed data system. Furthermore, all RDF data - more specifically, each subject - is uniquely identifiable by its IRI. An IRI is essentially the Unicode-capable counterpart of a URL. \ This pushes two requirements onto us:

  1. It is strongly recommended - and we should ensure this to be the case - that a subject's data is actually available (dereferenceable) under its IRI.
  2. If we generate the RDF on our server (be it centralized or decentralized) and make it available to the public, we would necessarily have to do so under a domain that we control - and that the original project does not. That domain thus becomes an essential part of the RDF's IRI. This means the data would not be distributed anymore. It also means that each data collector would host each project's RDF under its own URL and use that URL as the IRI, so we would end up with the same project/data available under different IRIs - which are supposed to be unique identifiers - i.e. with duplicates (illustrated in the sketch below). \ -> very bad!
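
To make the duplication problem tangible, here is a small rdflib example in Python; both collector domains and the namespace are made up for illustration:

```python
# The same project, aggregated by two collectors, ends up as two
# unrelated subjects, because each collector minted the IRI under
# its own (made-up) domain.
from rdflib import Graph

collector_a = """
@prefix okh: <https://example.org/okh#> .
<https://collector-a.example/rdf/my-project>
    a okh:Module ;
    okh:name "My Project" .
"""

collector_b = """
@prefix okh: <https://example.org/okh#> .
<https://collector-b.example/rdf/my-project>
    a okh:Module ;
    okh:name "My Project" .
"""

merged = Graph()
merged.parse(data=collector_a, format="turtle")
merged.parse(data=collector_b, format="turtle")

# Two distinct subjects for one and the same project -> duplicates.
print(len(set(merged.subjects())))  # 2
```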

We could choose to do one of two things:

  1. use a domain under the control of the original project (e.g. its GitHub Pages URL or a perma-URL it registered for this purpose), but actually host the RDF under our own domain, violating the first requirement above, or
  2. host it on our domain and also use that actual hosting location as its IRI, which satisfies the first requirement above but violates the second (a rough check for the first requirement is sketched below).
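
As a rough illustration of what the first requirement means in practice, here is a hedged Python sketch of a dereferencing check (a plain HTTP GET with content negotiation; a real Linked Data check would also have to deal with redirects, hash IRIs, etc.):

```python
# Does the subject IRI actually dereference to RDF data?
# Option 1 above would fail this check (the project's domain does not
# serve the RDF); option 2 passes it, but at the cost of our own domain
# becoming part of the identifier.
import urllib.error
import urllib.request

RDF_CONTENT_TYPES = ("text/turtle", "application/rdf+xml", "application/ld+json")

def dereferences_to_rdf(iri: str) -> bool:
    request = urllib.request.Request(
        iri,
        headers={"Accept": "text/turtle, application/rdf+xml;q=0.9, application/ld+json;q=0.8"},
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.headers.get_content_type() in RDF_CONTENT_TYPES
    except (urllib.error.URLError, TimeoutError):
        return False
```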

In theory, there is a third option: \ Each project generates its RDF by itself in CI, and then hosts it permanently (at least each release version of it plus the latest development one); a minimal sketch of such a publishing step follows below. That, though, is very unlikely to happen, unstable, difficult to maintain and update, ... and only possible for git-hosted (or other SCM-hosted) projects. \ -> not really an option.
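
For completeness, a minimal sketch of what the project-side publishing step of this third option could look like; the output layout (one Turtle file per release plus latest.ttl) is an assumption for illustration:

```python
# Hypothetical project-side CI step: write the freshly generated RDF graph
# to one file per release version, plus an always-overwritten "latest".
from pathlib import Path
from rdflib import Graph

def publish_release_rdf(graph: Graph, version: str, out_dir: str = "public/rdf") -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    graph.serialize(destination=out / f"{version}.ttl", format="turtle")
    graph.serialize(destination=out / "latest.ttl", format="turtle")

# publish_release_rdf(my_graph, "1.2.0")
```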

hoijui commented 9 months ago

DING, DING, DING, DING, ...

:O Now, writing the above, I got an idea! There is actually a fourth option: we could use an approach similar to what W3ID does to host the data. There would be one (or optionally a few, for redundancy) git repos that contain/host all the RDF data. Multiple parties that aggregate the data have push access to it and regularly push to it in an automated fashion when crawling/generating the data; a rough sketch of such a push step follows below. This means both data gatherers and individual projects could push data. This allows for somewhat distributed, but at the very least decentralized/federated, power over the RDF data, and as a huge beneficial side effect it would allow the data-gathering load to be distributed efficiently.
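
A rough Python sketch of what such an automated push from an aggregator could look like; the repository URL and the directory layout are made up for illustration, and a real implementation would have to handle conflicts between aggregators and empty commits:

```python
# Aggregator-side: write the generated RDF into a clone of the shared
# data repository and push it.
import subprocess
from pathlib import Path

DATA_REPO = "git@github.com:example-org/okh-rdf-data.git"  # placeholder URL

def push_rdf(project_id: str, turtle: str, workdir: str = "/tmp/okh-rdf-data") -> None:
    repo = Path(workdir)
    if not repo.exists():
        subprocess.run(["git", "clone", DATA_REPO, str(repo)], check=True)
    else:
        subprocess.run(["git", "-C", str(repo), "pull", "--rebase"], check=True)

    rdf_file = repo / "projects" / f"{project_id}.ttl"
    rdf_file.parent.mkdir(parents=True, exist_ok=True)
    rdf_file.write_text(turtle, encoding="utf-8")

    subprocess.run(["git", "-C", str(repo), "add", str(rdf_file)], check=True)
    # NOTE: `git commit` fails if nothing changed; a real implementation
    # would check for that first.
    subprocess.run(
        ["git", "-C", str(repo), "commit", "-m", f"Update RDF for {project_id}"],
        check=True,
    )
    subprocess.run(["git", "-C", str(repo), "push"], check=True)
```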