Common format for metadata and test specifications

As discussed in https://github.com/perma-id/w3id.org/pull/3786 and https://github.com/perma-id/w3id.org/pull/3801, there is a need for a common format for test specifications and metadata. At least three options are available, each with pros and cons:

1. Custom inline format in the source files
   Example:
   `##TESTv1 '/mypath --header "Accept: text/html"' "https://my-target-domain.com/test.html"` in the `.htaccess` or `maintainer:https://github.com/abc123` in the `README.md` file.
   Pro: Minimalistic, no duplication; users just extend existing files and add information in intuitive locations.
   Con: Regex-based machine readability is limited and may lead to errors.
2. Structured inline format in the `README.md` file (microformat, as suggested by bfabio). It should be RDF-serialized, e.g. as JSON-LD or YAML-LD, since we are already in the linked data domain. Example:
   ```
   {
     "@context": "https://w3id.org/vocab",
     "@type": "W3IdSubpath",
     "maintainer": [
       "https://github.com/<some_user>",
       "..."
     ],
     "tests": [
       {
         "@type": "RedirectTest",
         "url": "https://w3id.org/...",
         "headers": "...",
         "expected": "..."
       }
     ]
   }
   ```
   Pro: Embedded data is machine readable, no custom format, no additional file; data can be stored in multiple locations in the document.
   Con: More complex; embedded data needs to be extracted from the `README.md` file (e.g. filtering all code blocks with format json-ld, parsing, filtering by `@type`).
3. Structured format in a dedicated file: like 2., but in a `meta.json` or `meta.yml` file.
   Pro: Machine readable without custom parsing; data could also be fetched by any linked data crawler.
   Con: Additional file to maintain, separation from other documentation, duplication of information.
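
To make the extraction cost of option 2 concrete, here is a minimal sketch of the steps named in its "Con" (find code blocks, parse, filter by `@type`). It assumes the block is fenced with the info string `json-ld`, which is only one possible convention; the function name `extract_metadata` is hypothetical.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example easy to embed

# Assumption: metadata blocks are fenced code blocks whose info string is "json-ld".
BLOCK_RE = re.compile(
    "^" + FENCE + r"json-ld[ \t]*\n(.*?)^" + FENCE + r"[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def extract_metadata(readme_text, wanted_type="W3IdSubpath"):
    """Return every parsed JSON object of the wanted @type found in json-ld blocks."""
    found = []
    for match in BLOCK_RE.finditer(readme_text):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed block: skipped here, a real tool should report it
        if isinstance(data, dict) and data.get("@type") == wanted_type:
            found.append(data)
    return found
```

This only does plain JSON parsing; it does not expand the JSON-LD context, which a real linked data consumer would do.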

@davidlehn, @bfabio: What do you think?

I would go for option 2, probably with a YAML-LD based representation. We can ask people to annotate the block in a certain manner, e.g., ```yaml-ld or similar, and ask them to provide both their names and GitHub IDs.

As for the tests, I would only run them when a w3id is created or updated: there used to be a test covering all w3ids, and it kept failing on older, unrelated w3ids that are currently broken for some other reason.
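
As a rough sketch of "only test what changed": given a list of changed file paths (for example from `git diff --name-only`), pick the affected w3id directories. This assumes each w3id lives in its own top-level directory of the repository; the function name is hypothetical.

```python
from pathlib import PurePosixPath


def changed_w3id_dirs(changed_paths):
    """Map changed repository paths to the top-level w3id directories they touch."""
    dirs = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        # Skip files in the repository root and hidden tooling directories:
        # under the assumption above, neither is a w3id.
        if len(parts) >= 2 and not parts[0].startswith("."):
            dirs.add(parts[0])
    return sorted(dirs)
```

Only the tests declared for the returned directories would then be run, so unrelated older w3ids cannot fail the check.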

The other issue we may encounter is that, if a repository has folders, some of the sub-w3ids may also be maintained via the parent README, or they may have their own READMEs with new maintainers.

Finally, the w3id https://w3id.org/vocab is already reserved, so we would have to choose a new one.

- I suggest the namespace be https://w3id.org/w3id[...].
- JSON-LD people have often done something like https://w3id.org/w3id/v1 for the context, which would redirect to ... https://w3id.org/w3id/contexts/v1.jsonld, I guess? This would be an exception to the rule about not hosting data. :-)
- Property/type URLs would be something like https://w3id.org/w3id#FooBar.
- As mentioned in one of the other related issues, I had thought of (eventually) allowing one of `.w3id.{jsonld,json,jsonc,yaml,ttl,nq,etc}`. As a JSON-LD person, I would convert to JSON-LD, then frame it to a format that is easily processed by JSON tooling. Probably support JSON and YAML to start. And shut off Apache access to those files by default from the top-level config so they are not served up.
- This is all hierarchical, so it will need to be designed to load all configs up the tree and process them all in some smart way.
- I'm not quite sure why people use READMEs at all! :-) So much easier to put a few minimal comments in the .htaccess file. Who reads those READMEs? Many/all people without wildcards in their rules forget to disallow access to the READMEs so they are often live, I think.
- If people really want to put machine readable metadata in READMEs, I guess that's fine? Checking for and extracting it is a hassle compared to a config file. And embedded markup probably won't have the same native syntax checking that a config file with an extension would have.
- There's room to allow all of this, at least for now, and experiment to see what works best. None of it is difficult to set up, but we might mark it as experimental and simplify later.
- I assume this will change over time, so I had thought of also having a main "policy" version sort of thing that would enforce structure. I'm not sure what that even means, but Debian packaging policy seems to work well.
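
A minimal sketch of the "load all configs up the tree" idea, assuming JSON only and the file name `.w3id.json` (one of the candidates above). The merge strategy, where deeper directories override shallower ones key by key, is just one naive choice, and both function names are hypothetical.

```python
import json
from pathlib import Path

CONFIG_NAME = ".w3id.json"  # assumption: one of the candidate file names


def load_config_chain(repo_root, subpath):
    """Collect configs from repo_root down to subpath, outermost first."""
    root = Path(repo_root).resolve()
    target = (root / subpath).resolve()
    # Directories from the target up to the root, limited to those inside the root.
    dirs = [d for d in (target, *target.parents) if d == root or root in d.parents]
    configs = []
    for directory in reversed(dirs):  # root first, target last
        candidate = directory / CONFIG_NAME
        if candidate.is_file():
            configs.append(json.loads(candidate.read_text(encoding="utf-8")))
    return configs


def merge_configs(configs):
    """Naive merge: later (deeper) configs override earlier ones key by key."""
    merged = {}
    for config in configs:
        merged.update(config)
    return merged
```

Whether a child should replace or extend its parent's maintainers is exactly the kind of question a "policy" version would have to settle; the sketch simply replaces.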

> I'm not quite sure why people use READMEs at all! :-) So much easier to put a few minimal comments in the .htaccess file. Who reads those READMEs? Many/all people without wildcards in their rules forget to disallow access to the READMEs so they are often live, I think.

Well, I think people just follow the instructions we put in: https://github.com/perma-id/w3id.org#creating-a-new-identifier :)

[@davidlehn]

> I'm not quite sure why people use READMEs at all! :-) So much easier to put a few minimal comments in the .htaccess file. Who reads those READMEs? Many/all people without wildcards in their rules forget to disallow access to the READMEs so they are often live, I think.

As @dgarijo says, the w3id.org instructions are a big reason, and I think it's important to consider why those instructions exist before bashing or trashing them.

.htaccess comments are not always obvious to less-technical humans, and often include far more technical discussion than the higher-level summary that is typically put into README.md files, which are meant for humans from the start. Even when lower-level technical details are included in the README.md, they're generally presented in a more human-friendly style than what is found in most (if not all) .htaccess files.

I think it's good for those README.md files to be in the open, for various reasons, including but not limited to exposure via search engines, which can lead people to w3id.org itself as well as to the content behind the w3id.org-based redirects.

> 1. Custom inline format in the source files
>    Example:
>    `##TESTv1 '/mypath --header "Accept: text/html"' "https://my-target-domain.com/test.html"` in the `.htaccess` or `maintainer:https://github.com/abc123` in the `README.md` file.
>    Pro: Minimalistic, no duplication; users just extend existing files and add information in intuitive locations.
>    Con: Regex-based machine readability is limited and may lead to errors.

@simontaurus thanks for the recap. I actually prefer option 1 over the others, especially over option 2. I know that a loosely structured, non-semantic format is probably not a popular opinion here :smile:.

I'd also add that we can simplify even more, without a new format: if we are willing to be GitHub/GitLab specific, `@username` is already a thing in Markdown files on those forges, meaning "that user", and the regexp for extracting it is straightforward and error-proof.
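
For illustration, a sketch of such a regexp. It assumes GitHub-style usernames (letters, digits and single hyphens, up to 39 characters); GitLab's rules are broader and are not covered here. The function name is hypothetical.

```python
import re

# Assumption: GitHub-style usernames. The lookbehind rejects an "@" that follows
# a word character, "@", "/", "." or "-", which filters out e-mail addresses.
MENTION_RE = re.compile(
    r"(?<![\w@/.-])@([A-Za-z0-9](?:[A-Za-z0-9]|-(?=[A-Za-z0-9])){0,38})"
)


def extract_mentions(markdown_text):
    """Return the mentioned usernames in order of first appearance, without duplicates."""
    return list(dict.fromkeys(MENTION_RE.findall(markdown_text)))
```

One caveat: this scans the whole file, so an `@`-prefixed token inside a code block (such as `"@context"` in a JSON example) would also be picked up. Restricting the search to a dedicated "Maintainers" line would avoid that.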

@dgarijo

> I would go for option 2, probably with a YAML-LD based representation. We can ask people to annotate the block in a certain manner, e.g., ```yaml-ld or similar, and ask them to provide both their names and GitHub IDs.

Looking at option 2: it has the same con as option 1, but worse. We'd still need a regexp to extract that block, and the syntax within the block is far more complex (e.g. we'd need to escape ```).

Regarding the rule testing issue, I didn't investigate it in depth, but it makes sense to me to have the tests close to the actual rule.
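
To show what keeping tests next to the rule could look like on the tooling side, here is a sketch that parses the `##TESTv1` comment from option 1. The field layout (a quoted request with curl-like `--header` options, then the expected target) is inferred from the single example in this thread and is an assumption, as is the function name.

```python
import shlex

PREFIX = "##TESTv1"


def parse_test_line(line):
    """Parse one '##TESTv1' comment into (path, headers, expected_url), or None."""
    line = line.strip()
    if not line.startswith(PREFIX + " "):
        return None
    try:
        # Two shell-style quoted fields: the request and the expected target.
        request, expected = shlex.split(line[len(PREFIX):])
        path, *options = shlex.split(request)
    except ValueError:
        return None  # unbalanced quotes or an unexpected number of fields
    headers = {}
    i = 0
    while i < len(options):
        if options[i] == "--header" and i + 1 < len(options):
            name, _, value = options[i + 1].partition(":")
            headers[name.strip()] = value.strip()
            i += 2
        else:
            return None  # unknown option: reject rather than guess
    return path, headers, expected
```

Using `shlex` instead of a hand-written regexp means a line with a missing closing quote is rejected instead of being silently misread.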

> Structured inline format in the `README.md` file (microformat, as suggested by bfabio). It should be RDF-serialized, e.g. as JSON-LD or YAML-LD, since we are already in the linked data domain. Example:
>
> ```
> {
>   "@context": "https://w3id.org/vocab",
>   "@type": "W3IdSubpath",
>   "maintainer": [
>     "https://github.com/<some_user>",
>     "..."
>   ],
>   "tests": [
>     {
>       "@type": "RedirectTest",
>       "url": "https://w3id.org/...",
>       "headers": "...",
>       "expected": "..."
>     }
>   ]
> }
> ```
>
> Pro: Embedded data is machine readable, no custom format, no additional file; data can be stored in multiple locations in the document.
> Con: More complex; embedded data needs to be extracted from the `README.md` file (e.g. filtering all code blocks with format json-ld, parsing, filtering by `@type`).

I feel something this "complex" could lead us into a world of pain with maintenance. For starters, we'd need a validator, and even then, I bet the support requests will increase. "Why doesn't it work? Oh, there's a comma missing. Oh, it's "maintainer", not "maintainers". Did I close that parenthesis?"
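
To give an idea of the minimum such a validator would have to do, here is a sketch that catches the "maintainers" vs "maintainer" kind of mistake. The key names come from the example block in this thread; which keys are required is purely an assumption, and the function name is hypothetical.

```python
import difflib

KNOWN_KEYS = {"@context", "@type", "maintainer", "tests"}
REQUIRED_KEYS = {"@type", "maintainer"}  # assumption, not an agreed schema


def validate_metadata(data):
    """Return a list of human-readable problems; an empty list means no issue found."""
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    problems = []
    for key in sorted(set(data) - KNOWN_KEYS):
        # Suggest the closest known key so typos get an actionable message.
        hint = difflib.get_close_matches(key, sorted(KNOWN_KEYS), n=1)
        suffix = f' (did you mean "{hint[0]}"?)' if hint else ""
        problems.append(f'unknown key "{key}"{suffix}')
    for key in sorted(REQUIRED_KEYS - set(data)):
        problems.append(f'missing required key "{key}"')
    maintainers = data.get("maintainer", [])
    if not isinstance(maintainers, list) or not all(isinstance(m, str) for m in maintainers):
        problems.append('"maintainer" must be a list of strings')
    return problems
```

Syntax errors such as a missing comma would already surface at the JSON parsing step, before this function is reached, and would need their own friendly error message.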

Let's keep in mind sometimes those files are maintained by people new to the world of semantic data - or the world of open collaboration for that matter.

@davidlehn

> Many/all people without wildcards in their rules forget to disallow access to the READMEs so they are often live, I think.

I think READMEs are really useful for describing the context and for serving as a sort of "landing page" for the path. I guess there should be a global rule to not publish them, but having them nicely formatted and editable within the GitHub UI makes them a great answer to "what is this abstract/technical stuff all about?".

Common format for metadata and test specifications #3811