perma-id / w3id.org

Website source code for w3id.org.
https://w3id.org/

Common format for metadata and test specifications #3811

Open simontaurus opened 8 months ago

simontaurus commented 8 months ago

As discussed in https://github.com/perma-id/w3id.org/pull/3786 and https://github.com/perma-id/w3id.org/pull/3801, there is a need for a common format for test specifications and metadata. At least three options are available, each with pros and cons:

  1. Custom inline format in the source files
     Example: `##TESTv1 '/mypath --header "Accept: text/html"' "https://my-target-domain.com/test.html"` in the `.htaccess` or `maintainer:https://github.com/abc123` in the `README.md` file.
     Pro: Minimalistic, no duplication; users just extend existing files and add information in intuitive locations.
     Con: Regex-based machine readability is limited and may lead to errors.

  2. Structured inline format in the README.md file, a microformat as suggested by bfabio. It should be RDF-serialized, e.g. as json-ld or yaml-ld, since we are already in the linked data domain. Example:

    {
      "@context": "https://w3id.org/vocab",
      "@type": "W3IdSubpath",
      "maintainer": [
        "https://github.com/<some_user>",
        "..."
      ],
      "tests": [
        {
          "@type": "RedirectTest",
          "url": "https://w3id.org/...",
          "headers": "...",
          "expected": "..."
        }
      ]
    }

    Pro: Embedded data is machine readable, no custom format, no additional file; data can be stored in multiple locations in the document.
    Con: More complex; embedded data needs to be extracted from the README.md file (e.g. by filtering all code blocks with format json-ld, parsing, and filtering by @type).

  3. Structured format in a dedicated file: like 2., but in a meta.json or meta.yml file.
     Pro: Machine readable without custom parsing; data could also be fetched by any linked data crawler.
     Con: Additional file to maintain, separation from other documentation, duplication of information.
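The extraction step named under option 2 (filter code blocks tagged json-ld, parse, filter by @type) might look roughly like the sketch below. It is only an illustration: the function name `extract_metadata`, the `json-ld` info string on the fence, and the default `W3IdSubpath` type are assumptions based on the example above, not an agreed format.

```python
import json
import re

# Three backticks, built at runtime so this sketch can itself live in a fenced block.
TICKS = "`" * 3

# Matches a fenced block whose info string is exactly "json-ld" (assumed tag).
FENCE = re.compile(
    rf"^{TICKS}json-ld[ \t]*\n(.*?)^{TICKS}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_metadata(readme_text, wanted_type="W3IdSubpath"):
    """Return every parsed json-ld block in the README whose @type matches."""
    found = []
    for match in FENCE.finditer(readme_text):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            # Skip blocks that are not valid JSON; a real tool would report them.
            continue
        if isinstance(data, dict) and data.get("@type") == wanted_type:
            found.append(data)
    return found
```

This treats the block as plain JSON; a full linked-data pipeline would additionally resolve the `@context`, which is out of scope for this sketch.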

@davidlehn, @bfabio: What do you think?

dgarijo commented 8 months ago

I would go for option 2, probably with a yaml-ld based representation. We can ask people to annotate the block in a certain manner, e.g., ```yaml-ld or similar. And ask people to provide both their names and github id.

As for the tests, I would only run them when a w3id is created or updated: there used to be a test covering all w3ids, and it kept failing on older, unrelated w3ids that are broken for some other reason.
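Restricting tests to new or updated w3ids could start from the list of files changed in a pull request. A minimal sketch, assuming each w3id lives in its own top-level directory of the repository and that the changed paths come from something like `git diff --name-only`; the function name `affected_ids` is hypothetical:

```python
from pathlib import PurePosixPath


def affected_ids(changed_files):
    """Map changed repository paths to the top-level directories they belong to."""
    ids = set()
    for name in changed_files:
        parts = PurePosixPath(name).parts
        # Files directly in the repository root belong to no single w3id.
        if len(parts) >= 2:
            ids.add(parts[0])
    return sorted(ids)
```

Only the directories returned here would then have their tests run, so failures in untouched, older w3ids would not block the change.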

The other issue we may encounter is that if a repository has folders, some of the sub w3ids may also be maintained by the parent README, or they may have READMEs with new maintainers.
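One possible way to handle nested folders is to attribute each sub-path to the closest README.md at or above it. The sketch below only illustrates that lookup; the function name `nearest_readme` and the "closest README wins" rule are assumptions, not something decided in this thread.

```python
from pathlib import Path


def nearest_readme(path, repo_root):
    """Return the closest README.md at or above `path`, staying inside `repo_root`."""
    root = Path(repo_root).resolve()
    current = Path(path).resolve()
    if current.is_file():
        current = current.parent
    # Walk upwards, but never look outside the repository root.
    while current == root or root in current.parents:
        candidate = current / "README.md"
        if candidate.is_file():
            return candidate
        if current == root:
            break
        current = current.parent
    return None
```

A sub-folder with its own README.md would then override the parent's maintainers, while one without it would inherit them.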

Finally, the w3id https://w3id.org/vocab is already reserved, so we would have to choose a new one.

davidlehn commented 8 months ago

I'm not quite sure why people use READMEs at all! :-) So much easier to put a few minimal comments in the .htaccess file. Who reads those READMEs? Many/all people without wildcards in their rules forget to disallow access to the READMEs so they are often live, I think.

dgarijo commented 8 months ago

Well, I think people just follow the instructions we put in: https://github.com/perma-id/w3id.org#creating-a-new-identifier :)

TallTed commented 8 months ago

[@davidlehn]

  • I'm not quite sure why people use READMEs at all! :-) So much easier to put a few minimal comments in the .htaccess file. Who reads those READMEs? Many/all people without wildcards in their rules forget to disallow access to the READMEs so they are often live, I think.

As @dgarijo says, w3id.org instructions are a big reason, and I think it important to consider why those instructions exist before bashing or trashing them.

.htaccess comments are not always obvious to less-technical humans, and often include far more technical discussion than the higher-level summary that is typically put into README.md files, which are meant for humans from the start. Even when lower-level technical details are included in the README.md, they are generally presented in a more human-friendly style than what is found in most (if not all) .htaccess files.

I think it's good for those README.md files to be in the open, for various reasons, including but not limited to exposure via search engines, which can lead people to w3id.org itself as well as to the content behind the w3id.org-based redirects.

bfabio commented 8 months ago
1. Custom inline format in the source files
   Example:
   `##TESTv1 '/mypath --header "Accept: text/html"' "https://my-target-domain.com/test.html"` in the `.htaccess` or `maintainer:https://github.com/abc123` in the `README.md` file.
   Pro: Minimalistic, no duplication; users just extend existing files and add information in intuitive locations
   Con: Regex-based machine readability is limited and may lead to errors

@simontaurus thanks for the recap. I actually prefer option 1 over the others, especially over option 2. I know that a loosely structured, non-semantic format is probably not a popular opinion here :smile:.
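To make option 1 concrete, a `##TESTv1` comment could be parsed with shell-style quoting rules rather than a hand-written regexp. This is only a sketch: the two-field layout (request arguments, expected target) is inferred from the single example above, and `parse_test_line` is a hypothetical name.

```python
import shlex

PREFIX = "##TESTv1"


def parse_test_line(line):
    """Parse one ##TESTv1 comment into (request_args, expected_target).

    Returns None for lines that are not test comments. Raises ValueError
    when the quoting is unbalanced or the field count is wrong.
    """
    head, _, rest = line.strip().partition(" ")
    if head != PREFIX:
        return None
    # shlex.split applies POSIX shell quoting, so the single-quoted request
    # and the double-quoted target each come back as one field.
    fields = shlex.split(rest)
    if len(fields) != 2:
        raise ValueError(f"expected 2 fields, got {len(fields)}")
    request, target = fields
    # Split the request again to separate the path from options like --header.
    return shlex.split(request), target
```

With this approach a dropped closing quote surfaces as a `ValueError` instead of a silently wrong match, which addresses part of the "may lead to errors" concern.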

I'd also add that we can simplify even more, without a new format: if we are willing to be GitHub/GitLab specific, @username already means "that user" in Markdown files on those forges, and a regexp for extracting it is straightforward and error-proof.
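A regexp along those lines could look like the sketch below. The pattern is a simplification chosen for illustration (letters, digits and hyphens after the `@`, not preceded by a word character or dot so that e-mail addresses are skipped); it is not the forges' own mention grammar, and `extract_mentions` is a hypothetical name.

```python
import re

# "@" must not follow a word character or dot, which skips e-mail addresses.
MENTION = re.compile(r"(?<![\w.])@([A-Za-z0-9][A-Za-z0-9-]*)")


def extract_mentions(markdown_text):
    """Return the mentioned usernames in order of first appearance, without duplicates."""
    return list(dict.fromkeys(MENTION.findall(markdown_text)))
```

Note that this would also pick up mentions inside code blocks or quoted text, so a README convention (for example a dedicated "Maintainers" line) might still be needed.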

@dgarijo

I would go for option 2, probably with a yaml-ld based representation. We can ask people to annotate the block in a certain manner, e.g., ```yaml-ld or similar. And ask people to provide both their names and github id.

Looking at option 2: it has the same con as option 1, but worse: we'd still need a regexp to extract that block, with a far more complex syntax within the block (e.g., we'd need to escape ```).

Regarding the rule testing issue, I didn't investigate it in depth, but it makes sense to me to have the tests close to the actual rule.

  2. Structured inline format in the README.md file microformat as suggested by bfabio. Should be RDF serialized e.g. as json-ld or yaml-ld since we are already in the linked data domain. Example:
{
  "@context": "https://w3id.org/vocab",
  "@type": "W3IdSubpath",
  "maintainer": [
    "https://github.com/<some_user>",
    "..."
  ],
  "tests": [
    {
      "@type": "RedirectTest",
      "url": "https://w3id.org/...",
      "headers": "...",
      "expected": "..."
    }
  ]
}

Pro: Embedded data is machine readable, no custom format, no additional file; data can be stored in multiple locations in the document.
Con: More complex; embedded data needs to be extracted from the README.md file (e.g. by filtering all code blocks with format json-ld, parsing, and filtering by @type).

I feel something this "complex" could lead us into a world of pain with maintenance. For starters, we'd need a validator, and even then, I bet the support requests will increase. "Why doesn't it work? Oh, there's a comma missing. Oh, it's "maintainer", not "maintainers". Did I close that parenthesis?"
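For a sense of scale, a validator for the embedded block might look like the sketch below. The key names (`@type`, `@context`, `maintainer`, `tests`, `url`, `expected`) come from the example above; which of them are required is an assumption made here for illustration, and `validate` is a hypothetical name.

```python
import json

REQUIRED = {"@type", "maintainer"}  # assumed to be mandatory
KNOWN = REQUIRED | {"@context", "tests"}


def validate(raw):
    """Return a list of human-readable problems; an empty list means the block passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = [f"missing key: {key!r}" for key in sorted(REQUIRED - set(data))]
    # Unknown keys catch typos such as "maintainers" instead of "maintainer".
    errors += [f"unknown key: {key!r}" for key in sorted(set(data) - KNOWN)]
    tests = data.get("tests", [])
    if not isinstance(tests, list):
        return errors + ["'tests' must be a list"]
    for i, test in enumerate(tests):
        if not isinstance(test, dict):
            errors.append(f"tests[{i}]: must be an object")
            continue
        for key in ("url", "expected"):
            if key not in test:
                errors.append(f"tests[{i}]: missing key {key!r}")
    return errors
```

Messages like these could be posted back on a pull request, though they would not remove the support burden described above.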

Let's keep in mind sometimes those files are maintained by people new to the world of semantic data - or the world of open collaboration for that matter.

@davidlehn

Many/all people without wildcards in their rules forget to disallow access to the READMEs so they are often live, I think.

I think READMEs are really useful for describing the context and for serving as a sort of "landing page" for the path. I guess there should be a global rule to not publish them, but having them nicely formatted and editable within the GitHub UI makes them a great answer to "what is this abstract/technical stuff all about?".