regen-network / regen-registry-standards

:seedling: RDF and SHACL schemas for Regen Registry
4 stars 1 forks source link

ruuts-c04-metadata #57

Closed S4mmyb closed 11 months ago

S4mmyb commented 1 year ago

Description

Author Checklist

All items are required. Please add a note to the item if the item is not applicable and please add links to any relevant follow up issues.

I have...

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add your handle next to the items reviewed if you only reviewed selected items.

I have...

S4mmyb commented 1 year ago

@wgwz @blushi this PR contains (or will contain) all the metadata for the Ruuts credit class, pilot project, and credit batch. So far I've only finished the Credit Class metadata, so a JSON-LD template, a JSON-LD ops file filled out with the appropriate info, and a new SHACL schema. If you can review what I have so far that would be great!

In creating all of this I had a few questions / comments:

cc @clevinson @ryanchristo

blushi commented 1 year ago

@wgwz @blushi this PR contains (or will contain) all the metadata for the Ruuts credit class, pilot project, and credit batch. So far I've only finished the Credit Class metadata, so a JSON-LD template, a JSON-LD ops file filled out with the appropriate info, and a new SHACL schema. If you can review what I have so far that would be great!

In creating all of this I had a few questions / comments:

  • In the JSON-LD files we use the @context section to define certain fields explicitly as lists and I'm wondering if this is actually needed or if it's redundant. As far as I understand defining expected data types/structures, like regen:ecosystemType, is something we do in the SHACL schema, meaning that in the SHACL schema we would say that regen:ecosystemType needs to be a list with certain requirements (expected data type, minCount/maxCount, etc...). With that in mind it seems redundant to me to do this again at the top of JSON-LD files, especially when we use brackets father down to indicate that it is a list-type and the validation script works just fine without including the "@container": "@list" definitions in the @context. Do you guys think it's ok to just remove that? Is there a reason it's also included there?

This is needed so we don't need to write in the actual JSON-LD data "regen:ecosystemType": { "@list": [ ... ] }. Having it just defined in the context instead makes it more readable and we decided that this should be the standard. This is compacted JSON-LD, please read more about it here: https://www.w3.org/TR/json-ld11-api/#compaction Also since in the SHACL schema, we say it should be a list, if you don't have the field defined as a list in the actual JSON-LD data (whether it's done through the context or directly in the rest of the data), the validation would fail. Please note that having it in both SHACL and the JSON-LD data is not redundant, they are two separate things, SHACL is meant for validating data so the data should conform to what's defined there in order to be considered valid.

  • Building on the above, I'm noticing that for pretty much every field in the JSON-LD files we explicitly define the expected datatype which we have already defined in the SHACL graph. For example to define monitoringFrequency and verificationMethod which are of schema:frequency and list(xsd:string) types respectively, we do:
  "regen:monitoringFrequency": {
    "@type": "schema:frequency",
    "@value": ""
  },  
  "regen:verificationMethod": [
    {
      "@type:": "xsd:string",
      "@value": ""
    }
  ],

rather than just doing something like this:

  "regen:monitoringFrequency": "",  
  "regen:verificationMethod": [ ],

Is there a reason for this? Is it just because we just want to know what type of element to put there?

If you put in the "@context": "regen:monitoringFrequency": { "@type": "schema:frequency" } then in the data, you can just use "regen:monitoringFrequency": "" This is the whole point of defining that in the context instead of directly in the data. But it has to stay in the JSON-LD in some way (ie here in the context) because otherwise how would you know it's a schema:frequency? Having the data doesn't mean you have access to the SHACL schema, and again SHACL is meant for validating data, not for defining the data itself.

  • In terms of overall architecture of the SHACL sub-folder, I found it a bit confusing how we were structuring and storing re-usable graph components. For instance, in the shacl/common.ttl file we store offsetGenerationMethod, ToucanURI, & ProjectActivity shape elements, but we store methodology shapes independently in the shacl/methodologies/methodology.ttl. Similarly, I know we use the NameURL shape in Credit Class, Project, and Batch level SHACL schemas and JSON-LD files, but the actual NameURL shape lives in the shacl/projects/common.ttl file and not in something more universal. With that said, I'm wondering if we want to do a bit of restructuring / rearchitecting of this repo and break things out into more composable elements. Thoughts?

cc @clevinson @ryanchristo

Yeah I guess we could make sure that the shapes that are being used in different classes are being put in shacl/common.ttl instead

S4mmyb commented 1 year ago

Awesome, thanks so much @blushi ! That's super helpful context on why we are including things and makes sense to me now. I was just a bit confused earlier and wanted to understand why we used the @context to pre-define datatypes. I just updated with your comments and it looks a lot cleaner already! Hopefully will be moving onto the project and batch metadata soon

wgwz commented 1 year ago
  • In terms of overall architecture of the SHACL sub-folder, I found it a bit confusing how we were structuring and storing re-usable graph components. For instance, in the shacl/common.ttl file we store offsetGenerationMethod, ToucanURI, & ProjectActivity shape elements, but we store methodology shapes independently in the shacl/methodologies/methodology.ttl. Similarly, I know we use the NameURL shape in Credit Class, Project, and Batch level SHACL schemas and JSON-LD files, but the actual NameURL shape lives in the shacl/projects/common.ttl file and not in something more universal. With that said, I'm wondering if we want to do a bit of restructuring / rearchitecting of this repo and break things out into more composable elements. Thoughts?

Yeah I guess we could make sure that the shapes that are being used in different classes are being put in shacl/common.ttl instead

I'm open to ideas about how to restructure things because it is easy to do but I wouldn't want to do too much of this. I think the suggestion of moving NameURL out of shacl/projects/common.ttl into shacl/common.ttl makes good sense though.

clevinson commented 1 year ago

Is it worth reconsidering our use of SHACL schemas for credit class metadata?

As @blushi indicated in her earlier comment, SHACL schemas are primarily about dataset validation. I don't think we ever actually will need to validate that a given dataset is a valid "C04-CreditClass" shape, as there will only ever be one instance of such a dataset (the one dictated here in this PR). And if we ever are making upgrades to the C04-CreditClass JSON-LD, then it will likely be in a way that our previous SHACL schema for C04-CreditClass isn't accounting for (e.g. adding a new field to the JSON-LD).

I think a more appropriate use of SHACL in the context of registry standards would be:

Some docs / references for what I'm talking about w/ inheritance:

If we do want to make this change across all credit classes, then that probably warrants a separate PR. But if we are aligned with this approach, then I think we can probably strip the SHACL schema from this Ruuts specific PR (unless @S4mmyb you want to draft a proposed set of updates to the regen:CreditClassVersionShape SHACL schema (as a sidenote- I'd be curious to understand why we call this regen:CreditClassVersion instead of regen:CreditClass)

blushi commented 1 year ago

Is it worth reconsidering our use of SHACL schemas for credit class metadata?

As @blushi indicated in her earlier comment, SHACL schemas are primarily about dataset validation. I don't think we ever actually will need to validate that a given dataset is a valid "C04-CreditClass" shape, as there will only ever be one instance of such a dataset (the one dictated here in this PR). And if we ever are making upgrades to the C04-CreditClass JSON-LD, then it will likely be in a way that our previous SHACL schema for C04-CreditClass isn't accounting for (e.g. adding a new field to the JSON-LD).

I think a more appropriate use of SHACL in the context of registry standards would be:

  • we have a singular SHACL schema for credit class metadata. All credit class metadata JSON-LD's must be valid with respect to this SHACL schema

Indeed this is one thing that I was proposing in my PR: https://github.com/regen-network/regen-registry-standards/pull/56#discussion_r1231073593 because right now, we are just repeating more or less the same constraints in all credit classes SHACL schemas. But in addition to a generic credit class SHACL schema, we might still need credit class specific schemas to document any additional fields that might only appear in certain credit class metadata (eg regen:carbonOffsetStandard in C02 or regen:tokenizationSource in C03), although agreed this might not be really used, only for documentation purposes...

  • we have generic SHACL schemas defined for the the following:

    • project (regen:ProjectShape)
    • credit batch (regen:CreditBatchShape)
  • Additionally, we would have credit-class specific SHACL schemas ccdefined for each of:

    • project (inheriting from the generic regen:ProjectShape)
    • credit batch (inheriting from the generic regen:CreditBatchShape)

👍

Some docs / references for what I'm talking about w/ inheritance:

If we do want to make this change across all credit classes, then that probably warrants a separate PR. But if we are aligned with this approach, then I think we can probably strip the SHACL schema from this Ruuts specific PR (unless @S4mmyb you want to draft a proposed set of updates to the regen:CreditClassVersionShape SHACL schema (as a sidenote- I'd be curious to understand why we call this regen:CreditClassVersion instead of regen:CreditClass)

The regen:CreditClassVersion naming is referencing our registry-server postgres table credit_class_version that was meant to support multiple versions of a credit class initially but this should be considered legacy, this is basically what I did in my currently open PR: https://github.com/regen-network/regen-registry-standards/pull/56/files#diff-aee3c11f2f6cabf346b956fc928f729c4c984b5f18832b583f1af9fee5c825ed

blushi commented 11 months ago

While working on https://github.com/regen-network/regen-registry/issues/1756, I've realized there are a few issues in the C04 metadata, see my comments above

blushi commented 11 months ago

Making this PR ready for review. @clevinson and I agreed to just refactor this for now to fit the latest repo structure (no jsonld example for classes and only class specific constraint in the related class SHACL schema), since the current metadata has already been submitted on-chain as is. I've also verified that the C04 on-chain metadata IRI regen:13toVgBisQqEmauHntQsW6mwpz71RSTwsjzhn9gxCQ16tdRjHPHTRoK.rdf matches the C04 jsonld in this PR. The other suggestions/fixes will be submitted as part of another PR (new issue: https://github.com/regen-network/regen-registry-standards/issues/65)