uncefact / spec-untp

UN Transparency Protocol
https://uncefact.github.io/spec-untp/
GNU General Public License v3.0

Sustainability Vocabulary Design #12

Open onthebreeze opened 5 months ago

onthebreeze commented 5 months ago

This ticket is intended to provoke discussion about the form of the sustainability vocabulary component of UNTP. https://uncefact.github.io/spec-untp/docs/specification/Vocabularies. The intent is that this vocabulary provides a standardised topic map for classification of all ESG claims in product passports and conformity credentials. If done right, this topic map will facilitate the aggregation of product / shipment level data to enterprise / facility level data that is needed for corporate sustainability disclosures. Therefore, a good understanding of those corporate disclosures is a pre-requisite for designing a useful UNTP sustainability vocabulary. Some handy references:

In short, there's lots of work already done on sustainability vocabularies. What tends to have the most impact is legislation, so the ESRS is a good reference. But as the UN we should also map to the SDGs and the UN Environment Programme topics. Fortunately that work seems to have been done for us by UNEP - Excel sheet attached.

ESRS-UNEP-FI-Topics-Mapping.xlsx

So, my current thinking is that we should recognise that ESRS is a leading example of legislated requirements that will drive global behaviour - and that it is already informed by the best of GRI and IFRS. And, since UNEP has already done the hard work of creating an ESRS topic map and mapping it to UN classifications, we could do worse than just make a JSON-LD vocabulary out of the UNEP / ESRS topic map - it's likely to be the best starting point.
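For illustration, here is a minimal sketch (using Python and rdflib, with placeholder namespaces and IRIs that are not proposed UNTP identifiers) of how one topic from the UNEP / ESRS mapping might be expressed as a SKOS concept and serialised as JSON-LD:

```python
# Minimal sketch (not the official UNTP vocabulary): turning one row of the
# UNEP / ESRS topic mapping into a SKOS concept and serialising it as JSON-LD.
# All namespaces and IRIs below are illustrative placeholders.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, SKOS

UNTP = Namespace("https://example.org/untp/sustainability-vocab/")  # placeholder
ESRS = Namespace("https://example.org/esrs/")                        # placeholder
SDG = Namespace("https://example.org/un-sdg/")                       # placeholder

g = Graph()
topic = UNTP["climate-change-mitigation"]
g.add((topic, RDF.type, SKOS.Concept))
g.add((topic, SKOS.prefLabel, Literal("Climate change mitigation", lang="en")))
# Cross-references to the source classifications in the UNEP / ESRS mapping
g.add((topic, SKOS.closeMatch, ESRS["E1"]))   # ESRS E1 - Climate change
g.add((topic, SKOS.closeMatch, SDG["13"]))    # SDG 13 - Climate action

print(g.serialize(format="json-ld"))
```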

ReLOG3PSNE commented 5 months ago

Hi @onthebreeze and All, with our startup we are ourselves very much focused on combining the most renowned and recognised tools, frameworks, standards and best practices around 3P sustainability, and we fully agree with / align to Steve's comment above. On our end, we would recommend/suggest eventually also considering the below, yet, as suggested by Steve, within the UNEP umbrella / equivalence mapping:

Here we can contribute

onthebreeze commented 5 months ago

Also relevant : https://www.linkedin.com/posts/david-carlin7_nature-tnfd-biodiversity-activity-7157771371970707456-N_85/?utm_source=share&utm_medium=member_ios

Semwodg commented 5 months ago

Steve, in response to your suggestion that ESRS-UNEP-FI-Topics-Mapping.xlsx be used, I created an RDF graph of that mapping.

Initial commits can be viewed in the following repo:

https://github.com/Semwodg/ESRS-UNEP_FI_TopicsMapping

Happy to showcase this using an instance of Stardog Cloud where I have a couple of queries saved and can demonstrate the value. I've also generated JSON-LD versions of the model and the data in the repo linked above.
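As a rough illustration of the kind of query that becomes possible once the spreadsheet is an RDF graph, here is a minimal sketch assuming the JSON-LD export is available locally; the file name, namespaces and use of skos:closeMatch are assumptions and may not match the actual model in the repo:

```python
# Minimal sketch, assuming a local copy of the mapping repo's JSON-LD export.
# The file name and the property choices are illustrative only.
from rdflib import Graph

g = Graph()
g.parse("esrs_unep_fi_topics_mapping.jsonld", format="json-ld")  # hypothetical local copy

# Example query: list topics linked to ESRS climate-related topics via skos:closeMatch.
query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?esrsTopic ?mappedTopic
WHERE {
  ?esrsTopic skos:prefLabel ?label ;
             skos:closeMatch ?mappedTopic .
  FILTER(CONTAINS(LCASE(STR(?label)), "climate"))
}
"""
for row in g.query(query):
    print(row.esrsTopic, "->", row.mappedTopic)
```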

JohnOnGH commented 5 months ago

Very interesting discussion around vocabulary, terminology and semantics. Got me thinking about two things:

1) If/when "we" (the Rec 49 team members) create such a mapping, who looks after it afterwards? Who "vouches" for it? Who "authorises" it? There will be (there already are) many such mappings - who says which one is right or has "authority" in a specific jurisdiction?

2) Support and maintenance post definition. Once the UNTP (and any associated artefacts) are written, how will they be supported and maintained?

Semwodg commented 5 months ago

Thinking the same thing, John. The points below outline how I've engaged with the issues you're raising in past projects.

My guess is that if the authoritative custodians of the vocabularies won't engage with the curation of the vocab mapping (e.g. for maintenance/change-management), even when provided with simple web UIs to run that maintenance (including user access management run at UNTP), then UNTP would need to maintain the mapping and seek "acceptance" from the mapped vocabulary custodians.

Management of the mapping, including web UI capability (user friendly, not for expert ontologists, with dynamic workflow management), is available out of the box from some vendors (e.g. TopBraid EDG from TopQuadrant). Alternatively, UNTP could build the interfaces using available JS (e.g. a web application connected to the quadstore via API), assuming that each change will be exported from the quadstore in the required serialisation (e.g. JSON-LD) and published for end users.
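To make the last step concrete, a minimal sketch of the export-and-publish stage, assuming the approved vocabulary can be fetched as RDF from some endpoint or dump (the URL below is a placeholder):

```python
# Minimal sketch of the "export and publish" step. The source URL is a placeholder;
# in practice this would be whatever export the quadstore exposes.
from rdflib import Graph

g = Graph()
g.parse("https://example.org/untp/vocab/latest.ttl", format="turtle")  # hypothetical export

# Publish the same content in the serialisations end users need.
g.serialize(destination="untp-vocab.jsonld", format="json-ld")
g.serialize(destination="untp-vocab.ttl", format="turtle")
```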

Standard (and free) tools such as Protégé can be employed by third parties to develop changes to vocabs; those changes can then be submitted to a governance board run by, or attended by, UNTP. That assumes that controlled vocabularies are managed as RDF.

This however is only guessing on my part.

The Open Geospatial Consortium (OGC) engages industry and standards custodian orgs to build mappings for controlled vocabularies, and does so by employing a "testbed" approach, including paying the winners of a competitive bidding process to run the mapping using OGC methodology.

If the reference data from controlled vocabs and mapped vocabs is used in machine-automated processes in the real world, then the vocabulary values used are likely to be guard or trigger conditions in process logic. If so, users of mapped vocabularies may build dependencies on them within their enterprise. This is especially likely when trading across borders, where mapped controlled vocabularies are required. So curatorial activities will be required along with release management. Those (controlled vocab) values will also be used in reporting and in the creation of evidence to support ESG claims. At least this is what I'm imagining.

These are some suggested approaches to maintaining mappings across controlled vocabularies, and I believe we need to be thinking about this before publishing mapped vocabularies, or immediately after doing so. But again, I'm really guessing here.

ReLOG3PSNE commented 5 months ago

@onthebreeze & All, not only do I agree with @JohnOnGH & @Semwodg on the need, but I also like/support both suggestions shared by @Semwodg based on hands-on experience. In particular, the open, co-creational and, in relation to the OGC, the "gaming" and motivating approach sounds very good to me, especially when we are talking about a topic that requires (as we are all discussing/aware) an important cultural change in several areas.

Side question to @Semwodg, for my own understanding: I liked the demo of the graph in Stardog Cloud very much and, since I have no experience with the tool you presented, I was wondering / seeking confirmation whether it could also be paired with AI/ML/data science tools. Yesterday you mentioned the lengthy activities needed in case of manual intervention/oversight; I am thinking of further possible scaling-up developments where we might increase the level of automation/digitisation, while keeping the "human oversight" approach you mentioned.

Semwodg commented 5 months ago

@ReLOG3PSNE I'm pleased you liked the demo of the linked vocabs in Stardog Cloud. It was quite trivial and didn't attempt to apply best practice in vocabulary mapping; it instead focused on a direct conversion from a spreadsheet that itself carried minimal meaning. I was really seeking to point out the shortfalls of such an approach (taking a two-dimensional structure and blindly converting it to a multi-dimensional structure).

In answer to your question about pairing with AI/ML/data science tools, the simple answer is yes. But I suspect more explanation is needed.

Regarding scalability, a quadstore can be thought of as a "smart index" from a typical RDBMS perspective. It builds what is referred to as a TBox (the ontology, or semantic schema) and an ABox (the actual data values). This supports "data virtualisation", where the actual data values are stored in a data lake, file system, object store or similar, and the data elements have persistent URIs. An OLTP system need only be sure to persist the persistent URI, which is resolvable via some connector/protocol. Stardog has connectors for most cloud provider/vendor solutions in this space, but so do other quadstore management platforms.
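A minimal sketch of that TBox/ABox split (class, property and file names are illustrative placeholders, not part of any UNTP model):

```python
# Minimal sketch of the TBox/ABox split described above; all names and IRIs are
# illustrative placeholders.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

EX = Namespace("https://example.org/untp/")  # placeholder namespace

# TBox: the semantic schema.
tbox = Graph()
tbox.add((EX.EmissionsClaim, RDF.type, RDFS.Class))
tbox.add((EX.evidence, RDF.type, RDF.Property))
tbox.add((EX.evidence, RDFS.domain, EX.EmissionsClaim))

# ABox: instance data. The heavy payload stays in external storage (data lake,
# object store, etc.) and is referenced only by a persistent IRI that resolves
# via whatever connector/protocol is in place.
abox = Graph()
claim = EX["claim/0001"]
abox.add((claim, RDF.type, EX.EmissionsClaim))
abox.add((claim, EX.evidence, URIRef("https://files.example.org/reports/0001.pdf")))  # placeholder
```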

It is worth taking a look at the IRI guidelines at linked.data.gov.au to see a standard for applying IRIs in a linked data ecosystem.

Regarding ML integration, I'm currently a Stardog customer and am helping with beta testing of Voicebox, which is Stardog's approach to querying a KG using a vector DB/LLM to enable natural language querying. I'm testing for hallucination tendencies. It is doing OK, but it's early days. The Voicebox capability also enables ontology creation via natural language input (it constructs SPARQL queries), but I'm not currently testing that, so I have yet to develop an opinion. What I note, however, is that using ontology inputs to a vector DB fed by an LLM reduces the need to train the LLM, as the ontology delivers the context for answering questions about a particular domain (lots to explore in this regard).

I have created, and have at hand, a governance ontology that interacts with authorisation/authentication services (like OAuth, SAML, etc.) and enables governed curation of ontology artefacts, and there is much automation available in that space (including notifications and release processing). Vendors like TopQuadrant have preconfigured capability in this space, and it's very good. I'm currently tinkering with a provenance graph for changes to ontologies/vocabs, and just catching up with some of the good folks at OGC.

Re: scalability and performance, there are some outstanding candidate quadstore vendors, like RDFox, who deliver performance to some huge OLTP systems in finance.

For lightweight implementations that can scale to the requirements of a high-throughput OLTP architecture, there are certainly scalable/performant options available in the marketplace, if that is required.

Scalability has multiple aspects depending on real world demands, but yes I think most use cases are already catered for today in implemented capabilities in the marketplace.

Having said all that, it is my firm belief that for matters around scalability and performance, nothing beats modular semantic information architecture that is informed by real world users.

Semwodg commented 5 months ago

Oops - where I said "persistent URL" I should have said "persistent IRI".

ReLOG3PSNE commented 4 months ago

Hi @Semwodg, thanks a lot for getting back to my query(ies) with such a detailed yet clear explanation!

With regard to our UNTP work, I feel that the Stardog tool could have a possible role in supporting clear communication and explainability of HOW we'll do / did our work, i.e. down the road in the specification section?

Just a thought from my side

seewodg commented 3 months ago

@onthebreeze I like your diagram showing an upper level vocabulary because it effectively enables UNTP to take a custodianship role that is confined largely to resolving mapping disputes, while the custodians of the lower level managed vocabularies do the mapping to the upper level vocab. Those custodians of the lower level managed vocabs can also assist with development of the upper level vocabulary, including requesting extensions where it fails to adequately cater for existing lower level vocabs.

Effectively, UNTP will take a governance role over changes to the upper level ontology (once a basic model is built) and provide governance services to assist mapping activities. This could potentially be managed using GitHub.

Once lower level vocabularies are mapped, meaning can be resolved via the transitive journey across the upper level vocab between any two lower level vocabularies, as sketched below. Profiles, rules and constraints to assist processes like validation and conformance can then be applied, taking account of the implemented mappings.
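A minimal sketch of that transitive resolution, assuming skos:broadMatch links from lower level concepts to the upper level vocab (all IRIs are illustrative placeholders):

```python
# Two lower level concepts from different schemes are both mapped to the same
# upper level concept, so a query can relate them transitively. IRIs are placeholders.
from rdflib import Graph, Namespace
from rdflib.namespace import SKOS

UPPER = Namespace("https://example.org/untp/upper/")   # placeholder upper level vocab
ESRS = Namespace("https://example.org/esrs/")          # placeholder lower level vocab A
GRI = Namespace("https://example.org/gri/")            # placeholder lower level vocab B

g = Graph()
g.add((ESRS["E1"], SKOS.broadMatch, UPPER["climate"]))
g.add((GRI["305"], SKOS.broadMatch, UPPER["climate"]))

# Find lower level concepts in other schemes that share an upper level concept with ESRS E1.
related = g.query("""
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?other WHERE {
  <https://example.org/esrs/E1> skos:broadMatch ?upper .
  ?other skos:broadMatch ?upper .
  FILTER(?other != <https://example.org/esrs/E1>)
}
""")
for row in related:
    print(row.other)
```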

The trick will be to get the level of abstraction right in the upper level vocabulary, but that should get easier once several lower level vocabularies are mapped.

Of course this could involve several actual upper level vocabs, as your diagram seems to indicate.

seewodg commented 3 months ago

Does anybody have an opinion regarding the use of SKOS to model upper level vocabularies?

I'm thinking that SKOS will enable standardised mapping, including accommodation for structure, by managing concepts as classification schemes rather than as class or property axioms, as is the case with OWL. In reference to "concepts", the SKOS documentation states:

Note that these are facts about the thesaurus or classification scheme itself, such as "concept X has preferred label 'Y' and is part of thesaurus Z"; these are not facts about the way the world is arranged within a particular subject domain, as might be expressed in a formal ontology.

If we go down that road, we perhaps should also consider DCAT (Data Catalogue Vocabulary). DCAT namespaces include the SKOS namespace.

Mapping would target the classification of concepts using a thematic approach.
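To make that concrete, a minimal sketch (again with placeholder IRIs, not proposed UNTP identifiers) of an upper level vocabulary modelled as a SKOS concept scheme:

```python
# Minimal sketch of an upper level vocabulary modelled with SKOS, as suggested above.
# Scheme and concept IRIs are illustrative placeholders.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, SKOS

UPPER = Namespace("https://example.org/untp/upper/")  # placeholder namespace

g = Graph()
scheme = UPPER["sustainability-topics"]
g.add((scheme, RDF.type, SKOS.ConceptScheme))
g.add((scheme, SKOS.prefLabel, Literal("UNTP sustainability topics", lang="en")))

climate = UPPER["climate"]
g.add((climate, RDF.type, SKOS.Concept))
g.add((climate, SKOS.prefLabel, Literal("Climate", lang="en")))
g.add((climate, SKOS.inScheme, scheme))
g.add((climate, SKOS.topConceptOf, scheme))

mitigation = UPPER["climate-mitigation"]
g.add((mitigation, RDF.type, SKOS.Concept))
g.add((mitigation, SKOS.prefLabel, Literal("Climate change mitigation", lang="en")))
g.add((mitigation, SKOS.broader, climate))
g.add((mitigation, SKOS.inScheme, scheme))

print(g.serialize(format="turtle"))
```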

Abstract from DCAT:

DCAT enables a publisher to describe datasets and data services in a catalog using a standard model and vocabulary that facilitates the consumption and aggregation of metadata from multiple catalogs. This can increase the discoverability of datasets and data services. It also makes it possible to have a decentralized approach to publishing data catalogs and makes federated search for datasets across catalogs in multiple sites possible using the same query mechanism and structure. Aggregated DCAT metadata can serve as a manifest file as part of the digital preservation process.

If the concept of an upper level vocabulary in my previous message, or some similar approach, is employed, DCAT may be the way to go. UNTP could even deploy the free Apache Jena Fuseki to test the approach; it supports reading/writing JSON-LD and could also be useful for producing SHACL for validation if that helps with any use cases.
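As a rough illustration, a minimal sketch of describing a published vocabulary as a dcat:Dataset within a dcat:Catalog (all IRIs and URLs below are placeholders):

```python
# Minimal sketch of a DCAT description for a published vocabulary, so it can be
# discovered and aggregated across catalogs. IRIs and URLs are placeholders.
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

EX = Namespace("https://example.org/untp/catalog/")  # placeholder namespace

g = Graph()
catalog = EX["catalog"]
dataset = EX["sustainability-vocab"]
dist = EX["sustainability-vocab-jsonld"]

g.add((catalog, RDF.type, DCAT.Catalog))
g.add((catalog, DCAT.dataset, dataset))

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("UNTP sustainability vocabulary", lang="en")))
g.add((dataset, DCAT.distribution, dist))

g.add((dist, RDF.type, DCAT.Distribution))
g.add((dist, DCAT.mediaType, URIRef("https://www.iana.org/assignments/media-types/application/ld+json")))
g.add((dist, DCAT.downloadURL, URIRef("https://example.org/untp/vocab/latest.jsonld")))  # placeholder

print(g.serialize(format="json-ld"))
```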

That is a lot to chew on so apologies if I'm premature with any of this.

seewodg commented 3 months ago

For those (like me) who need to view examples to resolve their understanding, regarding DCAT I recommend looking at some of the DCAT GitHub examples.