opendatahub-io-contrib / data-mesh-pattern

Data Mesh Pattern
https://opendatahub-io-contrib.github.io/data-mesh-pattern
Apache License 2.0

Rule engine for data transformation #73

Open caldeirav opened 1 year ago

caldeirav commented 1 year ago

Maintenance of data taxonomies should ideally be done in a standard format, with the ability to build rules for data equivalence between different formats. This would be particularly useful for ESG taxonomy mapping. Without the ability to maintain mappings in a one-dimensional format, a lot of maintenance is required for pairwise cross-mappings, for example:

https://github.com/OS-SFT/Taxonomy-Mappings-Library

This issue is to investigate a better way to maintain mappings in order to support the taxonomy equivalence project run within OS-Climate.
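To illustrate why one-dimensional maintenance scales better than pairwise cross-mappings, here is a minimal Python sketch: each taxonomy is mapped once to a shared pivot concept, and any cross-mapping is derived on demand. The taxonomy codes and pivot concept names below are made up for illustration.

```python
# Hypothetical sketch: deriving pairwise cross-mappings from per-taxonomy
# mappings to a shared pivot, instead of maintaining every taxonomy pair.
# All codes and concept names here are invented for illustration.

# Each taxonomy is maintained one-dimensionally: its codes -> pivot concepts.
to_pivot = {
    "GRI":  {"GRI-302-1": "energy_consumption", "GRI-2-7": "employee_count"},
    "SASB": {"SASB-IF-EU-000.B": "energy_consumption"},
    "EDCI": {"EDCI-FTE": "employee_count"},
}

def cross_mapping(src: str, dst: str) -> dict[str, list[str]]:
    """Derive src -> dst code mappings via the shared pivot concepts."""
    # Invert the destination taxonomy: pivot concept -> destination codes.
    inverse: dict[str, list[str]] = {}
    for code, concept in to_pivot[dst].items():
        inverse.setdefault(concept, []).append(code)
    return {
        code: inverse[concept]
        for code, concept in to_pivot[src].items()
        if concept in inverse
    }

print(cross_mapping("GRI", "SASB"))
# {'GRI-302-1': ['SASB-IF-EU-000.B']}
```

With N taxonomies this means maintaining N mapping tables instead of N(N-1)/2 cross-mapping tables.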

@MichaelTiemannOSC

caldeirav commented 1 year ago

@sabrycito @jzwerg as discussed we can start by documenting some examples of taxonomy mappings with different degrees of complexity and attach them to this issue.

jzwerg commented 1 year ago

@caldeirav well noted - @sabrycito will share the available taxonomy mappings today, together with the documentation explaining our methodology and findings.

These taxonomies will include:

We are currently working on adding 1 global framework (TCFD), 1 global standard (ISSB), and 1 regulatory standard (EET), which we will share as well in the coming weeks.

We also encourage the broader OS-C community to contribute to this work stream with additional global, regulatory, exchange, or industry standards.

jzwerg commented 1 year ago

As discussed during yesterday's meeting, please find the presentation attached. ESG Equivalence - Explanation.pdf

caldeirav commented 1 year ago

Upon initial analysis we propose two possible approaches for rule-based integration to be demonstrated in a POC:

We have decided with @sabrycito and @jzwerg to prototype the first solution with dbt, based on data inputs/outputs provided by U-Reg.

caldeirav commented 9 months ago

@sabrycito / @jzwerg Could you provide some use cases for us to prototype this, please? Ideally, let's keep this information as a link shared in this issue.

sabrycito commented 9 months ago

- Case 0-0:

As a user, I want to understand the equivalence between "Disclosure A: 'report the percentage of total employees covered by collective bargaining agreements'" and "Disclosure B: 'Percentage of active workforce covered under collective bargaining agreements'", recognizing that they represent identical semantics expressed with different linguistic constructions.

- Case 1-2 and (2-1):

As a user, I need to compare "Disclosure A: '(1) Total energy consumed, (2) percentage heavy fuel oil, (3) percentage renewable'" with "Disclosure B: 'Total energy consumption'", noting that A provides all necessary data for B, but B only partially satisfies A without the heavy fuel oil and renewable components.

As a user, I want to understand how to compute the data in "Disclosure B: 'Total number of employees, percentage contractors'" from "Disclosure A: 'Total workforce by gender, employment type (for example, full- or part-time), age group and geographical region'", and recognize that while B can be derived from A, A requires additional information not present in B.

- Case 1-3 (and 3-1):

As a user, I am interested in understanding how "Disclosure A: 'Risks and opportunities posed by climate change that have the potential to generate substantive changes in operations, revenue, or expenditure, including a description of the risk or opportunity and its classification as either physical, regulatory, or other'" compares to "Disclosure B: 'TRANSITION RISK: Has the customer faced or expected to face any impact from policy and regulation risk?'", and how the response to B does not provide insight for A, yet A's information can be used for B. Similarly, how "Disclosure A: 'Description of efforts in solar energy system project development to address community and ecological impacts'" relates to "Disclosure B: 'Focus areas of contribution (e.g. education, environmental concerns, labour needs, health, culture, sport)'", noting that B does not aid in fulfilling A.

- Case 2-2:

As a user, I want to compare "Disclosure A: 'report the total number of employees, and a breakdown of this total by gender and by region'" with "Disclosure B: 'Number of employees, number of truck drivers'", acknowledging that A to B lacks information on truck drivers, and B to A lacks the breakdown by gender and region.

As a user, I aim to understand how "Disclosure A: 'Total fuel consumption within the organization from renewable sources, in joules or multiples, and including fuel types used.'" compares to "Disclosure B: 'Fleet fuel consumed, percentage renewable'", recognizing the need for computation to reconcile the differences in the data provided by each disclosure.

- Case 2-3 (and 3-2):

As a user, I need to understand the relationship between "Disclosure A: 'describe the process for designing its remuneration policies and for determining remuneration, including whether remuneration consultants are involved in determining remuneration and, if so, whether they are independent of the organization, its highest governance body and senior executives'" and "Disclosure B: 'Information on: the policies relating to compensation and dismissal, recruitment and promotion, working hours, rest periods, equal opportunity, diversity, anti-discrimination, and other benefits and welfare.'", acknowledging that A provides some but not all the information in B, and vice versa.

- Case 3-3:

As a user, I am interested in comparing "Disclosure A: 'Discussion of engagement processes and due diligence practices with respect to human rights, indigenous rights, and operation in areas of conflict'" with "Disclosure B: 'describe its specific policy commitment to respect human rights, including the internationally recognized human rights that the commitment covers'", recognizing that both involve human rights but focus on different aspects, leading to a possible but not guaranteed overlap.
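The case labels above could be captured in a small record type for each disclosure pairing. The following Python sketch is illustrative only; the framework codes in the example are invented, and the real master model may store additional fields.

```python
# Illustrative sketch: one way to record an equivalence pairing between two
# disclosures, using the case labels defined above (e.g. "1-2" = disclosure A
# fully satisfies B, while B only partially satisfies A).
from dataclasses import dataclass

@dataclass(frozen=True)
class EquivalencePair:
    source_framework: str   # e.g. "GRI"
    source_code: str        # disclosure ID within the source framework
    target_framework: str   # e.g. "SASB"
    target_code: str        # disclosure ID within the target framework
    case: str               # equivalence case label: "0-0", "1-2", ..., "3-3"

    @property
    def exact(self) -> bool:
        """Case 0-0: identical semantics expressed with different wording."""
        return self.case == "0-0"

# Hypothetical pairing matching Case 0-0 above (collective bargaining
# coverage); the disclosure codes here are made up.
pair = EquivalencePair("GRI", "GRI-2-30", "SASB", "SASB-SV-PS-330a.1", "0-0")
print(pair.exact)   # True
```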

jzwerg commented 9 months ago

Hi Vincent,

In addition to the methodology above, you can find below our implementation and observations. I have attached the SASB, GRI, and EDCI examples for reference.

Implementation

We initiated our solution by harvesting disclosure information from various ESG frameworks, often facing unstructured data in the form of PDF files from framework websites. Leveraging Python PDF libraries, we extracted and organized this data, adding structure by introducing a unique ID for each disclosure. (see "2. Structured ESG Standards" below)
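The structuring step (separate from the PDF extraction itself, which depends on the library used) can be sketched as assigning each extracted disclosure a stable, unique ID. The ID scheme below is an assumption for illustration, not U-Reg's actual scheme.

```python
# Sketch of the structuring step only: given raw disclosure texts for one
# framework (already extracted from PDF), assign each a unique ID.
# The "<FRAMEWORK>-<NNNN>" ID scheme is a made-up example.
def structure_disclosures(framework: str, texts: list[str]) -> list[dict]:
    return [
        {"id": f"{framework}-{i:04d}", "framework": framework, "text": t.strip()}
        for i, t in enumerate(texts, start=1)
    ]

rows = structure_disclosures("SASB", [
    "Total energy consumed, percentage renewable",
    "Number of employees, number of truck drivers",
])
print(rows[0]["id"])   # SASB-0001
```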

U-Reg endeavoured to group these disclosures into clusters with high equivalence probability. Utilizing an NLP-based algorithm, we employed a domain-specific BERT model (BERT was originally developed by Google) for text mining in the sustainable investing sector. This resulted in our ESG disclosures being classified into 26 distinct labels. (see "4. Exploration" below)

U-Reg adopted a multipronged approach for creating equivalence pairings, in alignment with our corporate philosophy to leverage open-source technologies whenever feasible. U-Reg formulated user stories to define 4 levels of equivalence. (refer to Sabry's paragraph above)

U-Reg is working on an algorithm to identify potentially equivalent disclosures, and on sentence-similarity algorithms to assess the similarity between disclosure pairs on a scale of 0 to 1. (see "4. Exploration" below)
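The 0-to-1 scoring idea can be shown with a toy stand-in. The real implementation uses BERT-style embeddings as described above; the sketch below substitutes a simple bag-of-words cosine similarity purely to illustrate the scale.

```python
# Toy stand-in for the sentence-similarity step (the actual solution uses a
# domain-specific BERT model): cosine similarity over token-count vectors,
# which also yields scores in [0, 1].
import math
from collections import Counter

def similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors, in [0, 1]."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

same = similarity("Total energy consumption", "total energy consumption")
diff = similarity("Total energy consumption", "number of truck drivers")
print(round(same, 2), round(diff, 2))   # 1.0 0.0
```

An embedding-based model would additionally score paraphrases highly (Case 0-0 above), which bag-of-words overlap cannot do; that is precisely why BERT is used in practice.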

The culmination of these efforts will be an all-encompassing database (i.e., master model), each entry comprising the source and target framework, source and target disclosure code, and equivalence level. (see "3. Equivalence" below)

1. Original ESG Standards

- SASB Standards: https://sasb.org/standards/download/
- GRI Universal Standards 1, 2, 3: https://www.globalreporting.org/how-to-use-the-gri-standards/gri-standards-english-language/
- EDCI Data Submission Template: https://www.esgdc.org/metrics/

2. Structured ESG Standards

- SASB: sasb.xlsx
- GRI: gri.xlsx
- EDCI: edci.xlsx

3. Equivalence

- 1-1 Equivalence: 1_to_1_equivalence_GRI_SASB_EDCI.csv
- Master Model: master_GRI_SASB_EDCI.xlsx

4. Exploration

- Text Similarity: bert_similarity.xlsx
- Clustering: bert_classification_3_labels_1_to_1_GRI_SASB_EDCI.xlsx

Observations

Before initiating automation, U-Reg conducted a thorough manual equivalence mapping across seven different frameworks, identifying over 7,000 equivalence pairs. This extensive manual database serves as a robust training dataset for testing our various algorithm implementations.

Given the unique nature of our project, we acknowledge that the data for testing our algorithms is finite: 7,000 data points might not be massive in traditional machine learning, but for our context it is significant. Data augmentation through the creation of random ESG disclosures presents challenges, as does sourcing external equivalence data for testing our models, primarily because we are pioneering ESG equivalence and fellow companies rarely share their results openly.

Nevertheless, U-Reg has employed this training set to test the developed technology. By comparing the results of our developed solutions with our training set, we can ascertain the effectiveness of our models. We opted for technologies showing a high degree of alignment with our training dataset. For instance, with sentence similarity, we favour algorithms that assign high scores to our equivalence pairs and low scores for unrelated disclosures, thereby reducing false positives and false negatives.

Moreover, as less than 1% of disclosure pairs are truly equivalent, we've decided not to rely solely on accuracy as a success criterion. Instead, we've chosen to prioritize the reduction of false negatives over false positives. It's more beneficial to identify non-equivalent disclosure pairs and manually eliminate them, rather than miss pairs that are indeed equivalent.

For this reason, U-Reg seeks to maximize the Recall score over the Precision score in our algorithm evaluations. We've placed greater weight on the identification of true equivalences versus non-equivalences in our machine learning explorations. This decision will, we believe, ultimately prove the effectiveness of our solution.
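The recall-over-precision trade-off above can be made concrete with a quick computation. The confusion counts below are invented for illustration; the point is that, under heavy class imbalance, a model can be acceptable with modest precision (false positives are pruned manually) as long as recall stays high (true equivalences are not missed).

```python
# Invented confusion counts, for illustration only: with <1% of disclosure
# pairs truly equivalent, recall (not accuracy) is the score to maximize.
def precision(tp: int, fp: int) -> float:
    """Of the pairs flagged as equivalent, how many really are."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of the truly equivalent pairs, how many were found."""
    return tp / (tp + fn)

tp, fp, fn = 90, 60, 10   # hypothetical results on a labelled pair set
print(f"precision={precision(tp, fp):.2f} recall={recall(tp, fn):.2f}")
# precision=0.60 recall=0.90
```

Here 60 false positives cost some manual review, but only 10 of 100 true equivalence pairs are missed, which matches the stated preference for reducing false negatives.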