nikwilms / ESG-Score-Prediction-from-Sustainability-Reports

This repository contains code and data for a machine learning model that predicts ESG (Environmental, Social, and Governance) scores based on sustainability reports and company data. It's a valuable resource for researchers, investors, and sustainability professionals interested in ESG score prediction using machine learning techniques.
MIT License

Feature engineering #80

Closed mariusbosch closed 12 months ago

mariusbosch commented 1 year ago

Given the nature of your data and the problem at hand (predicting ESG scores based on sustainability reports), here's a prioritized approach:

NER Type Distribution:
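As a sketch of the counting described below, assuming the NER column stores per-document lists of (entity_text, entity_label) pairs such as spaCy's `doc.ents` would yield; the data and label names here are illustrative:

```python
from collections import Counter

import pandas as pd

# Toy stand-in for the real data: one row per report, NER output as
# (entity_text, entity_label) pairs.
df = pd.DataFrame({
    "ner": [
        [("UNEP", "ORG"), ("$2m", "MONEY"), ("Berlin", "GPE")],
        [("30%", "PERCENT"), ("GRI", "ORG"), ("GRI", "ORG")],
    ]
})

def label_counts(entities):
    """Count how often each NER label occurs in one document."""
    return Counter(label for _, label in entities)

# Expand the per-document Counters into one count_* column per label.
counts = df["ner"].apply(label_counts).apply(pd.Series).fillna(0).astype(int)
counts = counts.add_prefix("count_")  # e.g. count_ORG, count_MONEY
df = df.join(counts)
```

The resulting count_* columns can feed straight into a regression model.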

Sustainability reports often discuss various entities like organizations, locations, monetary values, percentages, and dates. The distribution of these entities can be indicative of the focus of the report. For instance, frequent mentions of monetary values might suggest discussions about investments or financial impacts.
How: Parse the NER column to count occurrences of each type and create new columns like count_organizations, count_locations, etc.

Specific Entity Presence:
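A minimal sketch of such presence flags; the watchlist below is purely illustrative and would need to be curated for real reports:

```python
# Hypothetical watchlist of influential ESG entities (illustrative only).
WATCHLIST = ["GRI", "UNEP", "Science Based Targets"]

def presence_flags(entities, watchlist=WATCHLIST):
    """Return has_<name> -> 0/1 flags for one document's (text, label) entities."""
    texts = {text for text, _ in entities}
    return {f"has_{name}": int(name in texts) for name in watchlist}

doc = [("UNEP", "ORG"), ("$2m", "MONEY")]
flags = presence_flags(doc)
# flags -> {"has_GRI": 0, "has_UNEP": 1, "has_Science Based Targets": 0}
```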

There might be certain organizations, standards, or terms that are more influential in the ESG world. Mentioning a recognized environmental agency or adhering to a global sustainability standard might be significant.
How: Create binary columns indicating the presence of specific influential entities in each document.

Entity Aggregation:
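A rough sketch of the aggregation; real-world money strings vary a lot, so the parser here is deliberately simplistic and only meant to show the shape of the feature:

```python
import re

def parse_amount(text):
    """Very rough parser for MONEY/PERCENT strings like '$2.5m' or '30%'.
    A real pipeline needs a much more careful parser; this is only a sketch."""
    match = re.search(r"([\d.]+)", text.replace(",", ""))
    if not match:
        return 0.0
    value = float(match.group(1))
    lowered = text.strip().lower()
    if lowered.endswith(("m", "million")):
        value *= 1e6
    elif lowered.endswith(("bn", "billion")):
        value *= 1e9
    return value

def total_money(entities):
    """Sum all parsed MONEY mentions in one document's (text, label) pairs."""
    return sum(t for t, label in ((parse_amount(t), l) for t, l in entities) if label == "MONEY")

doc = [("$2m", "MONEY"), ("$500,000", "MONEY"), ("30%", "PERCENT")]
# total_money(doc) -> 2_500_000.0
```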

If you have entities that represent monetary values or percentages, aggregating them might provide insights — for instance, the total mentioned investment in sustainable technologies, or the percentage reduction in emissions.
How: Parse and sum up all monetary values or percentages for each document.

Entity Sentiment Analysis:
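A toy sketch of the idea: the word-list scorer below stands in for a real sentiment model (e.g. VADER or a fine-tuned transformer), and the lexicons are illustrative:

```python
# Toy lexicon standing in for a real sentiment model (illustrative only).
POSITIVE = {"improve", "reduce", "achieve", "prevent"}
NEGATIVE = {"spill", "violation", "disaster", "fine"}

def sentence_score(sentence):
    """Crude sentiment: positive-word count minus negative-word count."""
    words = sentence.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def entity_sentiment(sentences, entity_texts):
    """Average toy sentiment over sentences mentioning any named entity."""
    hits = [s for s in sentences if any(e in s for e in entity_texts)]
    if not hits:
        return 0.0
    return sum(sentence_score(s) for s in hits) / len(hits)

sentences = ["ACME will reduce emissions", "The market grew"]
# entity_sentiment(sentences, ["ACME"]) -> 1.0
```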

The sentiment around certain mentions can be crucial. A report might mention environmental disasters, but what matters is the sentiment around how they are being handled or prevented.
How: Use a sentiment analysis tool to get sentiment scores for sentences containing named entities, then average or sum these scores per document.

Count of Named Entities:
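A short sketch returning both total and distinct counts, again assuming (text, label) entity pairs per document:

```python
def entity_counts(entities):
    """Total and distinct entity mentions for one document."""
    return {
        "n_entities": len(entities),
        "n_unique_entities": len({text for text, _ in entities}),
    }

doc = [("GRI", "ORG"), ("GRI", "ORG"), ("Berlin", "GPE")]
# entity_counts(doc) -> {"n_entities": 3, "n_unique_entities": 2}
```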

A simple count can be indicative. Reports rich in named entities might be more detailed or have more partnerships and engagements.
How: Count the total number of named entities for each document.

Named Entity Co-occurrence:
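A sketch counting document-level co-occurrence (both entities appear anywhere in the same document); a sentence- or window-level variant would be stricter:

```python
from collections import Counter
from itertools import combinations

def pair_counts(entities):
    """Count co-occurring entity-text pairs within one document.
    Texts are sorted so ('A', 'B') and ('B', 'A') map to the same feature."""
    texts = sorted({text for text, _ in entities})
    return Counter(combinations(texts, 2))

doc = [("ACME", "ORG"), ("net zero", "MISC"), ("ACME", "ORG")]
counts = pair_counts(doc)
# counts[("ACME", "net zero")] == 1
```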

If there are combinations of entities that are particularly meaningful, this can be captured as well — for instance, if the co-mention of a company with certain sustainability terms is significant.
How: Count occurrences of specific entity pairs in each document.

One-Hot Encoding or Embedding:
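A sketch of one-hot encoding restricted to the most frequent entities, which keeps the dimensionality bounded; the tiny corpus here is illustrative:

```python
from collections import Counter

docs = [
    [("UNEP", "ORG"), ("GRI", "ORG")],
    [("GRI", "ORG"), ("Berlin", "GPE")],
]

# Keep only the most frequent entity texts across the corpus.
TOP_K = 2
freq = Counter(text for doc in docs for text, _ in doc)
vocab = [text for text, _ in freq.most_common(TOP_K)]

def one_hot(entities, vocab=vocab):
    """0/1 vector over the vocabulary for one document."""
    texts = {text for text, _ in entities}
    return [int(v in texts) for v in vocab]

matrix = [one_hot(d) for d in docs]
```

An embedding variant would replace the 0/1 vector with, e.g., averaged pretrained entity vectors.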

This can be useful but might also add a lot of dimensions to your data. Use it if specific entities are very influential and the dataset isn't too large.
How: One-hot encode the most common entities, or use embeddings to convert them to dense vectors.

Steps:

1. Start by adding features based on the first two or three points.
2. Train a baseline regression model.
3. Add features from subsequent points incrementally, retraining the model each time.
4. Monitor model performance (using a validation set) to ensure each set of features provides a benefit.

Remember, feature engineering is as much art as it is science. The key is to iterate and validate, ensuring each addition improves the model's performance on unseen data.
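The iterate-and-validate loop above can be sketched generically; `train_and_score` is a placeholder for your actual fit-on-train, score-on-validation routine, and the toy scorer only exists to make the example runnable:

```python
def evaluate_feature_sets(feature_groups, train_and_score):
    """Add feature groups one at a time, keeping a group only if the
    validation score improves over the best seen so far."""
    kept, best = [], float("-inf")
    for name, cols in feature_groups:
        candidate = kept + cols
        score = train_and_score(candidate)
        if score > best:
            kept, best = candidate, score
    return kept, best

# Toy scoring function: pretends 'count_ORG' helps and 'noise' doesn't.
def toy_score(cols):
    return len([c for c in cols if c != "noise"])

groups = [("ner_counts", ["count_ORG"]), ("junk", ["noise"])]
kept, best = evaluate_feature_sets(groups, toy_score)
# kept == ["count_ORG"], best == 1
```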