nikwilms / ESG-Score-Prediction-from-Sustainability-Reports

This repository contains code and data for a machine learning model that predicts ESG (Environmental, Social, and Governance) scores based on sustainability reports and company data. It's a valuable resource for researchers, investors, and sustainability professionals interested in ESG score prediction using machine learning techniques.
MIT License

Tweak LDA #79

Closed mariusbosch closed 9 months ago

mariusbosch commented 9 months ago

Including information from the LDA model can indeed enhance the predictive capabilities of your regression model, but it's essential to be cautious about not introducing collinearity or excessive noise. Here's what you might consider:

Topic Keywords:

Instead of adding the words of each topic directly as features (which would greatly increase dimensionality and might introduce noise), you can derive aggregated features. For example, for each document, count how many times the top 5 (or 10, or any number of) keywords from each topic appear. This gives you a measure of the "strength" or "relevance" of the topic's main themes within each document, in addition to the topic probabilities. Another approach is to create binary features: for each document's primary topic, check whether its top keywords appear in the document.
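A minimal sketch of the keyword-count idea, assuming a trained gensim `LdaModel` named `lda` and a list of tokenized documents `docs` (both hypothetical names):

```python
from collections import Counter

TOP_N = 5  # number of keywords per topic to count

def topic_keyword_counts(lda, doc_tokens):
    """Count how often each topic's top keywords appear in one document."""
    counts = Counter(doc_tokens)
    features = {}
    for topic_id in range(lda.num_topics):
        top_words = [w for w, _ in lda.show_topic(topic_id, topn=TOP_N)]
        features[f"topic_{topic_id}_kw_count"] = sum(counts[w] for w in top_words)
    return features

# One feature dict per document, ready to load into a DataFrame.
rows = [topic_keyword_counts(lda, tokens) for tokens in docs]
```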

Document Length:

The length of the document (number of words or characters) can be a useful feature: longer documents may contain more detailed information, which could influence the target variable.

Dominant Topic:

For each document, identify which topic has the highest probability and create a categorical feature indicating the dominant topic. This can then be one-hot encoded for the regression model.

Number of Topics Above a Threshold:

For each document, count how many topics have a probability above a certain threshold (e.g., 0.05). This can capture the breadth of topics discussed in a document.

Topic Entropy:

Entropy can be used to measure the "diversity" of topics in a document. A document that discusses many topics equally will have high entropy, while a document focused mainly on one topic will have low entropy.
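All three distribution-derived features above (dominant topic, count above a threshold, and entropy) can be computed from the per-document topic distribution in one pass. A sketch, again assuming a trained gensim `LdaModel` `lda` and a bag-of-words `corpus` (hypothetical names):

```python
import numpy as np

THRESHOLD = 0.05  # assumed cutoff for counting a topic as "present"

def topic_distribution_features(lda, bow):
    """Dominant topic, topic breadth, and entropy for one bag-of-words doc."""
    # minimum_probability=0 keeps every topic so the vector is complete.
    dist = dict(lda.get_document_topics(bow, minimum_probability=0.0))
    probs = np.array([dist.get(t, 0.0) for t in range(lda.num_topics)])
    return {
        "dominant_topic": int(probs.argmax()),  # categorical; one-hot later
        "n_topics_above_threshold": int((probs > THRESHOLD).sum()),
        "topic_entropy": float(-(probs * np.log(probs + 1e-12)).sum()),
    }

rows = [topic_distribution_features(lda, bow) for bow in corpus]
```

`dominant_topic` can then be one-hot encoded (e.g., with `pd.get_dummies`) before fitting the regression.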

Metadata:

If you have any additional metadata about each document (e.g., year of publication, author, source), these can also serve as valuable features, especially if there's reason to believe they might influence the target variable.

Other Text Features:

Beyond LDA, consider other text-derived features:

- Presence of specific keywords or phrases that might be relevant to the target variable.
- Text sentiment or polarity, which can be extracted using tools like TextBlob or VADER.
- Named Entity Recognition (NER) counts, or counts of specific entity types (e.g., number of organizations, persons, or locations mentioned).
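As one illustration of the sentiment item, document-level polarity with VADER (a sketch; `docs_text`, a list of raw report strings, is a hypothetical name):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# "compound" is VADER's normalized overall polarity in [-1, 1].
sentiment = [analyzer.polarity_scores(text)["compound"] for text in docs_text]
```

Note that VADER is tuned for short, informal text; for long formal reports, scoring sentence by sentence and averaging may behave better.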

Interaction Features:

Consider creating interaction features between certain key topics, or between a topic and another feature. For example, if two topics seem to jointly influence the target, an interaction term between their probabilities might capture this effect (a short sketch follows at the end of this comment).

Remember, while adding more features can improve the model, it also introduces the risk of overfitting, especially if you have a small dataset. Always monitor your model's performance on a validation set and consider using techniques like regularization or feature selection to manage complexity.
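To make the interaction idea concrete (the column names are hypothetical):

```python
# Hypothetical columns: per-document probabilities for topics 3 and 7.
df["topic_3_x_topic_7"] = df["topic_3"] * df["topic_7"]
```

If you want all pairwise terms at once, `sklearn.preprocessing.PolynomialFeatures(interaction_only=True)` generates them systematically, at the cost of dimensionality.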

mariusbosch commented 9 months ago

Given the nature of your data and the problem at hand (predicting ESG scores based on sustainability reports), here's a prioritized approach:

NER Type Distribution:

Sustainability reports often discuss various entities like organizations, locations, monetary values, percentages, and dates. The distribution of these entities can be indicative of the focus of the report; for instance, frequent mentions of monetary values might suggest discussions about investments or financial impacts.

How: Parse the NER column to count occurrences of each type and create new columns like count_organizations, count_locations, etc.
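A minimal sketch, assuming the NER column stores a list of (entity_text, entity_label) pairs per document (the column name `ner` and its layout are assumptions):

```python
from collections import Counter
import pandas as pd

def ner_type_counts(entities):
    """entities: assumed list of (entity_text, entity_label) pairs."""
    counts = Counter(label for _, label in entities)
    # spaCy-style labels; adjust to whatever your NER pipeline emits.
    return {f"count_{label.lower()}": counts.get(label, 0)
            for label in ("ORG", "GPE", "MONEY", "PERCENT", "DATE")}

ner_features = pd.DataFrame([ner_type_counts(e) for e in df["ner"]])
df = pd.concat([df, ner_features], axis=1)
```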

Specific Entity Presence:

There might be certain organizations, standards, or terms that are more influential in the ESG world; mentioning a recognized environmental agency or adhering to a global sustainability standard might be significant.

How: Create binary columns indicating the presence of specific influential entities in each document.
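For example (the watchlist and the `text` column are purely illustrative):

```python
# Illustrative watchlist of ESG-relevant bodies/standards; substitute your own.
WATCHLIST = ["GRI", "SASB", "TCFD", "EPA"]

for name in WATCHLIST:
    df[f"mentions_{name.lower()}"] = (
        df["text"].str.contains(name, regex=False).astype(int)
    )
```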

Entity Aggregation:

If you have entities that represent monetary values or percentages, aggregating them might provide insights, for instance the total mentioned investment in sustainable technologies or the percentage reduction in emissions.

How: Parse and sum up all monetary values or percentages for each document.
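A rough sketch for the percentage case (the `text` column is a hypothetical name):

```python
import re

# Pull "12.5%"-style figures out of raw text and sum them per document.
# Real reports need more careful parsing (ranges, units, context).
PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*%")

df["percent_sum"] = df["text"].apply(
    lambda t: sum(float(m) for m in PERCENT_RE.findall(t))
)
```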

Entity Sentiment Analysis:

The sentiment around certain mentions can be crucial. A report might mention environmental disasters, but the sentiment around how they are handled or prevented matters.

How: Use a sentiment analysis tool to get sentiment scores for sentences containing named entities, then average or sum these scores per document.
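A sketch of per-document entity sentiment, assuming spaCy for sentence splitting and NER plus VADER for scoring (the model choice is an assumption):

```python
import spacy
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

nlp = spacy.load("en_core_web_sm")
analyzer = SentimentIntensityAnalyzer()

def entity_sentence_sentiment(text):
    """Mean VADER compound score over sentences that mention an entity."""
    doc = nlp(text)
    scores = [analyzer.polarity_scores(sent.text)["compound"]
              for sent in doc.sents if sent.ents]
    return sum(scores) / len(scores) if scores else 0.0
```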

Count of Named Entities:

A simple count can be indicative: reports rich in named entities might be more detailed or reflect more partnerships and engagements.

How: Count the total number of named entities for each document.

Named Entity Co-occurrence:

If there are combinations of entities that are particularly meaningful, this can be added, for instance if the co-mention of a company with certain sustainability terms is significant.

How: Count occurrences of specific entity pairs in each document.
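A sketch of document-level co-mention flags (the pairs and the `text` column are illustrative, not from the repo):

```python
import pandas as pd

# Illustrative entity/term pairs; substitute combinations that matter to you.
PAIRS = [("Acme Corp", "net zero"), ("EPA", "emissions")]

def pair_flags(text):
    return {f"co_mention_{i}": int(a in text and b in text)
            for i, (a, b) in enumerate(PAIRS)}

pair_features = pd.DataFrame([pair_flags(t) for t in df["text"]])
```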

One-Hot Encoding or Embedding:

This can be useful but might also add a lot of dimensions to your data; use it if specific entities are very influential and the dataset isn't too large.

How: One-hot encode the most common entities, or use embeddings to convert them to dense vectors.

Steps:

1. Start by adding features based on the first two or three points.
2. Train a baseline regression model.
3. Add features from subsequent points incrementally, retraining the model each time.
4. Monitor model performance (using a validation set) to ensure each set of features provides a benefit.

Remember, feature engineering is as much art as it is science. The key is to iterate and validate, ensuring each addition improves the model's performance on unseen data.

mariusbosch commented 9 months ago

Given your LDA model and its current results, here's a suggested prioritization for tweaking and improving:

Number of Topics:

This is often the most influential parameter in topic modeling. You should experiment with different numbers to find a sweet spot where topics are distinct and meaningful.

How: Use grid search or tools like Optuna (which you've been using) to try out a range of topic numbers. Review the topics for each configuration and choose a number where topics are coherent and distinct.
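A minimal sketch of tuning the topic count with Optuna against c_v coherence, assuming tokenized `docs`, a gensim `dictionary`, and a bag-of-words `corpus` already exist (hypothetical names; the search range is an assumption):

```python
import optuna
from gensim.models import CoherenceModel, LdaModel

def objective(trial):
    num_topics = trial.suggest_int("num_topics", 5, 30)  # assumed range
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=5, random_state=42)
    return CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                          coherence="c_v").get_coherence()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```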

Document Length:

Very short documents can be noisy, and extremely long documents might be too diverse, diluting the topic concentration.

How: Filter out documents below a certain word threshold, or split very long documents into multiple parts. Rerun the LDA and observe changes in coherence.
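The filter itself is a one-liner (the 200-word floor is an assumption to tune against coherence):

```python
MIN_WORDS = 200  # assumed floor; adjust and re-check coherence

docs = [tokens for tokens in docs if len(tokens) >= MIN_WORDS]
```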

Text Preprocessing:

Ensuring that the text is properly cleaned and preprocessed is essential. Punctuation, stop words, and extremely frequent or rare words can all influence topic quality.

How: Review your preprocessing steps. Consider refining stopword lists, further cleaning the data, or applying more aggressive filtering of rare/frequent terms.
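For the rare/frequent-term filtering, gensim has this built in; a sketch (the cutoffs are assumptions):

```python
from gensim.corpora import Dictionary

dictionary = Dictionary(docs)
# Drop tokens in fewer than 5 documents or in more than 50% of them.
dictionary.filter_extremes(no_below=5, no_above=0.5)
corpus = [dictionary.doc2bow(tokens) for tokens in docs]
```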

Model Hyperparameters:

Besides the number of topics, other hyperparameters like alpha and beta (or eta in Gensim) can influence topic granularity and distribution.

How: Use Optuna or grid search to optimize these hyperparameters. Often, this is done in conjunction with optimizing the number of topics.
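These can be folded into the Optuna objective sketched earlier (same imports; the ranges are assumptions):

```python
def objective(trial):
    num_topics = trial.suggest_int("num_topics", 5, 30)
    alpha = trial.suggest_float("alpha", 0.01, 1.0, log=True)
    eta = trial.suggest_float("eta", 0.01, 1.0, log=True)
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                   alpha=alpha, eta=eta, passes=5, random_state=42)
    return CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                          coherence="c_v").get_coherence()
```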

Higher-order N-grams:

Bigrams and trigrams can capture more nuanced meanings and can improve topic interpretability.

How: Review your bigram/trigram models. Ensure they're capturing meaningful combinations and not just frequent co-occurrences.
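A sketch of bigram detection with gensim's `Phrases` (the thresholds are assumptions; raise them to keep only strong pairs):

```python
from gensim.models.phrases import Phrases, Phraser

bigram = Phraser(Phrases(docs, min_count=20, threshold=10.0))
docs = [bigram[tokens] for tokens in docs]
```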

Model Evaluation:

A qualitative review is essential, but quantitative metrics can guide the tuning process.

How: Besides coherence, consider perplexity and other LDA evaluation metrics. However, always ensure that quantitative improvements align with qualitative (interpretability) improvements.
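The two common quantitative checks, using the gensim imports from the earlier sketches (`heldout_corpus` is a hypothetical name):

```python
coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                           coherence="c_v").get_coherence()
# log_perplexity returns a per-word likelihood bound; a higher bound
# means lower perplexity, i.e. better held-out fit.
bound = lda.log_perplexity(heldout_corpus)
```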

Regularization:

Regularization can help in refining topic distributions.

How: In Gensim's LDA, the Dirichlet priors (alpha and eta) effectively act as built-in regularization, so understanding and possibly tweaking their impact can be beneficial.

Dynamic Topic Models:

If you believe topics change over the years in your dataset, this can be a more advanced step to capture such dynamics.

How: After stabilizing your standard LDA model, consider using Dynamic Topic Models (if your dataset spans multiple years or time periods) to capture topic evolution.
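A sketch with gensim's `LdaSeqModel` (note it is considerably slower than plain LDA; the yearly split below is purely an assumption):

```python
from gensim.models import LdaSeqModel

# time_slice counts documents per period, in chronological order.
ldaseq = LdaSeqModel(corpus=corpus, id2word=dictionary,
                     time_slice=[120, 135, 140], num_topics=10)
```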

Steps:

1. Begin with the first two or three points; these often give the most significant improvements.
2. Iterate over your LDA model, making one change at a time so you can attribute any improvements or declines in quality to specific modifications.
3. After each change, evaluate the model both quantitatively (using coherence, perplexity, etc.) and qualitatively (interpreting topics manually).
4. Once you've exhausted improvements from the initial points, proceed to the next ones.

Remember, topic modeling is iterative. It's essential to keep refining and reviewing until you achieve topics that are both coherent and representative of the underlying content.