Claim Extraction in Solar Energy News Articles

DistilledCode commented 1 month ago

Team

202318013 - Vrishmi Parikh
202318030 - Mahmood Topiwala
202318039 - Anurag Shukla
202318056 - Tanaz Pathan

Problem Statement

This project aims to develop a model capable of extracting claims from solar energy-related news articles. We will either create a novel model from scratch or fine-tune an existing model to achieve this goal. The project will involve the creation of a specialized dataset, establishing baseline performance, and implementing an iterative active learning approach to continuously improve the model's accuracy.

What is a Claim?

A statement that can be verified as true or false
An assertion about the world that expresses a belief or opinion
An arguable proposition

Dataset Construction and Baseline Establishment

Dataset Creation: Construct a new, specialized dataset of solar energy news articles.
Test Set Development: A portion of the dataset will be manually annotated and verified to create a test set.
Baseline Performance: Existing general claim extraction models will be evaluated on this dataset to establish baseline performance metrics.
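
A minimal sketch of how the test portion could be set aside before any annotation starts, so the same held-out articles are used for the baselines and for every later model. The function name `split_holdout`, the 10% fraction, and the seed are illustrative assumptions, not values fixed by the proposal.

```python
import random

def split_holdout(article_ids, test_fraction=0.1, seed=13):
    """Shuffle ids reproducibly and reserve a fraction for the manually annotated test set.

    test_fraction and seed are placeholder values chosen for illustration.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    ids = sorted(article_ids)            # sort first so the result ignores input order
    random.Random(seed).shuffle(ids)     # local RNG, leaves global random state untouched
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]    # (pool for training/active learning, held-out test)

pool_ids, test_ids = split_holdout(range(100), test_fraction=0.1)
print(len(pool_ids), len(test_ids))
```

Fixing the seed keeps the test set stable across runs, which matters because the baseline numbers and the per-iteration numbers are only comparable if they are measured on the same articles.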

Approach to Claim Extraction

Machine Learning Classification: Train models to classify individual sentences as claims or non-claims
Sequence Labeling: Implement models that tag individual spans of text as components of claims
Rule-Based: Not pursued, as we consider it too trivial for this project.
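
To illustrate how the first two approaches differ in output, here is a small sketch that decodes span-level tags into claim strings and also derives a sentence-level label from them. The BIO tag names (`B-CLAIM`, `I-CLAIM`, `O`) and the example sentence are assumptions made up for illustration; the proposal does not fix a tagging scheme.

```python
def bio_to_spans(tokens, tags):
    """Collect contiguous B-CLAIM/I-CLAIM runs into claim strings."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-CLAIM":
            if current:                      # a new claim starts right after another one
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I-CLAIM" and current:
            current.append(token)
        else:                                # "O", or a stray I-CLAIM with no open span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:                              # flush a span that runs to the end
        spans.append(" ".join(current))
    return spans

# Made-up example sentence and tags, not taken from the dataset.
tokens = "Officials said the plant will power 40,000 homes".split()
tags = ["O", "O", "B-CLAIM", "I-CLAIM", "I-CLAIM", "I-CLAIM", "I-CLAIM", "I-CLAIM"]

claim_spans = bio_to_spans(tokens, tags)
is_claim_sentence = any(tag != "O" for tag in tags)  # what a sentence classifier would predict
print(claim_spans, is_claim_sentence)
```

Sentence classification yields only the boolean, while sequence labeling also recovers which part of the sentence carries the claim.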

Iterative Active Learning

Initial Training: Train the model on a small, high-quality subset of manually annotated data.
Automated Labeling: Use the trained model to label claims in the remaining dataset, assigning confidence scores to each prediction.
Manual Annotation: Focus human annotation efforts on instances where the model exhibits low confidence.
Iterative Improvement: Incorporate newly annotated data into the training set and retrain the model.
Performance Evaluation: Assess model improvement after each iteration using the held-out test set.
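
The loop above can be sketched as one round of uncertainty sampling. This is a sketch under stated assumptions: the confidence score is taken to be the predicted probability that a sentence is a claim, "low confidence" is taken to mean closest to 0.5, and `train`, `predict_proba`, and `annotate` are hypothetical callables standing in for the real model and the human annotators.

```python
def select_for_annotation(probabilities, budget):
    """Return indices of the `budget` predictions closest to 0.5 (least confident)."""
    ranked = sorted(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:budget]

def active_learning_round(labeled, unlabeled, train, predict_proba, annotate, budget):
    """Run one iteration: train, score the pool, send uncertain items to annotators."""
    model = train(labeled)                                   # hypothetical training callable
    probs = [predict_proba(model, text) for text in unlabeled]
    picked = set(select_for_annotation(probs, budget))
    newly_labeled = [(unlabeled[i], annotate(unlabeled[i])) for i in sorted(picked)]
    remaining = [text for i, text in enumerate(unlabeled) if i not in picked]
    return labeled + newly_labeled, remaining

# Toy stand-ins so the sketch runs end to end; real components would replace these.
toy_scores = {"s1": 0.97, "s2": 0.52, "s3": 0.08, "s4": 0.45, "s5": 0.71}
labeled, remaining = active_learning_round(
    labeled=[("s0", True)],
    unlabeled=list(toy_scores),
    train=lambda data: None,
    predict_proba=lambda model, text: toy_scores[text],
    annotate=lambda text: True,
    budget=2,
)
print(labeled, remaining)
```

Evaluation on the held-out test set would follow each round; it is left out here because it depends on the model interface.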

Evaluation Strategy

Test Set: Models are evaluated on our own manually annotated, held-out test set.
Metrics: F1, Precision, Recall
Comparison against baseline models
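
For reference, a minimal sketch of the three metrics for the sentence-level setting, assuming binary labels where True means "claim". The function name and the toy labels are illustrative only and are not project results.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 for the positive (claim) class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # claims correctly found
    fp = sum(1 for g, p in pairs if not g and p)      # non-claims flagged as claims
    fn = sum(1 for g, p in pairs if g and not p)      # claims that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels for illustration only.
gold = [True, True, False, True, False]
pred = [True, False, False, True, True]
precision, recall, f1 = precision_recall_f1(gold, pred)
print(precision, recall, f1)
```

Running the same function on the baseline models' predictions and on each retrained model gives directly comparable numbers.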

Dataset

News articles (~110k) scraped from the web.

Resources

parth126 commented 1 month ago

Suggested Read: https://arxiv.org/pdf/2207.02522

Currently, the problem seems a bit trivial, as it is mainly based on API calls to LLMs. I suggest rethinking the topic.

parth126 commented 1 month ago

Suggested adding experiments on a few known datasets.
At least one claim extraction method is expected to be implemented from scratch.
