parth126 / IT550

Project Proposals for the IT-550 Course (Autumn 2024)

Claim Extraction in Solar Energy News Articles #2

Open DistilledCode opened 1 month ago

DistilledCode commented 1 month ago


Team

Category

New Research Problem

Problem Statement

This project aims to develop a model capable of extracting claims from solar energy-related news articles. We will either create a novel model from scratch or fine-tune an existing model to achieve this goal. The project will involve the creation of a specialized dataset, establishing baseline performance, and implementing an iterative active learning approach to continuously improve the model's accuracy.

What is a Claim?

Dataset Construction and Baseline Establishment

Approach to Claim Extraction

  1. Machine Learning Classification: Train models to classify individual sentences as claims or non-claims
  2. Sequence Labeling: Implement models that tag individual spans of text as components of claims
  3. Rule-Based: Too trivial to consider
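To make the difference between the first two approaches concrete, here is a minimal sketch of how the same annotation could feed either one. It assumes a BIO tagging scheme with a single `CLAIM` label and token-index spans; neither is specified in the proposal, and the function names and example sentence are made up for illustration.

```python
from typing import List, Sequence, Tuple


def spans_to_bio(
    tokens: Sequence[str],
    spans: Sequence[Tuple[int, int]],
    label: str = "CLAIM",  # assumed label name, not taken from the proposal
) -> List[str]:
    """Convert claim spans (token indices, end exclusive) into BIO tags.

    This is the target format a sequence-labeling model would be trained on.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = f"B-{label}"      # first token of the claim span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens inside the span
    return tags


def sentence_label(tags: Sequence[str]) -> bool:
    """Collapse BIO tags into one sentence-level label.

    This is the target a sentence classifier would be trained on:
    True if the sentence contains any claim span.
    """
    return any(tag != "O" for tag in tags)


# Made-up example sentence, for illustration only.
tokens = "Officials said solar capacity doubled in 2023".split()
tags = spans_to_bio(tokens, [(2, 7)])
print(list(zip(tokens, tags)))
print("contains claim:", sentence_label(tags))
```

Annotating at the span level keeps both options open, since sentence-level labels can always be derived from span tags but not the other way around.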

Iterative Active Learning

  1. Initial Training: Train the model on a small, high-quality subset of manually annotated data.
  2. Automated Labeling: Use the trained model to label claims in the remaining dataset, assigning confidence scores to each prediction.
  3. Manual Annotation: Focus human annotation efforts on instances where the model exhibits low confidence.
  4. Iterative Improvement: Incorporate newly annotated data into the training set and retrain the model.
  5. Performance Evaluation: Assess model improvement after each iteration using the held-out test set.
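One round of the loop above could be sketched as follows. This is only an illustration under stated assumptions: the model is a binary claim/non-claim classifier that outputs a probability, confidence is measured as the distance of that probability from 0.5, and `train`, `predict_proba`, and `annotate` are hypothetical callables standing in for whatever model and annotation tooling the project ends up using.

```python
from typing import Any, Callable, List, Sequence, Tuple

Example = Tuple[str, bool]  # (sentence, is_claim)


def select_for_annotation(probs: Sequence[float], budget: int) -> List[int]:
    """Return indices of the `budget` least-confident predictions.

    probs[i] is the predicted probability that sentence i is a claim.
    Confidence is assumed to be the distance from 0.5.
    """
    confidence = [abs(p - 0.5) for p in probs]
    ranked = sorted(range(len(probs)), key=lambda i: confidence[i])
    return ranked[:budget]


def active_learning_round(
    labeled: List[Example],
    unlabeled: List[str],
    train: Callable[[List[Example]], Any],     # hypothetical: fits a model
    predict_proba: Callable[[Any, str], float],  # hypothetical: P(claim)
    annotate: Callable[[str], bool],           # hypothetical: human label
    budget: int,
) -> Tuple[List[Example], List[str]]:
    """Run steps 1-4 once and return the updated (labeled, unlabeled) pools."""
    model = train(labeled)                                   # step 1
    probs = [predict_proba(model, s) for s in unlabeled]     # step 2
    picked = set(select_for_annotation(probs, budget))       # step 3
    newly_labeled = [(unlabeled[i], annotate(unlabeled[i])) for i in sorted(picked)]
    remaining = [s for i, s in enumerate(unlabeled) if i not in picked]
    return labeled + newly_labeled, remaining                # step 4
```

Step 5 would then retrain on the returned labeled pool and score the model on the held-out test set, which must stay out of both pools so the per-iteration numbers remain comparable.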

Evaluation Strategy

Dataset

News articles (~110k) scraped from the web.

Resources

parth126 commented 1 month ago

Suggested reading: https://arxiv.org/pdf/2207.02522

Currently the problem seems a bit trivial, since it relies mainly on API calls to LLMs. I suggest rethinking the topic.

parth126 commented 1 month ago