heyodai / cs5530-project

Project for CS 5530 class
GNU General Public License v3.0

Phase 1 - Finding a Training Dataset #7

Closed by heyodai 1 year ago

heyodai commented 1 year ago

Summary: Identify potential datasets for training an AI to summarize research papers

Description: We need to find a dataset that includes a large number of research papers in various domains. The dataset should be publicly available and labeled with summaries for each paper. This ticket involves researching potential datasets, selecting the most appropriate one for our project, and documenting the findings.

heyodai commented 1 year ago

arXiv has made the metadata for their entire catalog (2.2 million papers) available via Kaggle: https://www.kaggle.com/datasets/Cornell-University/arxiv

The dataset contains only metadata (including each paper's abstract, which can serve as its summary) rather than full text, because the full paper archive is ~1.1 TB. Even the metadata alone comes to 3.65 GB. We can use API calls to grab papers as needed for training.
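Since the file is too big to load comfortably in one go, here's a rough sketch of how we could stream it one record at a time. This assumes the Kaggle download is a JSON-lines file (one JSON object per line) with `id`, `title`, `abstract`, and `categories` fields; I haven't verified that against the actual download, so treat the field names as assumptions. `iter_abstracts` and `category_prefix` are just names I made up for illustration.

```python
import json
from typing import Iterator, Optional


def iter_abstracts(path: str, category_prefix: Optional[str] = None) -> Iterator[dict]:
    """Yield one {'id', 'title', 'abstract'} dict per paper without loading the whole file.

    Assumes a JSON-lines file with 'id', 'title', 'abstract', and a
    space-separated 'categories' string (assumed field names, not verified).
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            categories = record.get("categories", "").split()
            # Optionally keep only papers with a matching category, e.g. "cs."
            if category_prefix and not any(
                c.startswith(category_prefix) for c in categories
            ):
                continue
            yield {
                "id": record.get("id"),
                # Collapse the hard line breaks that often appear in these fields
                "title": " ".join(record.get("title", "").split()),
                "abstract": " ".join(record.get("abstract", "").split()),
            }
```

Something like `for paper in iter_abstracts("path/to/metadata.json", "cs."):` would then give us the computer science abstracts one by one (the path is a placeholder), and we could use each `id` to request the full paper only when we actually need it.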

This seems to be the best dataset by far that's publicly accessible. Any objections to using it?

cc: @m-nweke @orangedoor

heyodai commented 1 year ago

I spoke with @orangedoor today before class. We agreed that research papers are problematic because there's not a good bullet-point summary dataset to train on.

We're looking at summarizing news articles instead. Here's a dataset that seems like a good candidate: https://www.kaggle.com/datasets/sbhatti/news-summarization