Closed heyodai closed 1 year ago
arXiv has made the metadata for their entire catalog (2.2 million papers) available via Kaggle: https://www.kaggle.com/datasets/Cornell-University/arxiv
The dataset contains only summaries, because the full paper archive is ~1.1 TB. Even the summaries alone come to 3.65 GB. We can use API calls to grab full papers as needed for training.
This seems to be the best dataset by far that's publicly accessible. Any objections to using it?
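For reference, here is a rough sketch of how we might stream the metadata file without loading all 3.65 GB into memory. It assumes the Kaggle download is a JSON Lines file (one JSON object per line) with `id`, `title`, `abstract`, and `categories` fields; check the dataset page before relying on those names. The helper names `iter_abstracts` and `pdf_url` are made up for illustration, and the `https://arxiv.org/pdf/<id>` URL pattern is also an assumption about how we would fetch a full paper on demand.

```python
import json
from typing import Iterator


def iter_abstracts(path: str, category_prefix: str = "") -> Iterator[dict]:
    """Yield one small dict per paper, reading the file line by line.

    Assumes a JSON Lines file whose records carry "id", "title",
    "abstract", and a space-separated "categories" string.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            categories = record.get("categories", "").split()
            # Keep the record only if some category matches the prefix.
            if category_prefix and not any(
                c.startswith(category_prefix) for c in categories
            ):
                continue
            yield {
                "id": record.get("id"),
                "title": record.get("title"),
                "abstract": record.get("abstract"),
            }


def pdf_url(arxiv_id: str) -> str:
    """Build a full-paper URL (assumed pattern, not verified here)."""
    return f"https://arxiv.org/pdf/{arxiv_id}"
```

Something like `for paper in iter_abstracts("arxiv-metadata-oai-snapshot.json", "cs."):` would then walk only the computer science papers; that filename is what I'd expect from the Kaggle download, but treat it as a placeholder.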
cc: @m-nweke @orangedoor
I spoke with @orangedoor today before class. We agreed that research papers are problematic because there isn't a good dataset of bullet-point summaries to train on.
We're looking at summarizing news articles instead. Here's a dataset that seems like a good candidate: https://www.kaggle.com/datasets/sbhatti/news-summarization
Summary: Identify potential datasets for training an AI to summarize research papers
Description: We need to find a dataset that includes a large number of research papers in various domains. The dataset should be publicly available and labeled with summaries for each paper. This ticket involves researching potential datasets, selecting the most appropriate one for our project, and documenting the findings.