Dataset Extraction - Githubissues

Description

As mentioned in the group chat, the first tentative dataset to explore is academic papers. The goal of this issue is to create a pipeline that can download academic papers from the web. To do so, the pipeline needs two elements:

Crawler
Scraper

Before implementation begins, we first have to decide on:

which website to crawl (Google Scholar, arXiv, etc.)
which papers to download (all, last year, by field, etc.)
what format to store the downloaded data in

Useful Resources

As a lot of papers are stored in PDF format, the following tools can help with manipulating / extracting information from PDFs:

Apache Tika
Apache PDFBox

Completion Criteria

[ ] Generate URLs to access papers specified by input criteria
[ ] Download specified papers
[ ] Store papers in usable format
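
The three criteria above could be sketched roughly as follows. This is only a minimal sketch under assumptions that have not been decided in this issue: it assumes arXiv is the chosen source and uses its public query API, it assumes "input criteria" means a subject category plus a result count, and it assumes JSON Lines as the storage format. The function names (build_query_urls, download, append_record) are hypothetical and exist only for illustration.

```python
import json
import urllib.request
from pathlib import Path
from urllib.parse import urlencode

# Assumption: arXiv is the chosen source; this is its public query endpoint.
ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_urls(category, total, page_size=100):
    """Return one arXiv API query URL per page of results for a category."""
    urls = []
    for start in range(0, total, page_size):
        params = {
            "search_query": f"cat:{category}",
            "start": start,
            "max_results": min(page_size, total - start),
            "sortBy": "submittedDate",
            "sortOrder": "descending",
        }
        urls.append(f"{ARXIV_API}?{urlencode(params)}")
    return urls


def download(url, dest):
    """Fetch url and write the raw bytes to dest (no retries or rate limiting)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        Path(dest).write_bytes(response.read())


def append_record(path, record):
    """Append one paper record as a single line of JSON (JSON Lines)."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
```

For example, build_query_urls("cs.IR", 250) yields three paged query URLs, each of which could be passed to download and its parsed entries written out with append_record. A real crawler would also need to respect the source's rate limits and terms of use, which this sketch does not handle.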

saadsharif / ttds-group

Dataset Extraction #6
