saadsharif / ttds-group

TTDS Group Project

Dataset Extraction #6

Open enzo-inc opened 2 years ago

enzo-inc commented 2 years ago

Description

As mentioned in the group chat, the first tentative dataset to explore is that of academic papers. The goal of this issue is to create a pipeline that can download academic papers from the web. To do so, the pipeline needs two elements:

  1. Crawler
  2. Scraper
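To make the split between the two elements concrete, here is a minimal sketch using only the Python standard library. It assumes the crawler's job is to discover links to PDFs on a listing page and the scraper's job is to fetch them; the function names `find_pdf_links` and `download_pdf` are hypothetical, and nothing here reflects a decision the group has made about target sites.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_pdf_links(html, base_url):
    """Crawler step (hypothetical helper): absolute URLs whose path ends in .pdf."""
    collector = _LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        url = urljoin(base_url, href)  # resolve relative links against the page URL
        if urlparse(url).path.lower().endswith(".pdf") and url not in links:
            links.append(url)
    return links


def download_pdf(url, out_dir):
    """Scraper step (hypothetical helper): fetch one PDF and save it under out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / Path(urlparse(url).path).name
    with urlopen(url, timeout=30) as response:
        target.write_bytes(response.read())
    return target
```

A real crawler would also need rate limiting, robots.txt handling, and a way to follow pagination, all of which depend on which source we pick.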

Before implementation begins, we first have to decide on:

Useful Resources

As a lot of papers are stored in PDF format, the following tools can help with manipulating / extracting information from PDFs:

Completion Criteria

enzo-inc commented 2 years ago

After comparing several PDF parsers, tika appears to be the best of those tested, in terms of both processing time and the quality of the extracted text.
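For reference, a minimal sketch of what text extraction could look like, assuming "tika" here means the `tika` Python package (a client for Apache Tika, which needs a Java runtime available). The helper name `extract_text` and its `parse` parameter are hypothetical; `parse` exists only so the function can be exercised without a running Tika server.

```python
def extract_text(pdf_path, parse=None):
    """Return the plain text of a PDF, or an empty string if none was extracted."""
    if parse is None:
        # Imported lazily so the module loads even where tika is not installed.
        from tika import parser
        parse = parser.from_file
    parsed = parse(str(pdf_path))
    # The "content" entry can be None, e.g. for scanned PDFs without a text layer.
    content = parsed.get("content") or ""
    return content.strip()
```

Usage would be along the lines of `text = extract_text("paper.pdf")`, with any further cleaning (headers, references, hyphenation) left to a later step.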