As mentioned in the group chat, the first tentative dataset to explore is academic papers. The goal of this issue is to build a pipeline that can download academic papers from the web. To do so, the pipeline needs two elements:
Crawler
Scraper
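To make the crawler/scraper split concrete, here is a minimal sketch assuming we target the arXiv API (`export.arxiv.org/api/query`, which returns an Atom feed); the query parameters and feed layout shown are illustrative, and the final source is still an open decision below.

```python
# Sketch only: crawler = find paper metadata + PDF links, scraper = fetch PDFs.
# Assumes the arXiv Atom API as the source; this is not yet decided.
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def parse_feed(xml_text):
    """Crawler half: extract (title, pdf_url) pairs from an Atom feed."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.iter(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="").strip()
        pdf_url = None
        for link in entry.iter(ATOM + "link"):
            # arXiv marks the PDF link with title="pdf"
            if link.get("title") == "pdf":
                pdf_url = link.get("href")
        papers.append((title, pdf_url))
    return papers

def download(url, path):
    """Scraper half: fetch one PDF to disk (network call, not exercised here)."""
    with urllib.request.urlopen(url) as resp, open(path, "wb") as out:
        out.write(resp.read())
```

The point of keeping the two halves separate is that the crawler can be swapped per source (arXiv, Google Scholar, ...) while the scraper stays the same.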
Before implementation begins, we first have to decide on:
the website to crawl (Google Scholar, arXiv, etc.)
which papers to download (all of them, only the last year, a specific field, etc.)
the format in which to store the downloaded data
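Whatever we decide, it may help to pin the three choices down as one explicit, validated config object so the pipeline code does not hard-code them. A rough sketch; the field names and allowed values here are hypothetical placeholders, not decisions:

```python
# Hypothetical config capturing the three open decisions; names/values
# are placeholders until the choices above are made.
from dataclasses import dataclass
from typing import Optional, Tuple

SOURCES = {"arxiv", "google_scholar"}        # website to crawl
FORMATS = {"pdf", "text", "json"}            # storage format

@dataclass(frozen=True)
class CrawlConfig:
    source: str                  # which website to crawl
    year_from: Optional[int]     # None = no date filter (download all)
    fields: Tuple[str, ...]      # () = every field
    storage_format: str          # how downloaded data is stored

    def __post_init__(self):
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source}")
        if self.storage_format not in FORMATS:
            raise ValueError(f"unknown format: {self.storage_format}")
```

Validating up front means a typo in the source or format fails immediately rather than mid-crawl.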
Useful Resources
Since many papers are stored in PDF format, the following tools can help with manipulating and extracting information from PDFs:
Completion Criteria