samirchowd / NewsChain

CS 329 Project

scrape.py #2

Open samirchowd opened 3 years ago

samirchowd commented 3 years ago

Problem Definition

Given the following APIs:

Generate a corpus of articles with the following constraints:

We define an article as a tuple of (title, summary, publication date, source):

Proposed approach

  1. Register all API keys and store them in a `config.py` file.
  2. Create a script that iterates through each API and maximizes its daily API call limit (see the loop sketch below).
  3. All the articles returned from each query should be stored in a CSV.
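
A minimal sketch of steps 2 and 3, assuming a placeholder `fetch_articles` per API and purely illustrative daily quotas (the real limits depend on which APIs make the list above):

```python
# scrape.py -- sketch of the per-API loop; fetch_articles is a hypothetical
# stand-in for each API's client code, and the quota numbers are illustrative.
import csv

DAILY_LIMITS = {"nytimes": 500, "newsapi": 100}  # illustrative, not real quotas
QUERIES = ["U.S. Politics", "Basketball", "Covid"]

def fetch_articles(api_name, query, limit):
    """Placeholder: call `api_name` with `query`, return up to `limit` article dicts."""
    return []  # each dict should carry title, summary, date, source

def scrape_all(out_path="articles.csv"):
    # Append mode so the CSV can be written to continuously across runs.
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for api_name, limit in DAILY_LIMITS.items():
            per_query = limit // len(QUERIES)  # spread the daily quota across queries
            for query in QUERIES:
                for art in fetch_articles(api_name, query, per_query):
                    writer.writerow([art["title"], art["summary"],
                                     art["date"], art["source"], query])
```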

The CSV should be a five-column table (title, summary, publication date, source, query), where the query is the category the article falls under (e.g., U.S. Politics, Basketball, Covid).

The CSV should support continuous appending and be easy to read back.

Parameterize the number of API calls.

Consider using pandas to hold articles in memory temporarily, then use `DataFrame.to_csv()` to write them out quickly.
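
A minimal sketch of that pandas route, assuming the five-column schema above (`flush_batch` and the column names are hypothetical):

```python
import pandas as pd

COLUMNS = ["title", "summary", "publication_date", "source", "query"]

def flush_batch(articles, out_path="articles.csv", first_write=False):
    """Append a batch of article dicts held in memory to the CSV on disk."""
    df = pd.DataFrame(articles, columns=COLUMNS)
    # mode="a" gives the continuous-write behavior; write the header only once.
    df.to_csv(out_path, mode="w" if first_write else "a",
              header=first_write, index=False)
```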

Functional Representation

Input: query
Output: CSV of article data
Function: call the APIs

f(query) = csv, where query ∈ {'U.S. Politics', 'Basketball', 'Covid'}
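
Read as a signature, the same mapping could look like this (a stub; the return type and category literals just mirror the set above):

```python
from typing import Literal

Query = Literal["U.S. Politics", "Basketball", "Covid"]

def f(query: Query) -> str:
    """Return the path of the CSV holding articles matching `query` (stub)."""
    return "articles.csv"
```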

ebdelca commented 3 years ago

Sample output from the NYTimes API on the search term "covid 19". Format: headline, abstract, author, publication date, source, url. Attachment: output.zip
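
For reference, a hedged sketch of the request that could produce output in that format, assuming the public NYTimes Article Search v2 endpoint (`API_KEY` is a placeholder):

```python
import requests

API_KEY = "..."  # placeholder; use the key registered in config.py
URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

resp = requests.get(URL, params={"q": "covid 19", "api-key": API_KEY})
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    # Fields matching the sample format: headline, abstract, author,
    # publication date, source, url.
    print(doc["headline"]["main"], doc.get("abstract"),
          (doc.get("byline") or {}).get("original"),
          doc.get("pub_date"), doc.get("source"), doc.get("web_url"),
          sep=" | ")
```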
