Generate a corpus of articles with the following constraints:
Articles may be dated no earlier than January 2019 and no later than March 2021
Articles should pertain only to "U.S. Politics", "Basketball", and "COVID" (collect articles in that order)
Articles should be in English
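The constraints above can be expressed as a single filter. The following is a minimal sketch, not a definitive implementation: the function name, the inclusive reading of the month boundaries, and the "en" language code are assumptions that do not come from the spec.

```python
from datetime import date

# Categories in the collection order given above.
CATEGORIES = ("U.S. Politics", "Basketball", "COVID")

# Assumption: "no earlier than January 2019" and "no later than March 2021"
# are read as whole months, i.e. 2019-01-01 through 2021-03-31 inclusive.
EARLIEST = date(2019, 1, 1)
LATEST = date(2021, 3, 31)


def meets_constraints(published: date, category: str, language: str) -> bool:
    """Return True if an article satisfies the date, topic, and language rules."""
    return (
        EARLIEST <= published <= LATEST
        and category in CATEGORIES
        and language == "en"  # hypothetical ISO 639-1 code for English
    )
```

Applying this check before storing a record keeps out-of-range or off-topic results from any API out of the corpus.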
We define an article as:
An article title
An article summary
An article publication date
Source of the article (e.g. nytimes.com)
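One way to pin down this definition in code is a small record type. This is a sketch only: the class name, field names, and the example values are illustrative placeholders, not taken from any API.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class Article:
    """One article record with the four fields listed above."""
    title: str
    summary: str
    publication_date: date
    source: str  # e.g. "nytimes.com"


example = Article(
    title="Example headline",  # placeholder values, not real data
    summary="Example summary.",
    publication_date=date(2020, 3, 11),
    source="nytimes.com",
)
row = asdict(example)  # plain dict, convenient for writing a CSV row
```

Keeping the record immutable (frozen=True) means a row cannot be altered by accident between fetching and writing.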
Proposed approach
Register all API keys and store them in a 'config.py' file.
Create a script that iterates through each API and makes as many calls as each API's daily limit allows.
All the articles returned from the query should be stored in a CSV
The CSV should be a 5-column table (title, summary, publication date, source, query), where query is the category the article falls under (e.g. U.S. Politics, Basketball, Covid).
The CSV should support continuous appending and be easy to read.
Parameterize the number of calls.
Consider using Pandas to hold articles in memory temporarily, and use the DataFrame.to_csv() method to write them out as CSV.
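The approach above can be sketched as a small collection loop. To stay dependency-free, this sketch uses the standard-library csv module in place of Pandas; the Pandas equivalent of the append step is DataFrame.to_csv(path, mode="a", header=False, index=False). The fetcher signature, the function names, and the calls_per_query parameter are hypothetical, and real API clients and keys from config.py are not shown.

```python
import csv
import os
from typing import Callable, Dict, Iterable, List

COLUMNS = ["title", "summary", "publication_date", "source", "query"]
QUERIES = ["U.S. Politics", "Basketball", "Covid"]  # collected in this order

# Hypothetical fetcher signature: takes a query, returns a list of article dicts.
Fetcher = Callable[[str], List[Dict[str, str]]]


def append_rows(path: str, rows: Iterable[Dict[str, str]]) -> None:
    """Append rows to the CSV, writing the header only when the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


def collect(fetchers: Dict[str, Fetcher], path: str, calls_per_query: int) -> int:
    """Call every API calls_per_query times for each query; return total calls."""
    calls_made = 0
    for query in QUERIES:  # all "U.S. Politics" calls first, then the rest
        for fetch in fetchers.values():
            for _ in range(calls_per_query):
                articles = fetch(query)
                calls_made += 1
                # Tag each article with the query (category) it was found under.
                append_rows(path, [{**a, "query": query} for a in articles])
    return calls_made
```

Writing after every call, rather than once at the end, means a run that stops at a rate limit still leaves a readable CSV that the next run can extend.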
Functional Representation
Input: Query
Output: CSV of Articles Data
Function: Call the APIs
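Read literally, this representation is a function from a query to CSV data. A minimal sketch under that reading follows; the call_apis parameter is a hypothetical stand-in for whatever code actually queries the APIs, and the field names are assumptions.

```python
import csv
import io
from typing import Callable, Dict, List

FIELDS = ["title", "summary", "publication_date", "source", "query"]


def f(query: str, call_apis: Callable[[str], List[Dict[str, str]]]) -> str:
    """Map a query to CSV text by calling the APIs and serializing the results."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for article in call_apis(query):
        # Record which query produced each article.
        writer.writerow({**article, "query": query})
    return buffer.getvalue()
```

Passing the API caller in as an argument keeps the function testable with a stub, without network access or API keys.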
Problem Definition
Given the following APIs:
f(query) = csv, where query is one of {'U.S. Politics', 'Basketball', 'Covid'}