samirchowd / NewsChain

CS 329 Project

scrape.py #2

Open samirchowd opened 3 years ago

samirchowd commented 3 years ago

Problem Definition

Given the following APIs:

Generate a corpus of articles with the following constraints:

We define an article as a tuple of (title, summary, publication date, source):

Proposed approach

  1. Register all API keys and store them in a `config.py` file.
  2. Create a script that iterates through each API and maximizes its daily API call limit (see the loop sketch below).
  3. All the articles returned from each query should be stored in a CSV.
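
A minimal sketch of steps 2 and 3, assuming a placeholder `fetch_articles` per API and purely illustrative daily quotas (the real limits depend on which APIs make the list above):

```python
# scrape.py -- sketch of the per-API loop; fetch_articles is a hypothetical
# stand-in for each API's client code, and the quota numbers are illustrative.
import csv

DAILY_LIMITS = {"nytimes": 500, "newsapi": 100}  # illustrative, not real quotas
QUERIES = ["U.S. Politics", "Basketball", "Covid"]

def fetch_articles(api_name, query, limit):
    """Placeholder: call `api_name` with `query`, return up to `limit` article dicts."""
    return []  # each dict should carry title, summary, date, source

def scrape_all(out_path="articles.csv"):
    # Append mode so the CSV can be written to continuously across runs.
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for api_name, limit in DAILY_LIMITS.items():
            per_query = limit // len(QUERIES)  # spread the daily quota across queries
            for query in QUERIES:
                for art in fetch_articles(api_name, query, per_query):
                    writer.writerow([art["title"], art["summary"],
                                     art["date"], art["source"], query])
```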

The CSV should be a five-column table (title, summary, publication date, source, query), where the query is the category the article falls under (e.g., U.S. Politics, Basketball, Covid).

The CSV should support continuous appending and be easy to read back.

Parameterize the number of API calls.

Consider using pandas to hold articles in memory temporarily, then use `DataFrame.to_csv()` to write them out quickly.
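
A minimal sketch of that pandas route, assuming the five-column schema above (`flush_batch` and the column names are hypothetical):

```python
import pandas as pd

COLUMNS = ["title", "summary", "publication_date", "source", "query"]

def flush_batch(articles, out_path="articles.csv", first_write=False):
    """Append a batch of article dicts held in memory to the CSV on disk."""
    df = pd.DataFrame(articles, columns=COLUMNS)
    # mode="a" gives the continuous-write behavior; write the header only once.
    df.to_csv(out_path, mode="w" if first_write else "a",
              header=first_write, index=False)
```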

Functional Representation

Input: query
Output: CSV of article data
Function: call the APIs

f(query) = csv, where query ∈ {'U.S. Politics', 'Basketball', 'Covid'}
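
Read as a signature, the same mapping could look like this (a stub; the return type and category literals just mirror the set above):

```python
from typing import Literal

Query = Literal["U.S. Politics", "Basketball", "Covid"]

def f(query: Query) -> str:
    """Return the path of the CSV holding articles matching `query` (stub)."""
    return "articles.csv"
```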

ebdelca commented 3 years ago

Sample output from the NYTimes API on the search term "covid 19". Format: headline, abstract, author, publication date, source, url. Attachment: output.zip
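
For reference, a hedged sketch of the request that could produce output in that format, assuming the public NYTimes Article Search v2 endpoint (`API_KEY` is a placeholder):

```python
import requests

API_KEY = "..."  # placeholder; use the key registered in config.py
URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

resp = requests.get(URL, params={"q": "covid 19", "api-key": API_KEY})
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    # Fields matching the sample format: headline, abstract, author,
    # publication date, source, url.
    print(doc["headline"]["main"], doc.get("abstract"),
          (doc.get("byline") or {}).get("original"),
          doc.get("pub_date"), doc.get("source"), doc.get("web_url"),
          sep=" | ")
```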
