mediacloud / directory-issues

UNDER CONSTRUCTION - A package containing a library of issue validators in a flexibly deployable wrapper.

Scrape local US sources from Local News website #3

Open rahulbot opened 1 week ago

rahulbot commented 1 week ago

The local news initiative website has a robust list of news sources in states across the US, including county data (recently updated and republished). It'd be helpful for us to have that list to potentially create county-level collections, even though there aren't any URLs. Since they have an API-backed service, and no batch download, it might be relatively easy to scrape the data.

Sample URL: https://www.northwesternlni.com:8068/lni/localnewstable?state=MA&county=Hampshire&year=2024

Sample JSON:

[
  {
    "id": 44612,
    "state": "MA",
    "county": "Hampshire",
    "mediaName": "Amherst Bulletin",
    "mediaType": "Newspaper",
    "yearLoaded": "2024"
  },
  {
    "id": 44613,
    "state": "MA",
    "county": "Hampshire",
    "mediaName": "Daily Hampshire Gazette",
    "mediaType": "Newspaper",
    "yearLoaded": "2024"
  },
  {
    "id": 44614,
    "state": "MA",
    "county": "Hampshire",
    "mediaName": "Valley Advocate",
    "mediaType": "Newspaper",
    "yearLoaded": "2024"
  },
...

The task here would be to build a scraper, perhaps in a Jupyter notebook, that pulls all the data into a CSV. Then we can review and decide what we might want to do with it.

pgulley commented 2 days ago

@m453h - This would be super helpful to have for the directory health team. Some goal deliverables would be:

  1. A JSON dump replicating LNI's county/state news collections
  2. A comparison against our local news collections in directory.mediacloud.org: first, whether we index a given local news site at all; second, whether it belongs to the appropriate collection. The second step should be automatable with the Directory API.
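One way the comparison could start is crude name matching between the LNI dump and a source list pulled from the directory. This is only a sketch: the input lists and the normalization rule are assumptions, and a real pass would use the Directory API client plus fuzzier matching (LNI gives no URLs, so names are the only join key):

```python
# Sketch: loose name-based comparison of LNI records against directory sources.
# Assumptions: both inputs are plain lists of source names; real matching
# would need to handle abbreviations, "The ..." prefixes, renames, etc.
import re

def normalize(name: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace for loose matching."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9 ]", "", name.lower())).strip()

def compare(lni_names: list[str], directory_names: list[str]) -> tuple[list[str], list[str]]:
    """Split LNI names into (matched, missing) against the directory list."""
    index = {normalize(n) for n in directory_names}
    matched = [n for n in lni_names if normalize(n) in index]
    missing = [n for n in lni_names if normalize(n) not in index]
    return matched, missing

# Example using names from this issue's sample JSON (directory list is made up):
lni = ["Amherst Bulletin", "Daily Hampshire Gazette", "Valley Advocate"]
directory = ["Daily Hampshire Gazette", "The Boston Globe"]
matched, missing = compare(lni, directory)
# matched -> ["Daily Hampshire Gazette"]; missing -> the other two
```

The "missing" bucket would feed step 2's collection-membership check once the sources are confirmed to exist in the directory.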