
data-covid19-sfbayarea

Processes for sourcing data for the Stop COVID-19 SF Bay Area Pandemic Dashboard. You can find the dashboard’s source code in the sfbrigade/stop-covid19-sfbayarea project on GitHub.

We are looking for feedback! Did you come here looking for a data API? Do you have questions, comments, or concerns? Don't leave yet - let us know how you are using this project and what you'd like to see implemented. Please leave us your two cents over in Issues under #101 Feedback Mega Thread.

Installation

This project requires Python 3 to run. It was built with version 3.8.6 and may run with other versions, but it takes advantage of assignment expressions, which are only available in Python 3.8 and later. To install this project, run ./install.sh in your terminal. This will set up the virtual environment and install all of the dependencies from requirements.txt and requirements-dev.txt. However, it will not keep the virtual environment active when the script ends; if you want to stay in the virtual environment, run source env/bin/activate separately after the install script.
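
For example, a typical first-time setup from the root of a fresh clone might look like this:

# Set up the virtual environment and install dependencies:
$ ./install.sh

# Re-enter the virtual environment after the install script exits:
$ source env/bin/activate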

Running the scraper

This project includes four separate scraping tools for different purposes:

  • Legacy CDS Scraper
  • County Website Scraper
  • County News Scraper
  • Hospitalization Data Scraper

You can also run each of these tools in Docker. See the “Using Docker” section below.

Legacy CDS Scraper

The Legacy CDS Scraper loads Bay Area county data from the Corona Data Scraper project. Run it by typing into your terminal:

$ ./run_scraper.sh

This takes care of activating the virtual environment and running the actual Python scraping script. If you are managing your virtual environments separately, you can run the Python script directly with:

$ python3 scraper.py

County Website Scraper

The newer county website scraper loads data directly from county data portals or by scraping counties’ public health websites. Running the shell script wrapper will take care of activating the virtual environment for you, or you can run the Python script directly:

# Run the wrapper:
$ ./run_scraper_data.sh

# Or run the script directly if you are managing virtual environments yourself:
$ python3 scraper_data.py

By default, it will output a JSON object with data for all currently supported counties. Use the --help option to see information about additional arguments (the same options also work when running the Python script directly):

$ ./run_scraper_data.sh --help
Usage: scraper_data.py [OPTIONS] [COUNTY]...

  Create a .json with data for one or more counties. Supported counties:
  alameda, san_francisco, solano.

Options:
  --output PATH  write output file to this directory
  --help         Show this message and exit.
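
For example, to scrape data for just San Francisco and write the resulting JSON file to a local directory (the directory name here is only an example), you could run:

# Scrape San Francisco only and write the output to ./outputs
# (./outputs is just an example path)
$ ./run_scraper_data.sh san_francisco --output outputs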

County News Scraper

The news scraper finds official county news, press releases, etc. relevant to COVID-19 and formats it as news feeds. Running the shell script wrapper will take care of activating the virtual environment for you, or you can run the Python script directly:

# Run the wrapper:
$ ./run_scraper_news.sh

# Or run the script directly if you are managing virtual environments yourself:
$ python3 scraper_news.py

By default, it will output a series of JSON Feed-formatted JSON objects, one for each county. Use the --help option to see information about additional arguments (the same options also work when running the Python script directly):

$ ./run_scraper_news.sh --help
Usage: scraper_news.py [OPTIONS] [COUNTY]...

  Create a news feed for one or more counties. Supported counties: alameda,
  contra_costa, marin, napa, san_francisco, san_mateo, santa_clara, solano,
  sonoma.

Options:
  --from CLI_DATE                 Only include news items newer than this
                                  date. Instead of a date, you can specify a
                                  number of days ago, e.g. "14" for 2 weeks
                                  ago.

  --format [json_feed|json_simple|rss]
  --output PATH                   write output file(s) to this directory
  --help                          Show this message and exit.
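
For example, to build an RSS feed of Alameda County news from the last two weeks and save it to a local directory (the directory name is only an example), you could run:

# Fetch only Alameda County items newer than 14 days ago, format them as RSS,
# and write the feed to ./feeds (an example path)
$ ./run_scraper_news.sh alameda --from 14 --format rss --output feeds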

Hospitalization Data Scraper

The hospitalization data scraper pulls down COVID-19-related hospitalization statistics at the county level from the California Department of Public Health via its CKAN API. To run the scraper, execute the following command in your terminal:

$ ./run_scraper_hospital.sh

By default, this will print time-series data in JSON format to stdout for all nine Bay Area counties, following the structure described in the data model documentation.

Data for all California counties is also available; to select a specific county or list of counties, add them as arguments when running the script. County names should be lowercase, with underscores in place of spaces:

$ ./run_scraper_hospital.sh alameda los_angeles mendocino

You may also pass an --output flag followed by the path to the directory where you would like the JSON data to be saved. If the directory does not exist, it will be created. The data will be saved as hospital_data.json.
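
For example, to fetch hospitalization data for Alameda County only and save it to a data directory (the directory name is only an example), you could run:

# Fetch Alameda County data and save it as data/hospital_data.json
# (the data directory will be created if it does not exist)
$ ./run_scraper_hospital.sh alameda --output data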

Using Docker

As an alternative to installing and running the tools normally, you can use Docker to install and run them. This is especially helpful on Windows, where setting up Selenium and the other Linux tools the scraper relies on can be complicated.

  1. Download and install Docker from https://www.docker.com/ (You’ll probably need to create a Docker account as well if you don’t already have one.)

  2. Now run any of the tools by adding their command after ./run_docker.sh. For example, to run the county website scraper:

    $ ./run_docker.sh python scraper_data.py

    Under the hood, this builds the Docker container and then runs the specified command in it.

    Docker acts kind of like a virtual machine, and you can also simply get yourself a command prompt inside the Docker container by running ./run_docker.sh with no arguments:

    $ ./run_docker.sh
    # This will output information about the build, and then give you a
    # command prompt:
    root@ca87fa64d822:/app#
    
    # You can now run commands like the data scraper as normal from the prompt:
    root@ca87fa64d822:/app# python scraper_data.py
    root@ca87fa64d822:/app# python scraper_news.py

Data Models

The data models are in JSON format and are located in the data_models directory. For more information, see the data model readme.

Development

We use CircleCI to lint the code and run tests in this repository, but you can (and should!) also run tests locally.

The commands described below should all be run from within the virtual environment you’ve created for this project. If you used install.sh to get set up, you’ll need to activate your virtual environment first by running:

$ source env/bin/activate

If you manage your environments differently (e.g. with Conda or pyenv-virtualenv), activate your environment however you normally would.

Tests

You can run tests using pytest like so:

# In the root directory of the project:
$ python -m pytest -v .

Some tests run against live websites and can be slow (or worse: they might spam a county's server with requests and get your IP address blocked), so they are disabled by default. To run them, set the LIVE_TESTS environment variable. It can be '*' to run live tests against all counties, or a comma-separated list of counties to test.

# Run live tests against all county websites.
$ LIVE_TESTS='*' python -m pytest -v .

# Run live tests against only San Francisco and Sonoma counties.
$ LIVE_TESTS='san_francisco,sonoma' python -m pytest -v .

Linting and Code Conventions

We use Pyflakes for linting. Many editors have support for running it while you type (either built-in or via a plugin), but you can also run it directly from the command line:

# In the root directory of the project:
$ pyflakes .

We also use type annotations throughout the project. To check their validity with Mypy, run:

# In the root directory of the project:
$ mypy .

Reviewing and Merging Pull Requests

  1. PRs that are hotfixes do not require review.

    • Hotfixes repair broken functionality that was previously vetted; they do not add functionality. For these PRs, please feel free to request a review from one or more people.
    • If you are requested to review a hotfix, note that the first priority is to make sure the output is correct. "Get it working first, make it nice later." You do not have to be an expert in the function's history, nor understand every line of the diff. If you can verify whether the output is correct, you are qualified and encouraged to review a hotfix!
    • If no reviewers respond within 2 days, please merge in your PR yourself.
    • Examples of hotfixes are:
      1. Fixing broken scrapers
      2. Fixing dependencies - libraries, virtual environments, etc.
      3. Fixing the GitHub Actions that run the scrapers, and fixing CircleCI
  2. PRs that add functionality/features require at least 1 passing review.

    • If you are adding functionality, please explicitly require a review from at least one person.
    • When at least one person has approved the PR, the author of the PR is responsible for merging it in. You must have 1+ approving reviews to merge, but you don't need all requested reviewers to approve.
    • If you are one of the people required for review, please either complete your review within 3 days, or let the PR author know you are unavailable for review.
    • Examples of PRs that add functionality are:
      1. Adding new scrapers
      2. Structural refactors, such as changing the data model or a substantial rewrite of an existing scraper
  3. PRs that update the documentation require at least 1 passing review.

    • Documentation PRs are in the same tier as #2. Please explicitly require a review from at least one person.
    • When at least one person has approved the PR, the author of the PR is responsible for merging it in. You must have 1+ approving reviews to merge, but you don't need all requested reviewers to approve.
    • If you are one of the people required for review, please either complete your review within 3 days, or let the PR author know you are unavailable for review.
    • Examples are:
      1. Updates to the data fetch README
      2. Commenting code
      3. Adding to metadata
  4. Reviewers

    1. Everyone can review #1 hotfixes or #3 documentation. If you want to proactively sign up to be first-string for these reviews, please add your GitHub handle to the list below.

      • @elaguerta
      • @benghancock
    2. Experienced developers with deep knowledge of the project should be tapped for PRs that deal with complicated dependencies, language-specific implementation questions, or structural/architectural concerns. If you want to be first-string for these reviews, please add your GitHub handle to the list below.

      • @Mr0grog
      • @rickpr
      • @ldtcooper
    3. People who have an interest in data, public health, and social science should be tapped for PRs that deal with decisions that affect how data is reported, structured, and provided to the user. If you want to be first-string for these reviews, please add your GitHub handle to the list below.

      • @elaguerta
      • @benghancock
      • @ldtcooper