sfbrigade / stop-covid19-sfbayarea

Publish COVID-19 data and FAQ local to the Bay Area
https://panda.baybrigades.org/
MIT License

Santa Clara County chart does not match public health department dashboard #92

Open 1ec5 opened 4 years ago

1ec5 commented 4 years ago

The Santa Clara County bar chart under the Stats tab displays daily and total figures that differ from the official Santa Clara County Public Health Department dashboard.

Steps to reproduce

  1. Go to the Stats tab. Set the toggle to Daily.
  2. Hover over March 30. Observe that 202 new cases are attributed to that day.
  3. Go to the official county dashboard.
  4. In the first dashboard, right-click on the “New Cases by Specimen Collection Date” bar chart and choose “Show as table”.
  5. Scroll down and observe that only 77 cases are attributed to March 30.

Expected behavior

I’d expect the two sources to match, assuming the Santa Clara County Public Health Department and CalREDIE are the ultimate source of this data. Otherwise, if the data is coming from another source, it would be great if that source were easier to identify.

Screenshots

stop-covid19-sfbayarea: Stop Coronavirus in the Bay Area

Santa Clara County Public Health Department: Coronavirus (COVID-19) Data Dashboard - Novel Coronavirus (COVID-19) - County of Santa Clara

1ec5 commented 4 years ago

The application cites Corona Data Scraper, which fetches figures from this spreadsheet by The Mercury News. The spreadsheet in turn cites the Santa Clara County Public Health Department as its source.

As far as I can tell, Corona Data Scraper is simply fetching the latest day’s total case count and adding that as an entry in the database. That is an important statistic, but a time series based on it reflects when results were reported rather than when people were tested, so it is skewed by delays in testing.
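To illustrate the report-date approach described above, here is a minimal Python sketch. It is not Corona Data Scraper's actual code; the function name and the numbers are made up for illustration. It derives a "daily" figure by differencing consecutive cumulative totals, which is why every newly reported case lands on the day it was reported.

```python
from datetime import date


def daily_from_cumulative(totals):
    """Turn {report_date: cumulative_total} into {report_date: new_cases}.

    The first date has no predecessor, so it gets no daily figure.
    """
    daily = {}
    previous = None
    for day in sorted(totals):
        if previous is not None:
            daily[day] = totals[day] - previous
        previous = totals[day]
    return daily


# Hypothetical cumulative totals, NOT real county figures.
reported_totals = {
    date(2020, 3, 28): 100,
    date(2020, 3, 29): 130,
    date(2020, 3, 30): 190,
}

new_by_report_date = daily_from_cumulative(reported_totals)
```

With these made-up inputs, March 30 is credited with every case announced that day, regardless of when the specimens were actually collected.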

In mid-April, the county changed its methodology to report historical case counts. Every day, they retroactively update as many as 40 past days to reflect how many tests were taken on a given day that later came back positive. This way the curve more accurately depicts the rate of (confirmed) infection over time. On the other hand, it can be tedious to keep track of so many numbers, and a couple dozen cases are undated at any given time and can’t be represented in the time series at all.
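A rough Python sketch of what tracking the county's methodology involves: each day's dashboard snapshot overwrites the counts for past specimen collection dates, and whatever remains between the headline total and the dated counts is undated. The function name, the snapshot shape, and all numbers are hypothetical, and the sketch assumes the headline total includes the undated cases.

```python
def apply_snapshot(series, snapshot):
    """Overwrite per-day counts with the latest snapshot.

    Both arguments map an ISO date string to a case count by specimen
    collection date. Returns the sorted list of dates whose count changed
    or was newly added.
    """
    changed = []
    for day, count in snapshot.items():
        if series.get(day) != count:
            series[day] = count
            changed.append(day)
    return sorted(changed)


# Hypothetical data, NOT real county figures.
series = {"2020-04-10": 50, "2020-04-11": 40, "2020-04-12": 12}
snapshot = {"2020-04-11": 43, "2020-04-12": 30, "2020-04-13": 9}
headline_total = 160  # assumed to include undated cases

changed = apply_snapshot(series, snapshot)

# Cases in the headline total that have no specimen collection date yet.
undated = headline_total - sum(series.values())
```

In this example two past days are revised upward and one new day is added, which is the bookkeeping that makes this approach tedious to maintain by hand.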

Which methodology is more appropriate for this application? I suppose consistency with other Bay Area counties is important for this website. It’s also worth noting that the county only provides historical data on case counts and not deaths, so it can be misleading to combine the two time series in a single chart. On the other hand, there’s a lot of value in seeing an accurate curve. (Santa Clara County’s curve is flatter than this application indicates.)

Over on Wikimedia Commons, I’ve been tracking Santa Clara County’s outbreak using the county’s preferred methodology, updating this table to power this chart on Wikipedia and possibly elsewhere. This script automatically generates an updated table to copy-paste into Commons. Hopefully this script will be useful to the project. (Apologies in advance for the obtuse jq usage.)

1ec5 commented 4 years ago

Same thing for San Francisco: covidatlas/coronadatascraper#1011.