nytimes / covid-19-data

A repository of data on coronavirus cases and deaths in the U.S.
https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html

Data Issue: Clark County, WA (and probably all of WA state) #667

Closed mdoggydog closed 2 years ago

mdoggydog commented 2 years ago

Describe the issue:

Fuller details

Sites such as CovidActNow (https://covidactnow.org/us/washington-wa/county/clark_county/?s=31121951) are reporting a daily case rate of 21.3 per 100k for Clark County, WA, whereas the county itself (https://clark.wa.gov/public-health/covid-19-data) is reporting something more like 3.8 daily (53.9 cases per 100k over 14 days). The key to the discrepancy lies in a footnote on the county's website: "Washington Department of Health continues to clear a backlog of cases caused by reporting delays during the omicron surge. As a result, some cases added to the total this week may have occurred earlier."
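To make the comparison concrete, here is a small sketch of the arithmetic behind the two figures. The function name `daily_rate_from_window` is purely illustrative; the only inputs are the numbers quoted above (21.3 per 100k per day from CovidActNow, 53.9 per 100k over 14 days from the county).

```python
def daily_rate_from_window(rate_per_100k: float, window_days: int = 14) -> float:
    """Average daily rate per 100k implied by a rate accumulated over window_days."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return rate_per_100k / window_days


# County figure: 53.9 cases per 100k over 14 days.
county_daily = daily_rate_from_window(53.9)

# CovidActNow figure: already expressed per day.
covidactnow_daily = 21.3

# How many times larger the CovidActNow rate is than the county's own rate.
ratio = covidactnow_daily / county_daily

print(f"County daily rate: {county_daily:.2f} per 100k")
print(f"CovidActNow / county: {ratio:.1f}x")
```

In other words, the two sources differ by a factor of roughly five to six for the same county and the same period.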

That footnote applies to the "Total number of cases (confirmed & probable)" figure on the county website, a number which is updated every couple of days and which appears to correspond to the cumulative case counts published here in the NYT data set. However, note that this number distorts the time series of the case counts, since cases can be added to the cumulative count long after they occurred.

Washington state actually publishes two complete spreadsheets with case count data. One lists daily totals and cumulative counts for cases by the day on which test results are entered into the system. This corresponds to the NYT data (although with per-day granularity rather than every-other-day granularity). The other spreadsheet lists daily case totals by the day of specimen collection, and this is the time series used by the county itself to report current case rates, as it more closely reflects when the reported cases actually occurred.
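The difference between the two indexing schemes can be illustrated with a toy example. Everything below is made up for illustration (the dates, the counts, and the `daily_counts` helper are assumptions, not values from the Washington spreadsheets): the same set of cases is tallied once by specimen collection date and once by the date the result was entered into the system.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical records: (specimen_collection_date, date_entered_into_system).
surge_start = date(2022, 1, 10)
backlog_day = date(2022, 3, 1)

records = []
for offset in range(5):
    collected = surge_start + timedelta(days=offset)
    # Half of each day's 50 cases are logged promptly, two days after collection...
    records += [(collected, collected + timedelta(days=2))] * 25
    # ...and the other half sit in a backlog that is cleared on a single later day.
    records += [(collected, backlog_day)] * 25


def daily_counts(records, index):
    """Count cases per day, keyed on the chosen date column (0 or 1)."""
    return Counter(record[index] for record in records)


by_collection = daily_counts(records, 0)  # when the cases occurred
by_report = daily_counts(records, 1)      # when the cases were logged

print("Cases on backlog day, by collection date:", by_collection[backlog_day])
print("Cases on backlog day, by report date:", by_report[backlog_day])
```

Both series sum to the same total, but only the report-date series shows a spike on the day the backlog is cleared, even though no specimens were collected that day.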

Please consider switching the NYT data to use the "by specimen collection date" data/methodology for Washington state, rather than the "by test result logged in system date". The current NYT data is severely distorting the picture for Clark County, making it look like there is a secondary peak of infections, when it is really just a result of catching up on a reporting backlog.

I will attach the current (as of this post) versions of each spreadsheet from Washington state. I do not have links for them, but they can be downloaded via buttons on the "Epidemiological Curves" and "Cumulative Counts" tabs in the "Washington Department of Health COVID-19 data dashboard" embedded in the Clark County data webpage (https://clark.wa.gov/public-health/covid-19-data).

mdoggydog commented 2 years ago

"By the day on which test results are entered into the system":

Cumulative_Count_Cases_Hospitalizations_Deaths_Vaccinations.xlsx

"By the day of specimen collection":

EpiCurve_Count_Cases_Hospitalizations_Deaths.xlsx

tiffehr commented 2 years ago

@mdoggydog Thank you for the detailed Issue. We are not likely to change our methodology for Washington state or Clark County. Our 2+-year-long methodology is "cases and deaths are counted on the date they are first announced" by the local health department (county or state). As we note in our README, "the methodology of individual states changes frequently." With some states and counties back-dating cumulative cases/deaths after they are announced, we simply don't have the staffing to keep up with continually revising case histories per county.

We do review our figures on a daily basis, looking for anomalies and marking them as such in our public list. Anomalies are excluded from our moving average calculations. Our own visualizations focus on our rolling average, due to these differences in methodology and reporting cadence for local or state health departments across the U.S.
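For readers who want to see what excluding anomalies from a moving average can look like, here is a minimal sketch. It is an assumption about one reasonable way to do it, not a description of the NYT's actual procedure: the 7-day trailing window, the `rolling_average` helper, and the sample numbers are all hypothetical.

```python
def rolling_average(daily, anomalous, window=7):
    """Trailing average over `window` days, skipping days flagged as anomalous.

    `daily` is a list of daily counts; `anomalous` is a set of indices to skip.
    Returns None for a day whose window contains no usable values.
    """
    averages = []
    for i in range(len(daily)):
        start = max(0, i - window + 1)
        values = [daily[j] for j in range(start, i + 1) if j not in anomalous]
        averages.append(sum(values) / len(values) if values else None)
    return averages


# Hypothetical daily counts; index 5 is a backlog dump flagged as an anomaly.
daily = [10, 12, 11, 9, 10, 125, 11, 10]

with_exclusion = rolling_average(daily, anomalous={5})
naive = rolling_average(daily, anomalous=set())

print("Last-day average, anomaly excluded:", with_exclusion[-1])
print("Last-day average, anomaly included:", round(naive[-1], 1))
```

Skipping the flagged day keeps a single backlog dump from inflating the average for the rest of the window. It does not redistribute those cases to the days on which they occurred.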