nytimes / covid-19-data

A repository of data on coronavirus cases and deaths in the U.S.
https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html

Data Issue: Wayne County, NC cases don't match NC DHHS website #669

Closed mikelehen closed 2 years ago

mikelehen commented 2 years ago

Describe the issue:

Fuller details

https://www.nytimes.com/interactive/2021/us/wayne-north-carolina-covid-cases.html shows a recent spike:

image


https://covid19.ncdhhs.gov/dashboard/cases-and-deaths shows no spike:

image

Is there a way to know exactly where NYT is sourcing data for a particular location? The "About this data" just says "In data for North Carolina, The Times primarily relies on reports from the state, as well as health districts or county governments that often report ahead of the state."

My guess is that NC has recently reported a backlog of cases in their cumulative count but their dashboard is somehow filtering it out or hasn't picked it up yet (perhaps because it's based on specimen collection date?).

Any help / clarity you can provide for understanding these sorts of issues would be helpful. We (https://covidactnow.org/) rely on NYT case data and increasingly get questions like, "Why does your data not match Xyz state dashboard?" and it is hard to know how to answer them. Thanks!

albertsun commented 2 years ago

Hi @mikelehen, yes, this is a difference in how cases are dated: by specimen collection date versus by date reported. We always use date reported.
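
To illustrate how the two dating conventions can produce different curves from the same cases, here is a minimal sketch. The records below are made up for illustration and are not taken from NC DHHS or NYT data; `daily_counts` is a hypothetical helper, not part of either pipeline.

```python
from collections import Counter
from datetime import date

# Hypothetical case records: (specimen_collection_date, date_reported).
# Three old cases from December are only reported on March 1 (a backlog).
cases = [
    (date(2021, 12, 20), date(2022, 3, 1)),
    (date(2021, 12, 21), date(2022, 3, 1)),
    (date(2021, 12, 21), date(2022, 3, 1)),
    (date(2022, 2, 27), date(2022, 3, 1)),
    (date(2022, 2, 28), date(2022, 3, 2)),
]

def daily_counts(records, index):
    """Count cases per day using the date stored at position `index`."""
    counts = Counter(record[index] for record in records)
    return dict(sorted(counts.items()))

by_specimen = daily_counts(cases, 0)  # backlog lands back in December
by_reported = daily_counts(cases, 1)  # backlog shows up as a spike on March 1

for label, counts in (("specimen date", by_specimen), ("date reported", by_reported)):
    print(label)
    for day, n in counts.items():
        print(f"  {day.isoformat()}: {n}")
```

Both series sum to the same cumulative total; only the day each case is attributed to differs, which is why one chart can show a recent spike while the other does not.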

In just about all cases we also use data from both the state department of health and the county department of health, if available (it's not available here), and take whichever source has the more up-to-date cumulative total matching our definitions.

mikelehen commented 2 years ago

Thanks @albertsun! In this case, I suspect they're backdating a lot of cases then, since there's no sign of a spike starting in their graph, and their test positivity has been consistently low.

When you say "it's not available here" (in reference to the county data), how did you determine that? Is that based on an internal catalog of data sources NYT uses? Or is there some way I could determine that myself? FWIW, there is actually a county dashboard, but it has ~2000 fewer cumulative cases than the state dashboard, so I don't know if it would be useful. Just mentioning it as an observation.

albertsun commented 2 years ago

Yes that's from our internal database of sources that we gather data from. That is messy enough/changing constantly enough that we do not provide it publicly. Probably the most reliable way to tell which of multiple sources we are using is to compare the total cumulative figure to see if that matches a source.
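
The comparison suggested above could be sketched roughly as follows. The function name `matching_sources`, the `tolerance` parameter, and all numbers are assumptions made up for illustration; they are not real totals for Wayne County and not part of any NYT tooling.

```python
def matching_sources(published_total, source_totals, tolerance=0):
    """Return the names of sources whose cumulative total is within
    `tolerance` cases of the published cumulative total."""
    return [
        name
        for name, total in source_totals.items()
        if abs(total - published_total) <= tolerance
    ]

# Hypothetical cumulative case totals read off each dashboard on the same day.
source_totals = {
    "state dashboard": 30_000,
    "county dashboard": 28_000,
}

# Hypothetical cumulative total from the published dataset for that day.
published_total = 30_000

print(matching_sources(published_total, source_totals))
```

A small nonzero `tolerance` may be useful when the dashboards and the published data are captured at slightly different times of day.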

mikelehen commented 2 years ago

Just to flag, the data continues to be quite divergent between https://covid19.ncdhhs.gov/dashboard/cases-and-deaths and https://www.nytimes.com/interactive/2021/us/wayne-north-carolina-covid-cases.html:

image image

Offhand I don't think the differing methodology should result in this kind of discrepancy unless there's something else weird going on (e.g. Wayne County is continually reporting cases that are backdated to a previous surge from months ago).

albertsun commented 2 years ago

I think the backdating of cases to a ways back is a somewhat regular occurrence, and I wouldn't be surprised if that were happening with cases pushed back to the most recent Omicron spike. Another cause here could be the averaging we're applying, which smooths the curve over irregular reporting periods.
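
As a rough illustration of the smoothing effect, the sketch below spreads a single-day backlog across a trailing average. The 7-day window is an assumption (the thread does not state the window used), the cumulative series is made up, and `daily_new` and `trailing_mean` are hypothetical helpers.

```python
def daily_new(cumulative):
    """Convert a cumulative series into day-over-day new cases."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]

def trailing_mean(values, window=7):
    """Trailing moving average; early entries use however many days exist."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical cumulative counts: flat, then a 70-case backlog reported in one day.
cumulative = [100] * 8 + [170] * 10

new_cases = daily_new(cumulative)
smoothed = trailing_mean(new_cases, window=7)

for day, (raw, avg) in enumerate(zip(new_cases, smoothed)):
    print(f"day {day:2d}: new={raw:3d}  7-day avg={avg:5.1f}")
```

In this made-up series the one-day dump of 70 cases appears as 10 cases per day for seven days in the averaged curve, so a backlog reported by date can look like a week-long bump even though a chart by specimen date would place those cases elsewhere.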