nytimes / covid-19-data

A repository of data on coronavirus cases and deaths in the U.S.
https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html

Data Issue: 5/6 New York City stats deviation from norm seem quite excessive #261

Closed: bobket45 closed this issue 4 years ago

bobket45 commented 4 years ago

Describe the issue:

Fuller details


ddenenberg71 commented 4 years ago

It is explained here: https://github.com/nytimes/covid-19-data

jrminter commented 4 years ago

Please - if you are going to add the nebulous "probable deaths" do it as an additional column in the data set so we have both metrics

darrinhomme commented 4 years ago

Please - if you are going to add the nebulous "probable deaths" do it as an additional column in the data set so we have both metrics

At a minimum, don't pile old data into one day. Wait until you have the details.

trajanmcgill commented 4 years ago

As someone with an actively used visualization based on this data, one oriented toward identifying hotspots visually, I concur that adding this as a separate field, if possible, would be a much preferable way to go. One big reason is that neighboring jurisdictions reporting two very different metrics will make one's status look entirely out of proportion to the other's. I would rather have the ability to display "unknown" for the places where one or the other metric is unknown, so users know and can choose what they are looking at. With a "confirmed" column and a "probable" or "probable plus confirmed" column, with null values indicating where one or the other is not available, those of us mapping this data would be much better able to ensure users get as accurate an understanding as possible of what they are seeing. Thanks.
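A minimal sketch of how a consumer might read such a file, assuming a hypothetical layout in which an empty field means the metric was not reported. The column names `confirmed_deaths` and `probable_deaths`, the county names, and all values are invented for illustration and are not taken from the published data.

```python
import csv
import io
from typing import Dict, List, Optional

# Hypothetical sample: column names and numbers are made up for this sketch.
SAMPLE = """county,confirmed_deaths,probable_deaths
Alpha,120,30
Beta,45,
Gamma,,
"""


def parse_count(raw: str) -> Optional[int]:
    """Return an int, or None when the field is empty (metric not reported)."""
    raw = raw.strip()
    return int(raw) if raw else None


def label(value: Optional[int]) -> str:
    """Text to show on a map or table: the number, or 'unknown' for None."""
    return "unknown" if value is None else str(value)


def load_rows(text: str) -> List[Dict[str, object]]:
    """Parse the CSV text, keeping missing metrics as None instead of 0."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "county": rec["county"],
            "confirmed": parse_count(rec["confirmed_deaths"]),
            "probable": parse_count(rec["probable_deaths"]),
        })
    return rows


rows = load_rows(SAMPLE)
for row in rows:
    print(row["county"], label(row["confirmed"]), label(row["probable"]))
```

Keeping a missing value as `None` rather than 0 is what lets a map distinguish "no probable deaths" from "probable deaths not reported".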

MichaelGriebe commented 4 years ago

As someone who has been building a model estimating the effect of social distancing and masks on transmission rates, I agree: it would be great if the columns were separated (one for "confirmed" and the other for "probable"), or however you want to do it. I can subtract the probable counts myself if I know how many fall under the new definition.
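The subtraction described above could look like the following sketch. The function name and its arguments are hypothetical; it only assumes that a combined total and a probable count are both available for the same place and date.

```python
from typing import Optional


def confirmed_only(total: int, probable: Optional[int]) -> Optional[int]:
    """Recover the confirmed count from a combined total.

    Returns None when the probable count is not reported, since the
    confirmed share cannot be derived in that case.
    """
    if probable is None:
        return None
    if probable > total:
        raise ValueError("probable count exceeds combined total")
    return total - probable
```

For example, `confirmed_only(150, 30)` gives 120, while `confirmed_only(150, None)` gives `None` rather than silently treating the total as confirmed.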

sbulen commented 4 years ago

The massive spike in deaths isn't even depicted on the NYT site, which is supposed to be driven by this data: https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html#states

The jump of 5600+ in New York City deaths overnight is the outlier.

[image: no rolling average]

[image: new deaths, no rolling average]
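For readers working from cumulative counts, a minimal sketch of deriving daily new deaths and a trailing average is shown here. The function names and the window length are assumptions for illustration, and any input numbers would be the reader's own.

```python
from typing import List


def daily_new(cumulative: List[int]) -> List[int]:
    """Day-over-day differences of a cumulative series."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]


def trailing_mean(values: List[float], window: int = 7) -> List[float]:
    """Mean of the last `window` values up to and including each day.

    Early days use however many values are available so far.
    """
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

A trailing average spreads a one-day spike across the window, which is why such a jump is far more visible in an unsmoothed chart.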

albertsun commented 4 years ago

Hi folks, apologies for this. We realize a big one day spike like this is not great for anyone.

Unfortunately, a changeover in our internal data collection and processing system meant we had to roll this out today. We were left with three options: create a changeover like this, with one-day spikes in certain areas; introduce many revisions to past days' data, with potential inaccuracies; or not publish updated data. We chose to break the data on May 6th.

We are working as fast as we can and plan both to revise upward the number of NYC deaths on past days to smooth out this spike and to add more detailed data showing confirmed and probable cases separately.
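Until the past days are revised, a consumer of the data might flag one-day spikes like this and handle them separately. The sketch below is one possible heuristic, not anything the maintainers use: it compares each day's new deaths against the median of the preceding days, and the window length and threshold factor are arbitrary assumptions.

```python
from statistics import median
from typing import List


def flag_spikes(daily: List[int], window: int = 7, factor: float = 5.0) -> List[int]:
    """Return indexes of days whose value exceeds `factor` times the
    median of the preceding `window` days."""
    flagged = []
    for i in range(window, len(daily)):
        baseline = median(daily[i - window:i])
        # Skip zero baselines to avoid flagging every nonzero day.
        if baseline > 0 and daily[i] > factor * baseline:
            flagged.append(i)
    return flagged
```

Using the median rather than the mean keeps the spike itself from inflating the baseline for the days that follow it.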

Mopholo commented 4 years ago

Can you point us to this new way of doing things? It seems very weird to suddenly uncover all of these new deaths. I would like to see where this is coming from, along with an explanation, if you have links.

Can you also tell us which areas are problematic as of now so we can manually ignore them?

Yes, it's irritating, but at least your data, up to now, has not been riddled with massive problems like other datasets I have seen, so kudos for that.

albertsun commented 4 years ago

To add a little more detail:

Our goal is to have separate columns for confirmed, probable and total cases, following the definition for probable put forward by the Council of State and Territorial Epidemiologists, which more places are adopting. However, adoption has proceeded at different rates, and not all places have always used the same definition or even made clear what the numbers they publish represent. And because many of our state- and national-level figures are adjusted or summed based on county-level reports, we are now running into issues summing numbers with different definitions.

Some, but not all, of these inconsistencies are in the Geographic Exceptions section of the README already. It's a quickly changing situation and we have not always been able to keep up.

We'd like to be sure that we are following the definition we have for cases and deaths as closely as possible, but that's going to take more historical research to determine when the places we've been gathering data from changed definitions.
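The summing problem described above can be illustrated with a small sketch. Everything here is hypothetical: the `CountyReport` type, the definition labels, and the idea of tagging each report with its definition are assumptions for illustration, not a description of the actual pipeline.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CountyReport:
    county: str
    deaths: int
    # Hypothetical labels: "confirmed" or "confirmed+probable".
    definition: str


def state_total(reports: List[CountyReport]) -> Tuple[int, bool]:
    """Sum county deaths and report whether the sum mixes definitions."""
    total = sum(r.deaths for r in reports)
    definitions = {r.definition for r in reports}
    return total, len(definitions) > 1
```

A total returned with the mixed flag set is neither a confirmed count nor a confirmed-plus-probable count, which is the inconsistency a separate column per definition would avoid.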

jrminter commented 4 years ago

Thank you for a helpful explanation.

On Thu, May 7, 2020 at 6:54 PM Albert Sun wrote:

To add a little more detail:

Our goal is to have separate columns for confirmed, probable and total cases, following the definition for probable put forward by the Council of State and Territorial Epidemiologists (https://cdn.ymaws.com/www.cste.org/resource/resmgr/2020ps/Interim-20-ID-01_COVID-19.pdf), which more places are adopting. However, adoption has proceeded at different rates, and not all places have always used the same definition or even made clear what the numbers they publish represent. And because many of our state- and national-level figures are adjusted or summed based on county-level reports, we are now running into issues summing numbers with different definitions.

Some, but not all of these inconsistencies are in the Geographic Exceptions section of the README already. It's a quickly changing situation and we have not always been able to keep up.

We'd like to be sure that we are following the definition we have for cases and deaths as closely as possible but that's going to take more historical research going back to determine when the places we've been gathering data from have changed definitions.


albertsun commented 4 years ago

We've posted a more extensive explanation of this change here: https://github.com/nytimes/covid-19-data/blob/master/PROBABLE-CASES-NOTE.md

albertsun commented 4 years ago

As of yesterday's update, there is no longer a large single-day increase in deaths.