nytimes / covid-19-data

A repository of data on coronavirus cases and deaths in the U.S.
https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html

counties.csv hasn't updated in 3 days #677

Closed mjwebster closed 2 years ago

mjwebster commented 2 years ago

Describe the issue:

Last update on the counties.csv file seems to be 5/13/22, which is 3 days ago. Other files appear to have been updated earlier this morning.

tiffehr commented 2 years ago

Howdy, @mjwebster! Fan of your work. 😄 We've run into a limit with GitHub's raw file uploads.

This file is now updating. For anyone following along, the original us-counties.csv file is now almost at or over the GitHub file limit and will soon stop updating. Originally posted by @albertsun in https://github.com/nytimes/covid-19-data/issues/674#issuecomment-1123957366

We're recommending people use the year-based county files from now on.

mjwebster commented 2 years ago

Thank you @tiffehr ! I was wondering if you guys might hit that at some point. It certainly is a lot of data.

bdklahn commented 2 years ago

Thank you to @tiffehr , and everyone here, for maintaining this data source for people!!!

Yes . . . I wondered when that file size got to exactly 100 MB . . . :-) And I did see the clear explanation about this in the README before chiming in here.

I know that for a lot of folks it breaks their "API" to have to switch to, e.g., us-counties-2022.csv. Ideally, there are better ways to store snapshot-like data than in text files (e.g. CSV). I don't think people can assume there is any promise of a consistent "API" here. But the change sneaks up on people when the exact same file name now kind of means a different data asset/expectation.

I wonder if it would mitigate things for folks if, say, us-counties.csv became a symlink to the latest year's CSV. (Doing a git pull on a local repo copy is probably less bandwidth-intensive than always streaming down the full, uncompressed GitHub raw version of that file. Git compacts things before any transfer operations, and only changes should be sent.)

Something like . . .

    git mv us-counties.csv us-counties-full-legacy.csv
    ln -s us-counties-2022.csv us-counties.csv
    git add us-counties.csv
    git commit
    git push

(might need/want two commit steps, due to same name for tracked old csv and new symlink)

Even if the old file is "deleted" with a git rm (to make way for a symlink of the same name), it is still easily accessible via the git history, if needed.
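As a rough illustration of pulling a removed file back out of the history, here is a minimal sketch. The helper name `restore_from_history` and the revision in the usage note are hypothetical; it relies only on `git show <rev>:<path>`, which prints a file as it existed at that revision.

```shell
# Hypothetical helper (not part of the repo): write the version of a file
# stored at a given revision to an output path, leaving the working tree alone.
# Arguments: <revision> <path relative to repo root> <output file>
restore_from_history() {
  rev="$1"
  path="$2"
  out="$3"
  # "git show REV:PATH" prints the blob for PATH as recorded in REV.
  git show "${rev}:${path}" > "$out"
}
```

For example, if the commit just before the removal were `HEAD~1` (an assumed revision), `restore_from_history HEAD~1 us-counties.csv us-counties-full-legacy.csv` would recreate the old file under a new name.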

albertsun commented 2 years ago

@bdklahn we wouldn't do that because the new file is not of the same format as it does not contain the whole history of the file.

Unfortunately we think in this case it's best for people to manually see the change and update any processes they are running using the data to use the new format.

bdklahn commented 2 years ago

I understand. But the update already changes the fundamental format of that file to no longer contain the whole history to date. So, I wondered, since the fundamental format was already changed, if another version of fundamental change might be less disruptive for folks. If folks need older data, they can always go back to a previous git snapshot to pull that big file, or whatever.

Anyway, I can pretty easily adjust local scripts, etc., to reconstruct what us-counties.csv used to be, if necessary. I just wondered . . .
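A minimal sketch of that kind of local reconstruction, assuming the year-based files follow the us-counties-YYYY.csv naming seen above, share an identical header row, and each end with a newline. The function name `combine_county_files` is hypothetical.

```shell
# Hypothetical helper: rebuild one combined CSV from several yearly CSVs.
# Arguments: <output file> <yearly file> [<yearly file> ...]
# Assumes every input has the same single header line and ends with a newline.
combine_county_files() {
  out="$1"
  shift
  # Take the header row once, from the first yearly file.
  head -n 1 "$1" > "$out"
  # Append the data rows (everything after line 1) of each file, in order.
  for f in "$@"; do
    tail -n +2 "$f" >> "$out"
  done
}
```

Usage would look something like `combine_county_files us-counties-all.csv us-counties-2020.csv us-counties-2021.csv us-counties-2022.csv`, where the output name and the 2020/2021 file names are assumptions. Passing the files in chronological order keeps the rows in the same date order as the yearly files.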

Simple file names like us-counties.csv, in a regularly updated git repo (vs., say, something like us-counties-YYYY-MM-DD.csv), are often inferred by people to mean "current data". Anyway, all this COVID time series data stuff . . . is hard to snapshot, given "backfill" updates which change "history", and similar.

So I appreciate any of this data wrangling effort you folks are already doing.

We'll deal with it.

Thank you!

gwillen commented 2 years ago

It looks like this caused the Google Cloud "New York Times US Coronavirus Database" dataset to stop updating at the same time: https://console.cloud.google.com/marketplace/product/the-new-york-times/covid19_us_cases

That page says it's based on the data in this repo, but I don't know who maintains it -- is it possible to get it updating again? A tool I use, covid-19.direct, has stopped updating in turn from the Google Cloud dataset.

Thanks!

tiffehr commented 2 years ago

@gwillen We don't know who owns that (but it's fun to see it listed and use it in BigQuery's SQL workspace).

Update: based on the Marketplace page, looks like Google runs it in order to promote BQ.

https://mail.google.com/mail/u/0/?view=cm&fs=1&to=public-data-help@google.com&su=Public%20Datasets%20Issue:%20[INSERT%20ISSUE%20SUBJECT%20HERE]....

I'll ping that email and see who responds.

gwillen commented 2 years ago

Thanks very much!! I greatly appreciate your help.