Closed rharron closed 4 years ago
Interesting. If there is a way to extract the commit messages and the date the commit happened, then it seems like we can get the most correct version for each day by choosing the most recent commit for that day.
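A minimal sketch of that selection rule (the function name and the sample shas/timestamps here are illustrative; the real ones come from `git log`): given (timestamp, commit) pairs, keep the newest commit for each calendar day.

```python
from datetime import datetime

def latest_per_day(commits):
    """commits: iterable of (timestamp, sha) pairs; keep the newest
    commit on each calendar day."""
    best = {}
    for ts, sha in commits:
        day = ts.date()
        # Replace the stored commit only if this one is more recent.
        if day not in best or ts > best[day][0]:
            best[day] = (ts, sha)
    return {day: sha for day, (ts, sha) in best.items()}

# Two real timestamps from this repo's history, plus one invented
# earlier-same-day commit ("deadbee") to show the selection at work.
commits = [
    (datetime(2020, 5, 18, 15, 22, 45), "50e60ee"),
    (datetime(2020, 5, 19, 9, 0, 0), "deadbee"),   # hypothetical earlier update
    (datetime(2020, 5, 19, 14, 15, 14), "a68f42f"),
]
result = latest_per_day(commits)  # 2020-05-19 maps to the 14:15 commit
```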
https://github.com/rharron/CovidVisualization/tree/clean_data
Started a branch to work on cleaning up the data
Yeah, I can get the date and commit message by slightly modifying the bash script I wrote. That'll work pretty well with the modzcta file, but we might just have to hardcode workarounds for the past issues with the tests file.
I'm working on tweaking the bash script
Okay, I've gotten the script so that the outputted files contain the date and time (and are ordered from oldest to newest). E.g.

```
data-by-modzcta.csv.025.2020-06-1212:56:42-0400_b92f6e5.csv
data-by-modzcta.csv.024.2020-06-1112:54:29-0400_b6ae2b9.csv
data-by-modzcta.csv.023.2020-06-1012:50:30-0400_b820b68.csv
data-by-modzcta.csv.022.2020-06-0912:37:29-0400_5ecc5d1.csv
data-by-modzcta.csv.021.2020-06-0812:56:54-0400_3f21405.csv
data-by-modzcta.csv.020.2020-06-0712:04:18-0400_6328e0b.csv
data-by-modzcta.csv.019.2020-06-0613:03:24-0400_9e0c1fe.csv
data-by-modzcta.csv.018.2020-06-0512:59:12-0400_582977c.csv
data-by-modzcta.csv.017.2020-06-0413:44:39-0400_f094fb2.csv
data-by-modzcta.csv.016.2020-06-0312:57:04-0400_eb3b8e9.csv
data-by-modzcta.csv.015.2020-06-0212:54:17-0400_53f5d79.csv
data-by-modzcta.csv.014.2020-06-0113:03:36-0400_62444c1.csv
data-by-modzcta.csv.013.2020-05-3112:59:33-0400_9b5cd4d.csv
data-by-modzcta.csv.012.2020-05-3013:02:51-0400_3e9a27c.csv
data-by-modzcta.csv.011.2020-05-2912:55:50-0400_8636c55.csv
data-by-modzcta.csv.010.2020-05-2812:55:35-0400_65efb1f.csv
data-by-modzcta.csv.009.2020-05-2712:58:01-0400_498a068.csv
data-by-modzcta.csv.008.2020-05-2614:09:46-0400_d52fdfe.csv
data-by-modzcta.csv.007.2020-05-2513:22:07-0400_f19c0bc.csv
data-by-modzcta.csv.006.2020-05-2413:17:49-0400_9332798.csv
data-by-modzcta.csv.005.2020-05-2313:28:17-0400_8d88b2c.csv
data-by-modzcta.csv.004.2020-05-2213:41:28-0400_3cbb3b7.csv
data-by-modzcta.csv.003.2020-05-2113:36:17-0400_d3a1873.csv
data-by-modzcta.csv.002.2020-05-2012:39:17-0400_0c4a03c.csv
data-by-modzcta.csv.001.2020-05-1914:15:14-0400_a68f42f.csv
data-by-modzcta.csv.000.2020-05-1815:22:45-0400_50e60ee.csv
```
How does that sound?
Sounds good to me!
I had one question about the clean data code I just merged: is there somewhere where you tell the program to only deal with the .csv files? Like what happens if other files end up in that folder, will the code try to process them, too?
Yes, I think it will try to process the non-csv files and error out. I can add a filter so that it only tries to read a file if its name ends with .csv
Ok, I updated the function so that it only tries to read csv files and pushed to master. I took it a step further so that it only considers files which start with data-by-modzcta or tests-by-zcta. This way you can put other csv files in that folder if we choose to.
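The filter described above looks something like this (the function name is hypothetical; the two prefixes are the ones from this thread):

```python
# Only snapshot files matching these prefixes are considered; anything
# else dropped into the folder is ignored.
PREFIXES = ("data-by-modzcta", "tests-by-zcta")

def snapshot_files(names):
    """Keep only the snapshot csvs from a folder listing."""
    # str.startswith accepts a tuple of prefixes, so one call covers both.
    return [n for n in names
            if n.endswith(".csv") and n.startswith(PREFIXES)]
```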
I took this data and plotted the total number of covid cases for some of the modified zctas. This is what I got:
(Code can be found in the total_covid_plot.py script in the plots branch)
I find this odd because the total is a cumulative count, so I would expect this plot to be monotonically increasing. The spike seems to suggest there is a data quality issue some time in the last week of April.
Yeah that's weird! What day is that? Maybe there's something in the commit messages or in the news about it.
I took a look at each of the modified zctas and found which dates had a local maximum.
Data Date | Number of local maxima |
---|---|
2020-04-26 | 176 |
2020-05-10 | 1 |
2020-05-20 | 57 |
2020-05-28 | 2 |
2020-06-02 | 1 |
2020-06-07 | 1 |
2020-06-08 | 113 |
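The counts above came from checking each zcta's series for interior local maxima, which a cumulative count should never have. A minimal sketch of that check (the function name is hypothetical; the sample series is the 10282 record shown in the next table):

```python
import pandas as pd

def local_max_dates(totals):
    """totals: cumulative case counts as a Series indexed by date,
    in chronological order. Returns the dates that are strict local
    maxima -- each one flags a data problem."""
    prev, nxt = totals.shift(1), totals.shift(-1)
    # NaN comparisons at the endpoints are False, so the first and
    # last dates can never be flagged.
    return list(totals.index[(totals > prev) & (totals > nxt)])

# The 10282 record from the 5/10 table below:
s = pd.Series([184, 197, 196],
              index=["2020-05-08", "2020-05-10", "2020-05-11"])
```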
The record on 5/10:
MODIFIED_ZCTA | DATA_DATE | TOTAL |
---|---|---|
10282 | 2020-05-08 | 184 |
10282 | 2020-05-10 | 197 |
10282 | 2020-05-11 | 196 |
Some records from 6/8:
MODIFIED_ZCTA | DATA_DATE | TOTAL |
---|---|---|
10001 | 2020-06-07 | 2342 |
10001 | 2020-06-08 | 2380 |
10001 | 2020-06-09 | 2373 |
10002 | 2020-06-07 | 4986 |
10002 | 2020-06-08 | 5062 |
10002 | 2020-06-09 | 5048 |
10009 | 2020-06-07 | 4566 |
10009 | 2020-06-08 | 4644 |
10009 | 2020-06-09 | 4629 |
10011 | 2020-06-07 | 3764 |
10011 | 2020-06-08 | 3835 |
10011 | 2020-06-09 | 3830 |
Some of these might be due to rounding errors from when I created the TOTAL column for the csvs that didn't have it:

```python
data['TOTAL'] = (100*data['COVID_CASE_COUNT']/data['PERCENT_POSITIVE']).apply(lambda x: round(x) if pd.notnull(x) else x)
```
Here are some "fine details" that need to be cleaned:
Added some code to clean_data.py to address the above "fine cleaning" in da71a8ede3a5769660ae8672cf2c351ebfe16bf4
The 04/26 data spike is an error the nychealth people are aware of and are not correcting (see here: https://github.com/nychealth/coronavirus-data/commit/3ec3fa97d44c5b3054477c9c0998fa6d466bca72#r39152282 and the issue referenced there). Basically, they're like "we corrected it the next day", by which they mean "the 04/27 update is correct, so who cares if the 04/26 update is correct!?". It's a somewhat frustrating interaction to read in their issues. So, I propose we just trash the 04/26 data in the cleaning function. Sound good?
Wow, haha. Yeah, that seems like the only thing we can do.
Ok, updated the code to drop the data on 4/26
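For reference, dropping one known-bad snapshot is a one-line filter in pandas (the function name is hypothetical, and the column name is assumed to match the DATA_DATE used in the tables above):

```python
import pandas as pd

def drop_bad_snapshot(df, bad_date="2020-04-26"):
    """Remove every row belonging to the known-bad 04/26 snapshot."""
    return df[df["DATA_DATE"] != bad_date]

# Toy frame with an obviously bad 04/26 row:
df = pd.DataFrame({"DATA_DATE": ["2020-04-25", "2020-04-26", "2020-04-27"],
                   "TOTAL": [100, 9999, 110]})
cleaned = drop_bad_snapshot(df)
```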
Sweet! Thanks!
I noticed an issue with the data on March 25th being the same as on March 24th, and I traced it back to a problem with the bash script I wrote to get the files (it turns out git log only shows you the files changed in each commit, not all files present at that point). I'm almost done fixing this by reimplementing the bash script using gitpython. That will make the bash script unnecessary and shorten your clean-up code! But right now, I'm going to bed!
Alright, reimplementation using gitpython is done (in e3621601a86363effd90b4e959d70174c6475c1b)!
I think the last thing to do about this issue is to decide whether to have the output be a multiindex dataframe or the dataframe we currently have. My preference would be for the multiindex since that seems like the natural way for the data to be presented and it makes it easier to run statistics that vary with time. Thoughts?
Pretty neat that you were able to get all the data at once all from within python!
Oh thank you!
Once I saw how concise it could be, I was quite excited!
Now that we've settled the issue of multiindices (#8), I feel like we can close this issue.
In Issue #2, I'd mentioned that these files might be updated more than once a day, among other potential issues. Looking at the git log, the data-by-modzcta.csv file has been updated exactly once a day (between 12pm and 3:30pm Eastern time) since it was created on May 18. The tests-by-zcta.csv file is almost as regular, but not quite:
Note that tests-by-zcta.csv was first added on 04/01, so that's probably as far back as we can go with our visualization.
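The once-a-day check described above can be sketched like this (the function name is hypothetical; the window and sample timestamps match the snapshot filenames earlier in the thread, taken as Eastern time):

```python
from datetime import datetime, time

def one_update_per_day(timestamps, start=time(12, 0), end=time(15, 30)):
    """True if there is at most one update per calendar day and every
    update falls inside the expected afternoon window."""
    days = [ts.date() for ts in timestamps]
    in_window = all(start <= ts.time() <= end for ts in timestamps)
    # A duplicate date means some day got more than one update.
    return len(days) == len(set(days)) and in_window

stamps = [datetime(2020, 5, 18, 15, 22, 45),
          datetime(2020, 5, 19, 14, 15, 14)]
ok = one_update_per_day(stamps)
```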