Closed rharron closed 4 years ago
Interesting. If there is a way to extract the commit messages and the date the commit happened, then it seems like we can get the most correct version for each day by choosing the most recent commit for that day.
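A minimal sketch of that selection rule (the function name and the sample shas/timestamps here are illustrative; the real ones come from `git log`): given (timestamp, commit) pairs, keep the newest commit for each calendar day.

```python
from datetime import datetime

def latest_per_day(commits):
    """commits: iterable of (timestamp, sha) pairs; keep the newest
    commit on each calendar day."""
    best = {}
    for ts, sha in commits:
        day = ts.date()
        # Replace the stored commit only if this one is more recent.
        if day not in best or ts > best[day][0]:
            best[day] = (ts, sha)
    return {day: sha for day, (ts, sha) in best.items()}

# Two real timestamps from this repo's history, plus one invented
# earlier-same-day commit ("deadbee") to show the selection at work.
commits = [
    (datetime(2020, 5, 18, 15, 22, 45), "50e60ee"),
    (datetime(2020, 5, 19, 9, 0, 0), "deadbee"),   # hypothetical earlier update
    (datetime(2020, 5, 19, 14, 15, 14), "a68f42f"),
]
result = latest_per_day(commits)  # 2020-05-19 maps to the 14:15 commit
```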
https://github.com/rharron/CovidVisualization/tree/clean_data
Started a branch to work on cleaning up the data
Yeah, I can get the date and commit message by slightly modifying the bash script I wrote. That'll work pretty well with the modzcta file, but we might just have to hardcode workarounds for the past issues with the tests file.
I'm working on tweaking the bash script
Okay, I've gotten the script so that the outputted files contain the date and time (and are ordered from oldest to newest). E.g.

```
data-by-modzcta.csv.025.2020-06-1212:56:42-0400_b92f6e5.csv
data-by-modzcta.csv.024.2020-06-1112:54:29-0400_b6ae2b9.csv
data-by-modzcta.csv.023.2020-06-1012:50:30-0400_b820b68.csv
data-by-modzcta.csv.022.2020-06-0912:37:29-0400_5ecc5d1.csv
data-by-modzcta.csv.021.2020-06-0812:56:54-0400_3f21405.csv
data-by-modzcta.csv.020.2020-06-0712:04:18-0400_6328e0b.csv
data-by-modzcta.csv.019.2020-06-0613:03:24-0400_9e0c1fe.csv
data-by-modzcta.csv.018.2020-06-0512:59:12-0400_582977c.csv
data-by-modzcta.csv.017.2020-06-0413:44:39-0400_f094fb2.csv
data-by-modzcta.csv.016.2020-06-0312:57:04-0400_eb3b8e9.csv
data-by-modzcta.csv.015.2020-06-0212:54:17-0400_53f5d79.csv
data-by-modzcta.csv.014.2020-06-0113:03:36-0400_62444c1.csv
data-by-modzcta.csv.013.2020-05-3112:59:33-0400_9b5cd4d.csv
data-by-modzcta.csv.012.2020-05-3013:02:51-0400_3e9a27c.csv
data-by-modzcta.csv.011.2020-05-2912:55:50-0400_8636c55.csv
data-by-modzcta.csv.010.2020-05-2812:55:35-0400_65efb1f.csv
data-by-modzcta.csv.009.2020-05-2712:58:01-0400_498a068.csv
data-by-modzcta.csv.008.2020-05-2614:09:46-0400_d52fdfe.csv
data-by-modzcta.csv.007.2020-05-2513:22:07-0400_f19c0bc.csv
data-by-modzcta.csv.006.2020-05-2413:17:49-0400_9332798.csv
data-by-modzcta.csv.005.2020-05-2313:28:17-0400_8d88b2c.csv
data-by-modzcta.csv.004.2020-05-2213:41:28-0400_3cbb3b7.csv
data-by-modzcta.csv.003.2020-05-2113:36:17-0400_d3a1873.csv
data-by-modzcta.csv.002.2020-05-2012:39:17-0400_0c4a03c.csv
data-by-modzcta.csv.001.2020-05-1914:15:14-0400_a68f42f.csv
data-by-modzcta.csv.000.2020-05-1815:22:45-0400_50e60ee.csv
```
How does that sound?
Sounds good to me!
I had one question about the clean data code I just merged: is there somewhere where you tell the program to only deal with the .csv files? Like what happens if other files end up in that folder, will the code try to process them, too?
Yes, I think it will try to process the non-csv files and error out. I can add a filter so that it only tries to read a file if its name ends with .csv
Ok, I updated the function so that it only tries to read csv files and pushed to master. I took it a step further so that it only considers files which start with data-by-modzcta or tests-by-zcta. This way you can put other csv files in that folder if we choose to.
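The filter described above looks something like this (the function name is hypothetical; the two prefixes are the ones from this thread):

```python
# Only snapshot files matching these prefixes are considered; anything
# else dropped into the folder is ignored.
PREFIXES = ("data-by-modzcta", "tests-by-zcta")

def snapshot_files(names):
    """Keep only the snapshot csvs from a folder listing."""
    # str.startswith accepts a tuple of prefixes, so one call covers both.
    return [n for n in names
            if n.endswith(".csv") and n.startswith(PREFIXES)]
```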
I took this data and plotted the total number of covid cases for some of the modified zctas. This is what I got:
(Code can be found in the total_covid_plot.py script in the plots branch)
I find this odd because the total is a cumulative count, so I would expect this plot to be monotonically increasing. The spike seems to suggest there is a data quality issue some time in the last week of April.
Yeah that's weird! What day is that? Maybe there's something in the commit messages or in the news about it.
I took a look at each of the modified zctas and found which dates had a local maximum.
Data Date | Number of local maxima |
---|---|
2020-04-26 | 176 |
2020-05-10 | 1 |
2020-05-20 | 57 |
2020-05-28 | 2 |
2020-06-02 | 1 |
2020-06-07 | 1 |
2020-06-08 | 113 |
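The counts above came from checking each zcta's series for interior local maxima, which a cumulative count should never have. A minimal sketch of that check (the function name is hypothetical; the sample series is the 10282 record shown in the next table):

```python
import pandas as pd

def local_max_dates(totals):
    """totals: cumulative case counts as a Series indexed by date,
    in chronological order. Returns the dates that are strict local
    maxima -- each one flags a data problem."""
    prev, nxt = totals.shift(1), totals.shift(-1)
    # NaN comparisons at the endpoints are False, so the first and
    # last dates can never be flagged.
    return list(totals.index[(totals > prev) & (totals > nxt)])

# The 10282 record from the 5/10 table below:
s = pd.Series([184, 197, 196],
              index=["2020-05-08", "2020-05-10", "2020-05-11"])
```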
The record on 5/10:
MODIFIED_ZCTA | DATA_DATE | TOTAL |
---|---|---|
10282 | 2020-05-08 | 184 |
10282 | 2020-05-10 | 197 |
10282 | 2020-05-11 | 196 |
Some records from 6/8:
MODIFIED_ZCTA | DATA_DATE | TOTAL |
---|---|---|
10001 | 2020-06-07 | 2342 |
10001 | 2020-06-08 | 2380 |
10001 | 2020-06-09 | 2373 |
10002 | 2020-06-07 | 4986 |
10002 | 2020-06-08 | 5062 |
10002 | 2020-06-09 | 5048 |
10009 | 2020-06-07 | 4566 |
10009 | 2020-06-08 | 4644 |
10009 | 2020-06-09 | 4629 |
10011 | 2020-06-07 | 3764 |
10011 | 2020-06-08 | 3835 |
10011 | 2020-06-09 | 3830 |
Some of these might be due to rounding errors from when I created the TOTAL column for the csvs that didn't have it:

```python
data['TOTAL'] = (100*data['COVID_CASE_COUNT']/data['PERCENT_POSITIVE']).apply(lambda x: round(x) if pd.notnull(x) else x)
```
Here are some "fine details" that need to be cleaned:
Added some code to clean_data.py to address the above "fine cleaning" in da71a8ede3a5769660ae8672cf2c351ebfe16bf4
The 04/26 data spike is an error the nychealth people are aware of and are not correcting (see here: https://github.com/nychealth/coronavirus-data/commit/3ec3fa97d44c5b3054477c9c0998fa6d466bca72#r39152282 and the issue referenced there). Basically, they're like "we corrected it the next day", by which they mean "the 04/27 update is correct, so who cares if the 04/26 update is correct!?". It's a somewhat frustrating interaction to read in their issues. So, I propose we just trash the 04/26 data in the cleaning function. Sound good?
Wow, haha. Yeah, that seems like the only thing we can do.
Ok, updated the code to drop the data on 4/26
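For reference, dropping one known-bad snapshot is a one-line filter in pandas (the function name is hypothetical, and the column name is assumed to match the DATA_DATE used in the tables above):

```python
import pandas as pd

def drop_bad_snapshot(df, bad_date="2020-04-26"):
    """Remove every row belonging to the known-bad 04/26 snapshot."""
    return df[df["DATA_DATE"] != bad_date]

# Toy frame with an obviously bad 04/26 row:
df = pd.DataFrame({"DATA_DATE": ["2020-04-25", "2020-04-26", "2020-04-27"],
                   "TOTAL": [100, 9999, 110]})
cleaned = drop_bad_snapshot(df)
```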
Sweet! Thanks!
I noticed an issue with the data on March 25th being the same as on March 24th, and I traced it back to a problem with the bash script I wrote to get the files (it turns out git log only shows you the files changed in each commit, not all files present at that point). I'm almost done fixing this by reimplementing the bash script using gitpython. That will make the bash script unnecessary and shorten your clean-up code! But right now, I'm going to bed!
Alright, reimplementation using gitpython is done (in e3621601a86363effd90b4e959d70174c6475c1b)!
I think the last thing to do about this issue is to decide whether to have the output be a multiindex dataframe or the dataframe we currently have. My preference would be for the multiindex since that seems like the natural way for the data to be presented and it makes it easier to run statistics that vary with time. Thoughts?
Pretty neat that you were able to get all the data at once all from within python!
Oh thank you!
Once I saw how concise it could be, I was quite excited!
Now that we've settled the issue of multiindices (#8), I feel like we can close this issue.
In Issue #2, I'd mentioned that these files might be updated more than once a day, among other potential issues. Looking at the git log, the data-by-modzcta.csv file has been updated exactly once a day (between 12pm and 3:30pm Eastern time) since it was created on May 18. The tests-by-zcta.csv file is almost as regular, but not quite:
Note that tests-by-zcta.csv was first added on 04/01, so that's probably as far back as we can go with our visualization.
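The once-a-day check described above can be sketched like this (the function name is hypothetical; the window and sample timestamps match the snapshot filenames earlier in the thread, taken as Eastern time):

```python
from datetime import datetime, time

def one_update_per_day(timestamps, start=time(12, 0), end=time(15, 30)):
    """True if there is at most one update per calendar day and every
    update falls inside the expected afternoon window."""
    days = [ts.date() for ts in timestamps]
    in_window = all(start <= ts.time() <= end for ts in timestamps)
    # A duplicate date means some day got more than one update.
    return len(days) == len(set(days)) and in_window

stamps = [datetime(2020, 5, 18, 15, 22, 45),
          datetime(2020, 5, 19, 14, 15, 14)]
ok = one_update_per_day(stamps)
```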