Closed: ChuckConnell closed this issue 1 year ago
Here is the code I wrote that solves this issue...
# Computes some basic statistics from the Global.health covid dataset, then puts them in a
# spreadsheet for easy scanning.
# The spreadsheet is a sanity check on the g.h covid data. For each country: number of case/list rows,
# number of deaths, date of latest case in file.
# Chuck Connell, November 2022

import os
import fnmatch
import pandas as pd

# If set to >1, skips rows to give a fast, close approximation. 1 means read all the data, no skipping.
# Just remember to multiply results by this number.
SAMPLE_SKIP = 1

# Directory that holds all the g.h CSV country files. (input)
GH_COUNTRY_DIR = "/Users/chuck/Desktop/COVID Programming/Global.Health/gh_2022-11-14/country/"

# The spreadsheet file we create. (output)
GH_SUMMARY_FILE = "gh_summary.tsv"

# EXECUTABLE CODE

# Show the skip count, if any.
print("\nSkip count: " + str(SAMPLE_SKIP))

# Get all the file objects in the input directory.
files = os.scandir(GH_COUNTRY_DIR)

# Collect one summary row per file; the output dataframe is built after the loop.
# (DataFrame.append is deprecated and removed in pandas 2.x, so we gather dicts instead.)
summary_rows = []

# Loop over all the files in the input directory.
for f in files:

    # Throw out files we don't want. (fnmatch needs the file name, not the DirEntry object.)
    if not f.is_file(): continue
    if not fnmatch.fnmatch(f.name, "*.csv"): continue

    # Get and show the name of this file.
    print("\nWorking on: " + f.name)
    GH_PATH = os.path.join(GH_COUNTRY_DIR, f.name)

    # Find the number of rows and last date.
    gh_DF = pd.read_csv(GH_PATH, sep=',', header=0, dtype=str, skiprows=(lambda i: i % SAMPLE_SKIP != 0))
    rows = str(gh_DF.shape[0])
    latest = str(gh_DF["events.confirmed.date"].max())

    # Find the number of deaths in this file.
    gh_DF["events.outcome.value"] = gh_DF["events.outcome.value"].str.lower()
    gh_deaths_DF = gh_DF[gh_DF["events.outcome.value"].isin(["died", "death"])]
    deaths = str(gh_deaths_DF.shape[0])

    # Record the info for this file.
    summary_rows.append({"file": f.name, "latest_case": latest, "rows": rows, "deaths": deaths})

# Done with file loop. Close the file list.
files.close()

# Build and write the spreadsheet.
summary_DF = pd.DataFrame(summary_rows, dtype=str, columns=["file", "latest_case", "rows", "deaths"])
print("\nWriting summary data to " + GH_SUMMARY_FILE)
summary_DF.to_csv(GH_SUMMARY_FILE, encoding='utf-8', sep='\t', index=False)
Thanks @ChuckConnell I would like to turn this into a monitoring tool so that anyone interested can see the status of the whole system at a glance.
Here is the spreadsheet created by this tool today...
file latest_case rows deaths
MN.csv 2020-03-10 2 0
MY.csv 2020-02-15 30 0
AM.csv 2020-03-01 2 0
AZ.csv 2020-03-05 6 0
BA.csv 2020-05-27 2434 0
TZ.csv 2020-03-26 10 0
CD.csv 2020-03-27 60 0
AL.csv 2020-05-27 1054 0
MX.csv 2022-03-31 5278617 312717
LK.csv 2020-01-27 2 0
LI.csv 2020-03-02 1 0
NA.csv 2020-03-27 9 0
MZ.csv 2020-03-26 7 0
CF.csv 2020-03-27 6 0
TN.csv 2020-02-02 2 0
CG.csv 2020-03-27 63 0
AO.csv 2020-03-27 4 0
BT.csv 2020-02-05 2 0
ML.csv 2020-03-27 9 0
BG.csv 2020-03-08 4 0
UY.csv 2020-04-06 415 0
ZA.csv 2020-05-10 10014 0
VC.csv 2020-03-11 1 0
CU.csv 2021-07-04 207318 0
BF.csv 2020-03-27 211 0
NG.csv 2020-05-19 6403 0
MK.csv 2020-05-27 2039 0
NP.csv 2020-01-24 1 0
BS.csv 2020-05-11 90 0
XK.csv 2020-03-14 5 0
TH.csv 2021-01-14 11449 0
BR.csv 2022-03-19 3853973 611212
CV.csv 2020-03-25 4 0
BE.csv 2020-05-27 59259 0
CA.csv 2021-03-23 61983 2
IR.csv 2020-03-05 7026 4
IE.csv 2020-06-16 22181 0
KZ.csv 2020-03-30 49 0
GN.csv 2020-03-27 8 0
RO.csv 2020-03-09 25 0
SK.csv 2020-05-27 1515 0
EG.csv 2020-06-07 33931 0
IS.csv 2020-03-07 34 0
HU.csv 2020-05-27 3791 0
IQ.csv 2020-03-03 36 0
GM.csv 2020-03-18 1 0
ER.csv 2020-03-26 6 0
FI.csv 2020-06-04 6886 0
EE.csv 2022-07-12 585537 0
QA.csv 2020-03-07 18 0
PE.csv 2021-07-03 2052334 0
SI.csv 2020-06-15 1496 0
JO.csv 2020-03-02 2 0
GH.csv 2020-04-09 363 0
PA.csv 2020-03-09 2 0
SZ.csv 2020-03-26 9 0
SM.csv 2020-02-28 2 1
GI.csv 2020-03-03 1 0
DE.csv 2022-03-28 20561131 128764
KH.csv 2020-01-27 2 0
EC.csv 2020-03-08 8 0
ET.csv 2020-05-06 132 0
SN.csv 2020-03-27 111 0
PT.csv 2020-03-23 1565 0
SO.csv 2020-03-27 3 0
HR.csv 2020-05-24 2249 0
IL.csv 2020-03-02 22 0
DK.csv 2020-03-24 250 0
PY.csv 2020-04-05 114 0
SC.csv 2020-03-20 7 0
DJ.csv 2020-03-27 14 0
GQ.csv 2020-03-27 13 0
KE.csv 2020-05-05 512 0
KR.csv 2020-03-12 14544 36
HK.csv 2022-02-06 14871 215
RE.csv 2020-03-15 18 0
SA.csv 2020-03-05 8 0
PL.csv 2020-03-23 214 0
RS.csv 2020-05-27 11272 0
GE.csv 2020-03-17 66 0
GR.csv 2020-03-07 29 0
IN.csv 2020-04-19 20302 33
DZ.csv 2020-04-04 310 0
GA.csv 2020-03-22 6 0
SD.csv 2020-03-27 5 0
RW.csv 2020-03-27 54 0
SE.csv 2020-06-04 40444 0
GW.csv 2020-03-26 2 0
JP.csv 2021-03-29 346345 377
GB.csv 2022-03-29 3453247 3996
SG.csv 2020-03-10 328 0
PK.csv 2020-03-05 12 0
RU.csv 2020-05-09 198301 0
GT.csv 2020-04-06 34 0
KW.csv 2020-03-07 84 0
LT.csv 2020-02-28 1 0
MG.csv 2020-03-27 17 0
BH.csv 2020-03-07 68 0
CL.csv 2020-03-31 1503 0
AD.csv 2020-03-02 1 0
TR.csv 2020-05-27 159779 0
UA.csv 2020-06-14 31135 0
TD.csv 2020-03-26 5 0
CZ.csv 2022-03-28 3799151 0
AE.csv 2020-02-27 40 0
CM.csv 2020-03-27 84 0
AR.csv 2022-02-13 8726835 124088
LB.csv 2020-03-02 12 0
LU.csv 2020-03-06 3 0
MD.csv 2020-06-15 11067 0
CO.csv 2022-11-08 6041790 145275
VN.csv 2020-03-07 32 0
ZM.csv 2020-03-27 22 0
TG.csv 2020-03-27 14 0
AF.csv 2020-03-07 8 0
BJ.csv 2020-03-22 5 0
CN.csv 2022-02-04 6 0
ME.csv 2020-05-05 322 0
MR.csv 2020-03-26 3 0
OM.csv 2020-03-04 30 0
MA.csv 2020-03-04 4 0
NZ.csv 2022-10-23 1811562 0
LR.csv 2020-03-20 3 0
MV.csv 2020-03-07 4 0
BY.csv 2020-05-27 38884 0
AU.csv 2020-03-15 542 2
UG.csv 2020-03-27 18 0
AT.csv 2020-05-25 23983 0
NL.csv 2020-03-24 4150 0
MU.csv 2020-03-27 89 0
CI.csv 2020-03-27 101 0
US.csv 2022-01-09 44402270 668660
TW.csv 2022-02-10 19426 0
VI.csv 2020-03-14 2 0
CH.csv 2022-03-14 555349 0
MT.csv 2020-03-07 1 0
NO.csv 2020-03-24 658 0
Sounds great. It should be easy to display the TSV in a dashboard.
I suggest sorting by latest_case descending, by default, so it is easy to see what just loaded and what countries are stuck in some way.
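A minimal sketch of that default sort, assuming the TSV columns produced above (file, latest_case, rows, deaths); the sample rows here are just a few lines copied from the table:

```python
import io
import pandas as pd

# A few illustrative rows in the same shape as gh_summary.tsv.
tsv = (
    "file\tlatest_case\trows\tdeaths\n"
    "MX.csv\t2022-03-31\t5278617\t312717\n"
    "MN.csv\t2020-03-10\t2\t0\n"
    "CO.csv\t2022-11-08\t6041790\t145275\n"
)
summary_DF = pd.read_csv(io.StringIO(tsv), sep="\t", dtype=str)

# Sort by latest_case descending; ISO dates (YYYY-MM-DD) sort correctly as plain strings.
summary_DF = summary_DF.sort_values("latest_case", ascending=False).reset_index(drop=True)

print(summary_DF["file"].tolist())  # → ['CO.csv', 'MX.csv', 'MN.csv']
```

With this ordering a dashboard shows recently loaded countries at the top, while stuck ones sink to the bottom.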
Yes, and to include with each row notes about status. Some countries don't make their data available anymore, which is also valuable in an at-a-glance view.
A notes field is a good idea.
I made a new repo for this and started porting the above code to query our database directly and put the results in a new db collection and S3 bucket.
This is great. It will definitely catch a country that has not updated for X days, or a country that reports an unrealistically small number of cases.
But we should be aware that it will not catch a problem between the internal case/list database and the final (user facing) CSV files.
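The "not updated for X days" check could be sketched like this, against the summary table's columns; the 30-day threshold and the as-of date are assumptions for illustration:

```python
import io
import pandas as pd

# Illustrative summary rows; the as-of date matches the 2022-11-14 dump named above.
tsv = (
    "file\tlatest_case\trows\tdeaths\n"
    "CO.csv\t2022-11-08\t6041790\t145275\n"
    "NZ.csv\t2022-10-23\t1811562\t0\n"
    "MN.csv\t2020-03-10\t2\t0\n"
)
summary_DF = pd.read_csv(io.StringIO(tsv), sep="\t")

STALE_DAYS = 30  # the "X days" threshold; 30 is an arbitrary example value
as_of = pd.Timestamp("2022-11-14")

# Flag countries whose newest case is older than the threshold.
age = as_of - pd.to_datetime(summary_DF["latest_case"])
stale = summary_DF[age > pd.Timedelta(days=STALE_DAYS)]

print(stale["file"].tolist())  # → ['MN.csv']
```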
Agreed. We have a separate script to output CSV files, and that also needs monitoring (we use Slack).
We now have a new database collection and S3 bucket containing a monitoring report: country, case counts, last updated date, and notes. I need to clean it up, push it, deploy, and automate.
That sounds great. Quick work!
The total deaths per country would be nice also, as a sanity check on how accurate the data is. For example, New Zealand reports cases through Oct 2022 but zero deaths. Something is clearly wrong with this data, since there are actually about 2000 deaths so far in NZ.
I think that may be a separate problem, and I would like to roll that work into another item. We would need to investigate more.
Deployed on AWS Batch and automated via Eventbridge. @ChuckConnell results live in an S3 bucket now, LMK if you need help accessing.
Great! Yes, a link to that S3 bucket would be helpful.
Thanks for creating new issues in the new repo. I posted the S3 object location there.
Right now, it is hard to scan the country directory to see if any country is behind on ingestion date or is missing significant data.
Create a tool that scans all the country CSV files and writes a spreadsheet with: total rows/cases for this country, last date found on a case, total deaths in the country.
The death count can be used as a quick sanity check on the data, because the known deaths for any date and country are easy to look up elsewhere.
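That check could be sketched roughly as follows. The KNOWN_DEATHS reference figures are placeholders (NZ's ~2000 comes from the discussion above), so treat this as an illustration, not real reference data:

```python
import io
import pandas as pd

tsv = (
    "file\tlatest_case\trows\tdeaths\n"
    "NZ.csv\t2022-10-23\t1811562\t0\n"
    "MX.csv\t2022-03-31\t5278617\t312717\n"
)
summary_DF = pd.read_csv(io.StringIO(tsv), sep="\t")

# Hypothetical externally sourced death totals as of each file's latest date.
KNOWN_DEATHS = {"NZ.csv": 2000, "MX.csv": 320000}

# Flag countries reporting far fewer deaths than the known figure (the 0.5 factor is arbitrary).
suspicious = [
    row["file"]
    for _, row in summary_DF.iterrows()
    if row["deaths"] < 0.5 * KNOWN_DEATHS[row["file"]]
]
print(suspicious)  # → ['NZ.csv']
```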