globaldothealth / list

Repository for Global.health: a data science initiative to enable rapid sharing of trusted and open public health data to advance the response to infectious diseases.
MIT License

Tool to create spreadsheet of basic stats for each country in covid list directory #2934

Closed · ChuckConnell closed this issue 1 year ago

ChuckConnell commented 1 year ago

Right now, it is hard to scan the country directory to see if any country is behind on ingestion date or is missing significant data.

Create a tool that scans all the country CSV files and writes a spreadsheet with: the total rows/cases for each country, the latest case date in the file, and the total deaths in the country.

The death count can be used as a quick sanity check on the data, because it is easy to look up elsewhere the known deaths on any date in any country.

ChuckConnell commented 1 year ago

Here is the code I wrote that solves this issue...

# Computes some basic statistics from the Global.health covid dataset, then puts them in a 
# spreadsheet for easy scanning. 

# The spreadsheet is a sanity check on the g.h covid data. For each country: number of case/list rows,
# number of deaths, date of latest case in file. 

# Chuck Connell, November 2022

import os
import fnmatch
import pandas as pd 

# If set to N > 1, reads only every Nth data row, which gives a fast, close
# approximation. 1 means read all the data, no skipping.
# Just remember to multiply the row and death counts by this number.

SAMPLE_SKIP = 1

# Directory that holds all the g.h CSV country files.  (input)

GH_COUNTRY_DIR = "/Users/chuck/Desktop/COVID Programming/Global.Health/gh_2022-11-14/country/"

# The spreadsheet file we create. (output)

GH_SUMMARY_FILE = "gh_summary.tsv"

# EXECUTABLE CODE

# Show the skip count, if any.

print ("\nSkip count: " + str(SAMPLE_SKIP))

# Get all the file objects in the input directory.

files = os.scandir(GH_COUNTRY_DIR)

# Make a dataframe that will hold the output.

summary_DF = pd.DataFrame(data=None, dtype=str, columns=["file", "latest_case", "rows", "deaths"])

# Loop over all the files in the input directory.

for f in files:

    # Throw out files we don't want.

    if not f.is_file(): continue
    if not fnmatch.fnmatch(f.name, "*.csv"): continue

    # Get and show the name of this file.

    print ("\nWorking on: " + f.name)
    GH_PATH = GH_COUNTRY_DIR + f.name

    # Find the number of rows and last date.

    # The skiprows lambda keeps the header (row 0) and every SAMPLE_SKIP-th data row.
    gh_DF = pd.read_csv(GH_PATH, sep=',', header=0, dtype=str, skiprows=(lambda i: i % SAMPLE_SKIP != 0))
    rows = str(gh_DF.shape[0])
    latest = str(gh_DF["events.confirmed.date"].max())

    # Find the number of deaths in this file.

    gh_DF["events.outcome.value"] = gh_DF["events.outcome.value"].str.lower()
    gh_deaths_DF = gh_DF[gh_DF["events.outcome.value"].isin(["died","death"])]
    deaths = str(gh_deaths_DF.shape[0])

    # Append info for this file to the overall output spreadsheet.

    # (DataFrame.append was deprecated and later removed from pandas, so build a one-row frame and concat.)
    row_DF = pd.DataFrame([{"file": f.name, "latest_case": latest, "rows": rows, "deaths": deaths}])
    summary_DF = pd.concat([summary_DF, row_DF], ignore_index=True)

# Done with file loop. Close the file list.

files.close()

# Write the spreadsheet.

print ("\nWriting summary data to " + GH_SUMMARY_FILE)
summary_DF.to_csv(GH_SUMMARY_FILE, encoding='utf-8', sep='\t', index=False)
jim-sheldon commented 1 year ago

Thanks @ChuckConnell! I would like to turn this into a monitoring tool so that anyone interested can see the status of the whole system at a glance.

ChuckConnell commented 1 year ago

Here is the spreadsheet created by this tool today...

file    latest_case rows    deaths
MN.csv  2020-03-10  2   0
MY.csv  2020-02-15  30  0
AM.csv  2020-03-01  2   0
AZ.csv  2020-03-05  6   0
BA.csv  2020-05-27  2434    0
TZ.csv  2020-03-26  10  0
CD.csv  2020-03-27  60  0
AL.csv  2020-05-27  1054    0
MX.csv  2022-03-31  5278617 312717
LK.csv  2020-01-27  2   0
LI.csv  2020-03-02  1   0
NA.csv  2020-03-27  9   0
MZ.csv  2020-03-26  7   0
CF.csv  2020-03-27  6   0
TN.csv  2020-02-02  2   0
CG.csv  2020-03-27  63  0
AO.csv  2020-03-27  4   0
BT.csv  2020-02-05  2   0
ML.csv  2020-03-27  9   0
BG.csv  2020-03-08  4   0
UY.csv  2020-04-06  415 0
ZA.csv  2020-05-10  10014   0
VC.csv  2020-03-11  1   0
CU.csv  2021-07-04  207318  0
BF.csv  2020-03-27  211 0
NG.csv  2020-05-19  6403    0
MK.csv  2020-05-27  2039    0
NP.csv  2020-01-24  1   0
BS.csv  2020-05-11  90  0
XK.csv  2020-03-14  5   0
TH.csv  2021-01-14  11449   0
BR.csv  2022-03-19  3853973 611212
CV.csv  2020-03-25  4   0
BE.csv  2020-05-27  59259   0
CA.csv  2021-03-23  61983   2
IR.csv  2020-03-05  7026    4
IE.csv  2020-06-16  22181   0
KZ.csv  2020-03-30  49  0
GN.csv  2020-03-27  8   0
RO.csv  2020-03-09  25  0
SK.csv  2020-05-27  1515    0
EG.csv  2020-06-07  33931   0
IS.csv  2020-03-07  34  0
HU.csv  2020-05-27  3791    0
IQ.csv  2020-03-03  36  0
GM.csv  2020-03-18  1   0
ER.csv  2020-03-26  6   0
FI.csv  2020-06-04  6886    0
EE.csv  2022-07-12  585537  0
QA.csv  2020-03-07  18  0
PE.csv  2021-07-03  2052334 0
SI.csv  2020-06-15  1496    0
JO.csv  2020-03-02  2   0
GH.csv  2020-04-09  363 0
PA.csv  2020-03-09  2   0
SZ.csv  2020-03-26  9   0
SM.csv  2020-02-28  2   1
GI.csv  2020-03-03  1   0
DE.csv  2022-03-28  20561131    128764
KH.csv  2020-01-27  2   0
EC.csv  2020-03-08  8   0
ET.csv  2020-05-06  132 0
SN.csv  2020-03-27  111 0
PT.csv  2020-03-23  1565    0
SO.csv  2020-03-27  3   0
HR.csv  2020-05-24  2249    0
IL.csv  2020-03-02  22  0
DK.csv  2020-03-24  250 0
PY.csv  2020-04-05  114 0
SC.csv  2020-03-20  7   0
DJ.csv  2020-03-27  14  0
GQ.csv  2020-03-27  13  0
KE.csv  2020-05-05  512 0
KR.csv  2020-03-12  14544   36
HK.csv  2022-02-06  14871   215
RE.csv  2020-03-15  18  0
SA.csv  2020-03-05  8   0
PL.csv  2020-03-23  214 0
RS.csv  2020-05-27  11272   0
GE.csv  2020-03-17  66  0
GR.csv  2020-03-07  29  0
IN.csv  2020-04-19  20302   33
DZ.csv  2020-04-04  310 0
GA.csv  2020-03-22  6   0
SD.csv  2020-03-27  5   0
RW.csv  2020-03-27  54  0
SE.csv  2020-06-04  40444   0
GW.csv  2020-03-26  2   0
JP.csv  2021-03-29  346345  377
GB.csv  2022-03-29  3453247 3996
SG.csv  2020-03-10  328 0
PK.csv  2020-03-05  12  0
RU.csv  2020-05-09  198301  0
GT.csv  2020-04-06  34  0
KW.csv  2020-03-07  84  0
LT.csv  2020-02-28  1   0
MG.csv  2020-03-27  17  0
BH.csv  2020-03-07  68  0
CL.csv  2020-03-31  1503    0
AD.csv  2020-03-02  1   0
TR.csv  2020-05-27  159779  0
UA.csv  2020-06-14  31135   0
TD.csv  2020-03-26  5   0
CZ.csv  2022-03-28  3799151 0
AE.csv  2020-02-27  40  0
CM.csv  2020-03-27  84  0
AR.csv  2022-02-13  8726835 124088
LB.csv  2020-03-02  12  0
LU.csv  2020-03-06  3   0
MD.csv  2020-06-15  11067   0
CO.csv  2022-11-08  6041790 145275
VN.csv  2020-03-07  32  0
ZM.csv  2020-03-27  22  0
TG.csv  2020-03-27  14  0
AF.csv  2020-03-07  8   0
BJ.csv  2020-03-22  5   0
CN.csv  2022-02-04  6   0
ME.csv  2020-05-05  322 0
MR.csv  2020-03-26  3   0
OM.csv  2020-03-04  30  0
MA.csv  2020-03-04  4   0
NZ.csv  2022-10-23  1811562 0
LR.csv  2020-03-20  3   0
MV.csv  2020-03-07  4   0
BY.csv  2020-05-27  38884   0
AU.csv  2020-03-15  542 2
UG.csv  2020-03-27  18  0
AT.csv  2020-05-25  23983   0
NL.csv  2020-03-24  4150    0
MU.csv  2020-03-27  89  0
CI.csv  2020-03-27  101 0
US.csv  2022-01-09  44402270    668660
TW.csv  2022-02-10  19426   0
VI.csv  2020-03-14  2   0
CH.csv  2022-03-14  555349  0
MT.csv  2020-03-07  1   0
NO.csv  2020-03-24  658 0
ChuckConnell commented 1 year ago

> Thanks @ChuckConnell! I would like to turn this into a monitoring tool so that anyone interested can see the status of the whole system at a glance.

Sounds great. It should be easy to display the TSV in a dashboard.

I suggest sorting by latest_case descending by default, so it is easy to see what just loaded and which countries are stuck in some way.
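
A minimal sketch of that sort, applied just before the to_csv call in the script above (ISO dates sort correctly as strings):

# Sort so the most recently updated countries appear first.
summary_DF = summary_DF.sort_values(by="latest_case", ascending=False)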

jim-sheldon commented 1 year ago

> It should be easy to display the TSV in a dashboard.

Yes, and to include status notes with each row. Some countries don't make their data available anymore, which is also valuable to see in an at-a-glance view.

ChuckConnell commented 1 year ago

> Yes, and to include status notes with each row. Some countries don't make their data available anymore, which is also valuable to see in an at-a-glance view.

A notes field is a good idea.
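
For example, a minimal sketch of such a column added to the summary dataframe (the file name and note text here are hypothetical):

# Free-text status column; filled in manually or from a list of known-retired sources.
summary_DF["notes"] = ""
summary_DF.loc[summary_DF["file"] == "NZ.csv", "notes"] = "source stopped publishing line-list data"  # hypothetical example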

jim-sheldon commented 1 year ago

I made a new repo for this and started porting the above code to query our database directly and put the results in a new db collection and S3 bucket.
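
Roughly, the ported version might look like the sketch below. The Mongo URI, database, collection, and bucket names are placeholders, and the field paths are guessed from the flattened CSV column names, so the real document schema may differ:

import json
import boto3
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # placeholder URI
db = client["covid19"]  # hypothetical database name

# Aggregate per-country counts directly from the case collection.
pipeline = [
    {"$group": {
        "_id": "$location.country",
        "rows": {"$sum": 1},
        "latest_case": {"$max": "$events.confirmed.date"},
    }}
]
report = list(db["cases"].aggregate(pipeline))  # hypothetical collection name

# Store the report in its own collection and push a copy to S3.
db["monitoring"].delete_many({})
db["monitoring"].insert_many(report)

s3 = boto3.client("s3")
s3.put_object(
    Bucket="gh-monitoring",  # placeholder bucket name
    Key="summary.json",
    Body=json.dumps(report).encode("utf-8"),
)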

ChuckConnell commented 1 year ago

> I made a new repo for this and started porting the above code to query our database directly and put the results in a new db collection and S3 bucket.

This is great. It will definitely catch a country that has not updated for X days, or a country that reports an unrealistically small number of cases.

But we should be aware that it will not catch a problem between the internal case/list database and the final (user-facing) CSV files.

jim-sheldon commented 1 year ago

> we should be aware that it will not catch a problem between the internal case/list database and the final (user-facing) CSV files

Agreed. We have a separate script to output CSV files, and that also needs monitoring (we use Slack).
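
For reference, a Slack alert like that can be as simple as an incoming-webhook POST; a minimal sketch (the webhook URL is a placeholder):

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def notify_slack(message):
    # Post a plain-text message to the monitoring channel.
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)

notify_slack("Country CSV export finished.")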

jim-sheldon commented 1 year ago

We now have a new database collection and S3 bucket containing a monitoring report: country, case counts, last updated date, and notes. I need to clean it up, push it, deploy, and automate.

ChuckConnell commented 1 year ago

> We now have a new database collection and S3 bucket containing a monitoring report: country, case counts, last updated date, and notes. I need to clean it up, push it, deploy, and automate.

That sounds great. Quick work!

Total deaths per country would also be nice, as a sanity check on the data's accuracy. For example, New Zealand reports cases through Oct 2022 but zero deaths. Something is clearly wrong with this data, since there are actually about 2000 deaths so far in NZ.
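
That check is easy to script against the summary TSV itself; a sketch (the 10,000-row threshold is an arbitrary choice):

import pandas as pd

summary = pd.read_csv("gh_summary.tsv", sep="\t")

# Flag countries with many cases but zero recorded deaths, which usually
# means missing outcome data rather than a real absence of deaths.
suspect = summary[(summary["rows"].astype(int) > 10000) & (summary["deaths"].astype(int) == 0)]
print(suspect[["file", "latest_case", "rows", "deaths"]])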

jim-sheldon commented 1 year ago

> Something is clearly wrong with this data, since there are actually about 2000 deaths so far in NZ

I think that may be a separate problem, and I would like to roll that work into another item. We would need to investigate more.

jim-sheldon commented 1 year ago

Deployed on AWS Batch and automated via EventBridge. @ChuckConnell results live in an S3 bucket now, LMK if you need help accessing it.

ChuckConnell commented 1 year ago

> Deployed on AWS Batch and automated via EventBridge. @ChuckConnell results live in an S3 bucket now, LMK if you need help accessing it.

Great! Yes, a link to that S3 bucket would be helpful.

jim-sheldon commented 1 year ago

Thanks for creating new issues in the new repo. I posted the S3 object location there.