This research was also published on Flashpub
Updated: 03 Nov 2020
Author: Marc Bevand
The Florida Department of Health (FDOH) has an amazing dataset here: complete line list of the state's COVID-19 cases and deaths. To our knowledge, it is the only open dataset documenting the exact age of hundreds of thousands of cases. This gives us great insights because age is a very important factor affecting the Case Fatality Ratio (CFR) of COVID-19. This repository provides tools to analyze the data:
To download the FDOH line list, browse the data page and click
Download
in the top right corner. However we already made daily archives of the line
list in directory data_fdoh so there is no need to download it.
The line list is in CSV format and the columns are self-explanatory: Age
,
Gender
, County
, boolean Died
, etc. The columns are documented on page 12 of this
FDOH document. Two important columns:
ChartDate
is the date used to create the bar chart in the FDOH dashboard.
EventDate
is the date when symptoms started, or if that date is unknown, the
date lab results were reported to the FDOH. One of our sources claim more
precisely this date is the earlier of onset date, diagnosis date, or test date.
Our scripts trust the FDOH and assume EventDate
is generally the onset date,
but we are aware this column may have data quality issues.
List of our previous forecasts (directory):
Our previous forecasts have been accurate, for example our first forecasts accurately predicted the increase of deaths in July:
forecast_deaths.py
does not rely on death data, but relies solely on case
ages, date of onset of symptoms, and various estimates of the age-stratified CFR.
The script starts by opening the latest line list CSV file (in directory data_fdoh)
or if no such file exists it downloads it from the FDOH line list page.
This gives us (1) the age of every case diagnosed with COVID-19, and (2) the
date of onset of symptoms (EventDate
CSV column.)
The script multiplies the numbers of cases in specific age brackets by the corresponding age-stratified CFR. The age-stratified CFR estimates are issued from multiple independent models:
age_stratified_cfr.py
Model 1 was removed on 2020-07-26 as it is based on the same case data as model 2 but without adjustments for censoring and under-ascertainment:
Model 3 was removed on 2020-07-26 as it did not prove to be as reliable as the others:
Experimentally, we have observed that deaths fall somewhere in the range of estimates produced by these CFR models (eg. model 4 tends to overestimate, and model 5 tends to underestimate.) To refine our estimates, since 2020-07-26 we also produce a best guess forecast:
adj_factor
by which model 5 estimates can be multiplied
to produce a perfect estimate for the present dayadj_factor
is multiplied by 1.05
(5% error margin), for every subsequent day adj_factor
is multiplied by 1.005
(accumulates 0.5% daily error margin)Since the forecast is based on line list case data, ie. detected cases, it is important that we feed it CFR estimates, not IFR estimates. Infection Fatality Ratios take into account undetected cases and thus would not be consistent with line list data.
Then the script assumes death occurs on average 25.1 days after infection,
which is the mean onset-to-death time calculated by gamma.py
.
Finally, it charts the forecast (forecast_deaths.png
). The curves are all
smoothed with a 7-day simple moving average.
The end result is a simple tool that can not only predict deaths up to ~25.1 days ahead of time, but can also estimate past deaths accurately: notice how the colored curves in the generated chart follow closely the observed deaths.
Historical data for observed deaths was fetched from the New York Times Covid-19 data repository and was saved in the file data_deaths/fl_resident_deaths.csv. Death data is only used to draw the black curve. It is not used in the forecasts based on CFR models #1 through #4. Actual deaths are only used indirectly in the forecast based on model #5, because model #5 uses the age-stratified CFR calculated from Florida deaths.
Note: since 2020-07-14 the file data_deaths/fl_resident_deaths.csv
is now
updated by the script data_fdoh/download that obtains deaths directly from
the FDOH line list. The number of deaths calculated by the script is off by one
from the "Florida Resident Deaths" figure shown on the state's
dashboard because our script only accounts for deaths whose Jurisdiction
is FL resident
(consistent with the way NYT does it in their
repository,) whereas the state's dashboard includes 1 additional death
whose Jurisdiction
is Not diagnosed/isolated in FL
.
age_stratified_cfr.py
opens the latest line list CSV file (in directory data_fdoh)
or if no such file exists it downloads it from the FDOH line list page.
It calculates the 14-day moving average of the raw CFR of various age brackets,
with cases ordered by date of onset of symptoms.
The script also calculates the 14-day (short-term) and 35-day (long-term) moving
average of the CFR ajusted for right censoring. The adjustment is performed by
using the parameters of the Gamma distribution of onset-to-death calculated by
gamma.py
. The long-term average is calculated over 35 days shifted by 14 days
(days D-14 through D-48) because despite best efforts to adjust for censoring,
the number of deaths in the last 14 days still remains too incomplete to make the
adjustment accurate (garbage in, garbage out.)
All the moving averages are calculated by assigning equal weight to each day's CFR value. If they were calculated by summing all cases and deaths across the entire 14-day or 35-day period, this would give too much weight to days with more cases and would not be representative of the overall average over the entire 14 or 35 days.
The results of these CFR calculations are charted in age_stratified_cfr.png
.
The short-term adjusted CFR curve helps visualize the magnitude of right censoring: the curve follows the same peaks and valleys as the raw CFR, but deviates more greatly toward the right of the time axis, as censoring increases.
The long-term adjusted CFR curve, especially its last value labelled on the chart, represents our best guess of the age-stratified CFR of COVID-19.
Overall distribution (all ages):
Distribution by age bracket:
gamma.py
calculates the times between onset of symptoms and death
(onset-to-death) and fits them in a Gamma distribution. It creates
multiple charts for different age brackets (gamma*.png
) and outputs a
numerical summary.
The date of onset is known (EventDate
). However FDOH does not publish the
date of death, so gamma.py
infers it from updates made to the line list. We
made daily archives of the line list in directory data_fdoh. Pass these files
as arguments to gamma.py
, it will parse them, and when it detects a new death
(Died
changed to Yes
), it infers the date of death was one day before the
line list was updated, because FDOH updates the line list with data as of the
prior day.
$ ./gamma.py data_fdoh/*.csv
Parsing data_fdoh/2020-06-27-00-00-00.csv
Parsing data_fdoh/2020-06-28-00-00-00.csv
Parsing data_fdoh/2020-06-29-09-51-00.csv
Parsing data_fdoh/2020-06-30-19-26-00.csv
Parsing data_fdoh/2020-07-01-07-49-00.csv
Parsing data_fdoh/2020-07-02-13-38-00.csv
Parsing data_fdoh/2020-07-03-09-16-00.csv
Parsing data_fdoh/2020-07-04-13-44-00.csv
Parsing data_fdoh/2020-07-05-08-44-00.csv
Parsing data_fdoh/2020-07-06-08-18-00.csv
Parsing data_fdoh/2020-07-07-09-06-00.csv
Parsing data_fdoh/2020-07-08-10-21-00.csv
Parsing data_fdoh/2020-07-09-11-09-00.csv
Parsing data_fdoh/2020-07-10-09-46-54.csv
Parsing data_fdoh/2020-07-11-11-04-49.csv
Parsing data_fdoh/2020-07-12-09-53-27.csv
Parsing data_fdoh/2020-07-13-07-57-16.csv
Parsing data_fdoh/2020-07-14-14-09-16.csv
Parsing data_fdoh/2020-07-15-09-09-26.csv
Parsing data_fdoh/2020-07-16-07-58-41.csv
Parsing data_fdoh/2020-07-17-09-17-34.csv
Parsing data_fdoh/2020-07-18-08-58-55.csv
Parsing data_fdoh/2020-07-19-10-12-32.csv
Parsing data_fdoh/2020-07-20-08-30-29.csv
Parsing data_fdoh/2020-07-21-09-22-37.csv
Parsing data_fdoh/2020-07-22-08-26-38.csv
Parsing data_fdoh/2020-07-23-09-44-13.csv
Parsing data_fdoh/2020-07-24-09-32-33.csv
Parsing data_fdoh/2020-07-25-07-38-52.csv
Parsing data_fdoh/2020-07-26-08-01-29.csv
Parsing data_fdoh/2020-07-27-11-17-26.csv
Parsing data_fdoh/2020-07-28-09-08-33.csv
Parsing data_fdoh/2020-07-29-11-28-29.csv
Parsing data_fdoh/2020-07-30-12-31-00.csv
Parsing data_fdoh/2020-07-31-07-45-07.csv
Parsing data_fdoh/2020-08-01-07-45-35.csv
Parsing data_fdoh/2020-08-02-07-44-56.csv
Parsing data_fdoh/2020-08-03-07-38-28.csv
Parsing data_fdoh/2020-08-04-07-57-47.csv
Parsing data_fdoh/2020-08-05-08-38-43.csv
Parsing data_fdoh/2020-08-06-08-29-41.csv
Parsing data_fdoh/2020-08-07-07-57-53.csv
Parsing data_fdoh/2020-08-08-07-54-58.csv
Parsing data_fdoh/2020-08-09-07-39-31.csv
Parsing data_fdoh/2020-08-10-07-53-43.csv
Parsing data_fdoh/2020-08-11-08-01-22.csv
Parsing data_fdoh/2020-08-12-07-52-36.csv
Parsing data_fdoh/2020-08-13-09-40-37.csv
Parsing data_fdoh/2020-08-14-13-14-28.csv
Parsing data_fdoh/2020-08-15-08-01-33.csv
Parsing data_fdoh/2020-08-16-07-41-33.csv
Parsing data_fdoh/2020-08-17-15-00-53.csv
Parsing data_fdoh/2020-08-18-07-42-14.csv
Parsing data_fdoh/2020-08-19-08-02-46.csv
Parsing data_fdoh/2020-08-20-07-53-41.csv
Parsing data_fdoh/2020-08-21-08-06-24.csv
Parsing data_fdoh/2020-08-22-07-43-00.csv
Parsing data_fdoh/2020-08-23-07-55-32.csv
Parsing data_fdoh/2020-08-24-08-00-18.csv
Parsing data_fdoh/2020-08-25-07-53-22.csv
Parsing data_fdoh/2020-08-26-07-53-06.csv
Parsing data_fdoh/2020-08-27-07-50-53.csv
Parsing data_fdoh/2020-08-28-07-43-14.csv
Parsing data_fdoh/2020-08-29-08-08-09.csv
Parsing data_fdoh/2020-08-30-09-58-03.csv
Parsing data_fdoh/2020-08-31-07-43-19.csv
Parsing data_fdoh/2020-09-01-10-43-29.csv
Parsing data_fdoh/2020-09-02-08-04-15.csv
Parsing data_fdoh/2020-09-03-08-08-03.csv
Parsing data_fdoh/2020-09-04-09-08-26.csv
Parsing data_fdoh/2020-09-05-08-02-30.csv
Parsing data_fdoh/2020-09-06-08-48-47.csv
Parsing data_fdoh/2020-09-07-10-23-19.csv
Parsing data_fdoh/2020-09-08-07-51-31.csv
Parsing data_fdoh/2020-09-09-08-02-46.csv
Ages 0-29:
Number of deaths: 59
Gamma distribution params:
mean = 19.8
shape = 1.51
Median: 14.0
Ages 30-39:
Number of deaths: 107
Gamma distribution params:
mean = 22.5
shape = 1.35
Median: 19.0
Ages 40-49:
Number of deaths: 275
Gamma distribution params:
mean = 28.6
shape = 1.99
Median: 23.0
Ages 50-59:
Number of deaths: 698
Gamma distribution params:
mean = 28.0
shape = 2.00
Median: 24.5
Ages 60-69:
Number of deaths: 1562
Gamma distribution params:
mean = 27.1
shape = 2.17
Median: 24.0
Ages 70-79:
Number of deaths: 2508
Gamma distribution params:
mean = 25.3
shape = 2.10
Median: 22.0
Ages 80-89:
Number of deaths: 2814
Gamma distribution params:
mean = 24.1
shape = 1.93
Median: 19.0
Ages 90+:
Number of deaths: 1549
Gamma distribution params:
mean = 22.8
shape = 1.80
Median: 19.0
All ages:
Number of deaths: 9572
Gamma distribution params:
mean = 25.1
shape = 1.97
Median: 21.0
The overall (all ages) mean of 25.1 days is comparable to other published estimates, however our distribution is wider (ie. smaller shape parameter of 1.97) because many deaths occur in the long tail:
We believe our distribution parameters are more accurate because they are based on a much larger sample of 9572 deaths. The long tail may be the result of improved treatments that can maintain patients alive for a longer time.
We notice differences between age brackets. For example patients aged 0-29
years have the shortest onset-to-death time (mean of 12.9 days.) However
for the purpose of calculating the age-stratified CFR (age_stratified_cfr.py
)
and forecasting deaths (forecast_deaths.py
) we do not use age-specific
Gamma parameters. Instead we simply use the parameters of the overall distribution
(all ages.)
Limitations: it is likely our technique does not always accurately identify the
date of death. For example FDOH could be updating the line list many days after
the death occured (contributing to the long tail.) Furthermore, there are data
quality issues in the EventDate
column of the FDOH line list, meaning this
date does not always accurately reflect the date of onset.
heatmap.py
opens the latest line list CSV file (in directory data_fdoh)
or if no such file exists it downloads it from the FDOH line list page.
It creates various heatmaps of cases by age over time:
heatmap.png
the pixel intensity represents the number of casesheatmap_per_capita.png
the pixel intensity represents the number of cases per capitaheatmap_age_share.png
the pixel intensity represents the percentage of cases in the age bracket among all cases in this time periodheatmap.py
also produces a numerical summary:
$ ./heatmap.py
Opening data_fdoh/2020-11-02-10-03-30.csv.gz
Number of COVID-19 cases per 7-day time period in Florida by age bracket over time:
period_start, 00-04, 05-09, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80-84, 85+, median_age
2020-03-02, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 2, 3, 2, 3, 0, 0, 65.5
2020-03-09, 0, 0, 0, 8, 9, 6, 3, 7, 8, 10, 5, 10, 13, 21, 11, 6, 4, 2, 57.0
2020-03-16, 0, 2, 5, 22, 74, 63, 80, 55, 62, 83, 84, 90, 78, 87, 52, 60, 42, 29, 52.0
2020-03-23, 13, 8, 15, 55, 240, 284, 323, 306, 318, 341, 373, 355, 283, 268, 262, 194, 146, 108, 50.0
2020-03-30, 37, 18, 32, 107, 419, 548, 641, 572, 583, 669, 706, 701, 616, 572, 484, 334, 222, 226, 50.0
2020-04-06, 29, 31, 49, 116, 391, 496, 560, 539, 581, 699, 697, 699, 635, 563, 407, 323, 266, 339, 51.0
2020-04-13, 36, 31, 36, 139, 314, 434, 474, 497, 477, 572, 582, 591, 494, 383, 341, 273, 228, 346, 50.0
2020-04-20, 51, 41, 68, 155, 337, 395, 427, 449, 407, 457, 497, 491, 416, 327, 261, 243, 227, 427, 50.0
2020-04-27, 31, 35, 52, 112, 264, 337, 330, 349, 327, 366, 391, 395, 355, 319, 248, 247, 219, 381, 52.0
2020-05-04, 31, 31, 36, 117, 243, 314, 329, 296, 281, 328, 296, 343, 266, 227, 204, 188, 188, 352, 50.0
2020-05-11, 59, 65, 66, 168, 346, 437, 453, 429, 370, 405, 431, 441, 377, 277, 267, 189, 225, 343, 48.0
2020-05-18, 100, 95, 114, 217, 386, 419, 465, 447, 431, 472, 377, 411, 307, 274, 192, 194, 163, 292, 45.0
2020-05-25, 84, 94, 128, 224, 387, 454, 505, 459, 440, 430, 358, 429, 313, 229, 172, 164, 130, 201, 42.0
2020-06-01, 190, 172, 210, 424, 676, 764, 719, 720, 589, 626, 589, 547, 425, 326, 244, 193, 152, 282, 40.0
2020-06-08, 259, 269, 305, 625, 1535, 1370, 1249, 1101, 900, 909, 847, 720, 590, 429, 331, 233, 208, 315, 37.0
2020-06-15, 413, 349, 452, 1361, 3492, 2916, 2434, 1997, 1643, 1623, 1446, 1334, 962, 671, 546, 399, 275, 406, 34.0
2020-06-22, 806, 698, 803, 2875, 6866, 6145, 5035, 4162, 3494, 3379, 3146, 2623, 2006, 1469, 1064, 757, 527, 762, 35.0
2020-06-29, 1086, 1071, 1248, 3541, 7094, 6970, 6337, 5373, 4637, 4559, 4377, 3841, 2867, 2079, 1500, 1089, 699, 819, 36.0
2020-07-06, 1172, 1245, 1632, 4357, 7693, 7822, 7521, 6462, 6140, 6276, 6027, 5492, 4174, 2952, 2178, 1613, 1179, 1486, 39.0
2020-07-13, 1371, 1380, 1928, 4467, 7014, 7287, 7164, 6633, 6367, 6430, 6349, 5731, 4490, 3383, 2569, 1913, 1395, 1719, 41.0
2020-07-20, 1375, 1404, 1979, 3997, 6047, 6193, 6485, 6143, 5810, 6121, 5999, 5410, 4318, 3199, 2384, 1826, 1384, 1762, 41.0
2020-07-27, 1046, 1189, 1599, 3275, 4853, 4968, 5270, 5127, 4667, 5005, 4839, 4793, 3562, 2700, 1995, 1472, 1140, 1439, 42.0
2020-08-03, 756, 910, 1222, 2359, 3517, 3830, 3977, 3992, 3716, 3734, 3702, 3472, 2854, 2111, 1541, 1152, 825, 1212, 42.0
2020-08-10, 706, 804, 1168, 2185, 3132, 3178, 3279, 3247, 3129, 3211, 3282, 3097, 2362, 1773, 1486, 1099, 846, 1242, 43.0
2020-08-17, 528, 590, 808, 1453, 2057, 2198, 2236, 2191, 2020, 2182, 2132, 2125, 1700, 1317, 1016, 787, 635, 786, 43.0
2020-08-24, 402, 480, 642, 1350, 1703, 1568, 1529, 1526, 1465, 1576, 1642, 1500, 1284, 1012, 850, 675, 502, 724, 43.0
2020-08-31, 432, 554, 664, 2668, 3049, 1908, 1883, 1768, 1642, 1763, 1789, 1652, 1435, 1090, 888, 649, 498, 747, 38.0
2020-09-07, 322, 426, 517, 2333, 2378, 1192, 1201, 1178, 1122, 1138, 1152, 1122, 1020, 769, 641, 492, 387, 502, 37.0
2020-09-14, 320, 341, 545, 2347, 2696, 1499, 1458, 1303, 1307, 1348, 1316, 1295, 1043, 827, 701, 550, 451, 556, 37.0
2020-09-21, 284, 366, 547, 1566, 1818, 1238, 1167, 1139, 1019, 1145, 1142, 1185, 963, 700, 629, 477, 325, 367, 39.0
2020-09-28, 307, 392, 538, 1424, 1683, 1338, 1347, 1281, 1208, 1274, 1246, 1139, 1028, 821, 732, 510, 352, 416, 40.0
2020-10-05, 295, 430, 668, 1520, 1891, 1561, 1469, 1384, 1252, 1321, 1286, 1215, 1074, 881, 743, 532, 370, 376, 39.0
2020-10-12, 265, 447, 690, 1948, 2419, 1827, 1605, 1522, 1349, 1525, 1531, 1480, 1194, 955, 818, 530, 325, 383, 38.0
2020-10-19, 354, 561, 861, 2183, 2925, 2457, 2098, 1889, 1728, 1771, 1790, 1777, 1412, 1150, 904, 682, 410, 481, 38.0
2020-10-26, 497, 728, 1072, 2126, 3068, 2829, 2530, 2326, 2149, 2202, 2295, 2240, 1790, 1413, 1176, 822, 487, 527, 39.0
(Last period's data may be incomplete. Age unknown for 1073 out of 812063 cases.)
By default the size of each pixel, or bucket, is 5-year age brackets and 7-day
time periods. This can be changed by editing the variables buckets_ages
and buckets_days
in heatmap.py
.
sort.py
is a tool that strips the ObjectId
column from a line list CSV file
and sorts the rows. This is helpful to compare 2 CSV files published on 2
different days, because the ObjectId
value and the order of rows are not
stable from one file to another.