Add data from CHITREC

danxoneil commented 11 years ago

Descriptions: https://docs.google.com/document/d/1hzEl9gZ8er_xx8fLp4sjvI7qdgB722jVBdRvTNqYt3M/edit?usp=sharing

Data: https://docs.google.com/spreadsheet/ccc?key=0AkpzoQg82DdOdFVsTHdUWGpIU3Q1RE1XUVhESXFyTUE#gid=0

Seems like there is at least one issue with this data: no time component. If there are any other issues, note them here.

JamyiaClark commented 11 years ago

I'm aware that this was assigned to Derek, but how can I view these items? I have to enter Google login information when I click on the links. I have a gmail account, but I received a message that says I need permission to view the material when I entered my personal login information.

derekeder commented 11 years ago

The data contains the following columns, listed by zip code:

Count - total number of patients treated in the zip code
Breast_cancer - Estimated Breast Cancer prevalence in Chicago for adults aged 18-89 based on aggregated Electronic Health Record (EHR) data from a selection of healthcare institutions from 2006 through 2010.
Colorectal_cancer - Estimated Colorectal Cancer prevalence in Chicago for adults aged 18-89 based on aggregated Electronic Health Record (EHR) data from a selection of healthcare institutions from 2006 through 2010.
Prostate_cancer - Estimated Prostate Cancer prevalence in Chicago for adults aged 18-89 based on aggregated Electronic Health Record (EHR) data from a selection of healthcare institutions from 2006 through 2010.
Lung_cancer - Estimated Lung Cancer prevalence in Chicago for adults aged 18-89 based on aggregated Electronic Health Record (EHR) data from a selection of healthcare institutions from 2006 through 2010.
Diabetes - Estimated diabetes prevalence in Chicago for adults aged 18-89 based on aggregated Electronic Health Record (EHR) data from a selection of healthcare institutions from 2006 through 2010.
HTN - Estimated hypertension prevalence in Chicago for adults aged 18-89 based on aggregated Electronic Health Record (EHR) data from a selection of healthcare institutions from 2006 through 2010.
Asthma - Estimated asthma prevalence in Chicago for adults aged 18-89 based on aggregated Electronic Health Record (EHR) data from a selection of healthcare institutions from 2006 through 2010.
COPD - Estimated Chronic Obstructive Pulmonary Disease (COPD) prevalence in Chicago for adults aged 18-89 based on aggregated Electronic Health Record (EHR) data from a selection of healthcare institutions from 2006 through 2010.
CHD - Estimated Congestive Heart Failure (CHF) prevalence in Chicago for adults aged 18-89 based on aggregated Electronic Health Record (EHR) data from a selection of healthcare institutions from 2006 through 2010.

derekeder commented 11 years ago

Issues with this data:

1. The numbers provided are a composite from 2006 to 2010. Is it possible to get data on a yearly basis?
2. The numbers are presented as raw counts instead of rates per 1,000 or 100,000. It will not be meaningful to compare zip codes without performing this calculation.
3. Two zip code entries seem to be malformed or outside Chicago: '6761' and '12311'.

danxoneil commented 11 years ago

@JamyiaClark really sorry it took so long. You should be able to access the docs now.

RoderickJones commented 11 years ago

Derek, in answer to your issues.

1. One of the realities of health data is that pooling years is an accepted, and often preferred, method of increasing the stability and meaning of prevalence estimates when counts or populations are small, or when there is a risk of identifying individuals. For the CHITREC data, my perception is that it has been important to err on the side of caution because of the HIPAA and HITECH Act laws that pertain to electronic data collected through health care (which differs in some ways from data collected by a public health department for public health purposes). If the Atlas cannot come up with a way to present multi-year data, that would be a shame. American Community Survey estimates are very useful, but the smallest geography they are reported in is census tract, and this is only for 5-year periods. IDPH publishes public use datasets that can be analyzed to produce cancer incidence rates by zip code, but only for 5-year periods (see http://www.idph.state.il.us/cancer/statistics.htm#P). I remember that when Abel first provided the data, there was variation from year to year, and it caused you to interpret those changes as meaningful; they weren't and aren't. Part of our collective responsibility should be to package data in a way that minimizes the risk of misinterpretation (or other kinds of harm). That's why American Community Survey only publishes tract estimates for socioeconomic indicators in a 5-year batch; same for IDPH and cancer; same for some of the Chicago data sets. I would like us to be on the same page about these concepts, and would be happy to meet to discuss more background and provide more references, so that you understand pooling years to be a best practice, not a lazy tactic or an attempt at obfuscation.
2. My understanding from the data set descriptions is that what was intended to be reported was the prevalence estimate for each of the conditions, which would be a percentage defined as the number in the column for the condition divided by the number in the first column (count of patients seen), multiplied by 100. I suppose you could ask CHITREC to create the formulas in the Excel sheet, but that is probably also something you could do on your end.
3. 6761 refers to the aggregated zip code area encompassing 60606, 60607, and 60661. 12311 refers to the aggregated zip code area encompassing 60601, 60602, 60603, 60604, 60605, and 60611. The purpose of aggregating these particular zip codes is related to #1 above: the number of people residing in some of these zips (according to the census) is very, very small. Brad Malin is the privacy expert from Vanderbilt who serves as one of the principal investigators on Abel's research team. His recommendation to reduce re-identification risk in the pooled electronic health record data was to "coarsen the zip codes." You may be able to follow the logic by scanning through slides 28-40 of http://tinyurl.com/bu2rkhc. In ArcGIS, the way we aggregate these zips for visualization is simply to assign the same value to each of the individual zips; if possible, you could also adjust the boundary attributes so they appear to be merged.
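The prevalence calculation and the handling of the two aggregate codes described above could be sketched in Python roughly as follows. This is a minimal sketch, not the Atlas import code: the row layout, the `zip` key, and the function names are assumptions made for illustration, and the sample numbers are made up. Only the column names `Count` and `Diabetes` and the two aggregate codes come from this thread.

```python
# Hypothetical sketch: names and row layout are assumptions, not the Atlas code.

# Aggregate codes and their member ZIP codes, as described in this thread.
AGGREGATED_ZIPS = {
    "6761": ["60606", "60607", "60661"],
    "12311": ["60601", "60602", "60603", "60604", "60605", "60611"],
}


def prevalence_percent(condition_count, patient_count):
    """Return condition_count / patient_count * 100, or None if no patients."""
    if not patient_count:
        return None
    return condition_count / patient_count * 100


def expand_aggregates(rows):
    """Assign each aggregate row's values to every one of its member ZIP codes."""
    expanded = []
    for row in rows:
        # Rows with an ordinary ZIP code pass through unchanged.
        members = AGGREGATED_ZIPS.get(row["zip"], [row["zip"]])
        for zip_code in members:
            expanded.append({**row, "zip": zip_code})
    return expanded


# Made-up numbers, for illustration only.
sample_rows = [
    {"zip": "60622", "Count": 2000, "Diabetes": 150},
    {"zip": "6761", "Count": 1000, "Diabetes": 80},
]

for row in expand_aggregates(sample_rows):
    pct = prevalence_percent(row["Diabetes"], row["Count"])
    print(row["zip"], f"{pct:.1f}%")
```

Because every member ZIP code receives the same value, the three ZIP codes behind 6761 would be shaded identically on a map, which matches the "assign the same value to each of the individual zips" approach.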

danxoneil commented 11 years ago

Did this import contain the defect referenced above ("numbers provided are a composite from 2006 to 2010"), or was it new data?

derekeder commented 11 years ago

After importing, it looks like we are missing data for the following Chicago zip codes:

RoderickJones commented 11 years ago

Page 2 of https://data.cityofchicago.org/api/assets/6897A02E-BBE7-469A-8AC2-3BB5D7A4F336 gives a good description of some of the problems that ZIP codes present in this context. When we have data by ZIP code, it has been necessary to figure out how to deal with problems like:

A new ZIP code is created. In this case, 60642 didn't exist until recently; it is a carve-out of 60622. For consistency over time, a solution is to merge 60642 into 60622. The same is true of 60654: it is new, and it is a carve-out of 60610.
Population counts are small, resulting in the need to aggregate with a neighboring ZIP code, or to suppress. 60633 falls into this category. For City data, we aggregate it with 60827. But 60827 falls outside the inclusion criteria set for Abel's database because it does not begin with 606. Therefore it seems 60633 has been suppressed. One way to deal with these scenarios is to create a category called "Insufficient data".
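The two scenarios above could be handled with a small normalization step, sketched here in Python. This is a hypothetical sketch: the function names, the dictionary layout, and the choice to sum counts when merging are assumptions, and the counts are made up. Only the ZIP code pairs and the "Insufficient data" label come from the comment above.

```python
# Hypothetical sketch: names and data layout are assumptions for illustration.

# Carve-out ZIP codes mapped to the parent ZIP code they were split from.
CARVE_OUTS = {"60642": "60622", "60654": "60610"}

INSUFFICIENT = "Insufficient data"


def merge_carve_outs(counts_by_zip):
    """Fold counts for carve-out ZIP codes into their parent ZIP code."""
    merged = {}
    for zip_code, count in counts_by_zip.items():
        parent = CARVE_OUTS.get(zip_code, zip_code)
        merged[parent] = merged.get(parent, 0) + count
    return merged


def label_for_display(merged_counts, expected_zips):
    """Return one value per expected ZIP code, labelling missing ones explicitly."""
    return {z: merged_counts.get(z, INSUFFICIENT) for z in expected_zips}


# Made-up counts, for illustration only.
raw = {"60622": 120, "60642": 30, "60610": 75}
merged = merge_carve_outs(raw)
display = label_for_display(merged, ["60622", "60610", "60633"])
print(display)
```

Labelling a suppressed ZIP code such as 60633 explicitly, rather than leaving it blank, makes it clear on the map that the gap is deliberate and not an import error.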

derekeder commented 11 years ago

Thanks for this insight, Eric. The downtown zip codes have been merged according to your instructions.

danxoneil commented 11 years ago

Based on the explanations above, the fact that the data has been imported, and the fact that CDPH is going to provide better rate ranges in #47, I think this issue can be closed. Please feel free to reopen it if I am wrong :-)

smartchicago / chicago-atlas

Add data from CHITREC #37