Closed jacob-umich closed 6 months ago
There are 34 separate fields in the chronic data set:
Many of them have vague meanings, but it seems like each entry is a question asked of a specific individual, like "Diabetes among adults". The features whose meaning we would need to identify in this case are anything to do with stratification or DataValue.
When displaying the entries that do not have NaN in the "Response" column, an empty data frame is returned, i.e. every Response value is NaN. There are 109 different questions asked. The most frequent question is "Binge drinking frequency among adults who binge drink" at 5720 instances. Many answers seem like they should be true or false, such as "Food insecure in the past 12 months among households" or "No broadband internet subscription among households", or the value could be the number of yes answers from an aggregated survey. Because of the YearStart and YearEnd features, it might also be that the answer is the number of incidents in that time period.
This table shows only the entries that had the question "Food insecure in the past 12 months among households". So it looks like all the entries are aggregated answers for questions asked (or data collected) over a period of time and limited to a certain geographic area (a state). Based on this, we should choose an augmenting dataset that relates to at least one of the questions asked.
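A minimal sketch of the two checks described above, assuming the CSV has been loaded into a pandas DataFrame called `df`. The three rows here are made-up stand-ins, not real entries from the chronic disease data set:

```python
import pandas as pd

# Toy stand-in for the chronic disease table (illustrative rows only).
df = pd.DataFrame({
    "Question": [
        "Diabetes among adults",
        "Diabetes among adults",
        "Obesity among adults",
    ],
    "Response": [None, None, None],
    "DataValue": [10.2, 11.5, 30.1],
})

# Keep only the rows whose Response is not NaN.
# An empty result means the column never holds a value.
with_response = df[df["Response"].notna()]
print(with_response.empty)

# Count the distinct questions and how often each one appears.
n_questions = df["Question"].nunique()
question_counts = df["Question"].value_counts()
print(n_questions)
print(question_counts.head())
```

`value_counts()` sorts from most to least frequent, so the first row of `question_counts` is the most common question.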
Index | YearStart | YearEnd | LocationAbbr | DataValueUnit | DataValueType | DataValue |
---|---|---|---|---|---|---|
25076 | 2019 | 2021 | AK | % | Crude Prevalence | 9.5 |
18425 | 2019 | 2021 | AL | % | Crude Prevalence | 13.1 |
30716 | 2019 | 2021 | AR | % | Crude Prevalence | 15 |
18702 | 2019 | 2021 | AZ | % | Crude Prevalence | 10.1 |
26942 | 2019 | 2021 | CA | % | Crude Prevalence | 9.6 |
30776 | 2019 | 2021 | CO | % | Crude Prevalence | 10.5 |
28062 | 2019 | 2021 | CT | % | Crude Prevalence | 9.6 |
29477 | 2019 | 2021 | DC | % | Crude Prevalence | 9 |
38811 | 2019 | 2021 | DE | % | Crude Prevalence | 11.2 |
31599 | 2019 | 2021 | FL | % | Crude Prevalence | 9.9 |
32639 | 2019 | 2021 | GA | % | Crude Prevalence | 9.9 |
31072 | 2019 | 2021 | GU | % | Crude Prevalence | nan |
51521 | 2019 | 2021 | HI | % | Crude Prevalence | 9.1 |
45921 | 2019 | 2021 | IA | % | Crude Prevalence | 7 |
44209 | 2019 | 2021 | ID | % | Crude Prevalence | 9.8 |
43543 | 2019 | 2021 | IL | % | Crude Prevalence | 9.4 |
50978 | 2019 | 2021 | IN | % | Crude Prevalence | 9.7 |
46560 | 2019 | 2021 | KS | % | Crude Prevalence | 10.2 |
40045 | 2019 | 2021 | KY | % | Crude Prevalence | 12.3 |
42348 | 2019 | 2021 | LA | % | Crude Prevalence | 14.5 |
65876 | 2019 | 2021 | MA | % | Crude Prevalence | 8.4 |
60423 | 2019 | 2021 | MD | % | Crude Prevalence | 8.7 |
59576 | 2019 | 2021 | ME | % | Crude Prevalence | 9.5 |
65847 | 2019 | 2021 | MI | % | Crude Prevalence | 11.4 |
58773 | 2019 | 2021 | MN | % | Crude Prevalence | 7.4 |
57794 | 2019 | 2021 | MO | % | Crude Prevalence | 12 |
65500 | 2019 | 2021 | MS | % | Crude Prevalence | 15.3 |
53444 | 2019 | 2021 | MT | % | Crude Prevalence | 10.4 |
66918 | 2019 | 2021 | NC | % | Crude Prevalence | 10.9 |
66152 | 2019 | 2021 | ND | % | Crude Prevalence | 7.7 |
77580 | 2019 | 2021 | NE | % | Crude Prevalence | 10.6 |
77195 | 2019 | 2021 | NH | % | Crude Prevalence | 5.4 |
67001 | 2019 | 2021 | NJ | % | Crude Prevalence | 8.3 |
70966 | 2019 | 2021 | NM | % | Crude Prevalence | 11.5 |
75368 | 2019 | 2021 | NV | % | Crude Prevalence | 10.2 |
70292 | 2019 | 2021 | NY | % | Crude Prevalence | 10.3 |
82520 | 2019 | 2021 | OH | % | Crude Prevalence | 10.8 |
85652 | 2019 | 2021 | OK | % | Crude Prevalence | 13.8 |
87941 | 2019 | 2021 | OR | % | Crude Prevalence | 10.3 |
83502 | 2019 | 2021 | PA | % | Crude Prevalence | 9.2 |
89665 | 2019 | 2021 | PR | % | Crude Prevalence | nan |
84235 | 2019 | 2021 | RI | % | Crude Prevalence | 8.4 |
91723 | 2019 | 2021 | SC | % | Crude Prevalence | 12.6 |
88092 | 2019 | 2021 | SD | % | Crude Prevalence | 8.7 |
102851 | 2019 | 2021 | TN | % | Crude Prevalence | 11.2 |
92585 | 2019 | 2021 | TX | % | Crude Prevalence | 13.7 |
99930 | 2019 | 2021 | US | % | Crude Prevalence | 10.4 |
104635 | 2019 | 2021 | UT | % | Crude Prevalence | 11.2 |
94758 | 2019 | 2021 | VA | % | Crude Prevalence | 7.8 |
104227 | 2019 | 2021 | VI | % | Crude Prevalence | nan |
93287 | 2019 | 2021 | VT | % | Crude Prevalence | 7.9 |
98754 | 2019 | 2021 | WA | % | Crude Prevalence | 7.9 |
105791 | 2019 | 2021 | WI | % | Crude Prevalence | 9.9 |
117168 | 2019 | 2021 | WV | % | Crude Prevalence | 14 |
109942 | 2019 | 2021 | WY | % | Crude Prevalence | 11.2 |
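A table like the one above can be produced by filtering on a single question and keeping a handful of columns. This is only a sketch: the DataFrame name `df` is an assumption, the AL and AK values are copied from the table, and the obesity row is a placeholder added so the filter has something to exclude.

```python
import pandas as pd

question = "Food insecure in the past 12 months among households"

df = pd.DataFrame({
    "Question": [question, question, "Obesity among adults"],
    "YearStart": [2019, 2019, 2021],
    "YearEnd": [2021, 2021, 2021],
    "LocationAbbr": ["AL", "AK", "AL"],
    "DataValueUnit": ["%", "%", "%"],
    "DataValueType": ["Crude Prevalence"] * 3,
    "DataValue": [13.1, 9.5, 39.9],  # last value is a placeholder
})

cols = ["YearStart", "YearEnd", "LocationAbbr",
        "DataValueUnit", "DataValueType", "DataValue"]

# Select the rows for one question, keep the columns of interest,
# and order by state abbreviation.
food = df.loc[df["Question"] == question, cols].sort_values("LocationAbbr")
print(food)
```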
Here are some of the other features
Index | LowConfidenceLimit | Stratification1 | StratificationCategory1 | Geolocation | LocationID | DataValue |
---|---|---|---|---|---|---|
18425 | 9.4 | Overall | Overall | POINT (-86.63186076199969 32.84057112200048) | 1 | 13.1 |
18702 | 7.1 | Overall | Overall | POINT (-111.76381127699972 34.865970280000454) | 4 | 10.1 |
25076 | 6.3 | Overall | Overall | POINT (-147.72205903599973 64.84507995700051) | 2 | 9.5 |
26942 | 8.6 | Overall | Overall | POINT (-120.99999953799971 37.63864012300047) | 6 | 9.6 |
28062 | 6.2 | Overall | Overall | POINT (-72.64984095199964 41.56266102000046) | 9 | 9.6 |
29477 | 6.8 | Overall | Overall | POINT (-77.036871 38.907192) | 11 | 9 |
30716 | 11.6 | Overall | Overall | POINT (-92.27449074299966 34.74865012400045) | 5 | 15 |
30776 | 7.2 | Overall | Overall | POINT (-106.13361092099967 38.843840757000464) | 8 | 10.5 |
31072 | nan | Overall | Overall | POINT (144.793731 13.444304) | 66 | nan |
31599 | 8.2 | Overall | Overall | POINT (-81.92896053899966 28.932040377000476) | 12 | 9.9 |
32639 | 7.5 | Overall | Overall | POINT (-83.62758034599966 32.83968109300048) | 13 | 9.9 |
38811 | 7.5 | Overall | Overall | POINT (-75.57774116799965 39.008830667000495) | 10 | 11.2 |
40045 | 9.2 | Overall | Overall | POINT (-84.77497104799966 37.645970271000465) | 21 | 12.3 |
42348 | 11.7 | Overall | Overall | POINT (-92.44568007099969 31.31266064400046) | 22 | 14.5 |
43543 | 7.4 | Overall | Overall | POINT (-88.99771017799969 40.48501028300046) | 17 | 9.4 |
44209 | 7.4 | Overall | Overall | POINT (-114.3637300419997 43.682630005000476) | 16 | 9.8 |
45921 | 4.2 | Overall | Overall | POINT (-93.81649055599968 42.46940091300047) | 19 | 7 |
46560 | 7.7 | Overall | Overall | POINT (-98.20078122699965 38.34774030000045) | 20 | 10.2 |
50978 | 7.4 | Overall | Overall | POINT (-86.14996019399968 39.766910452000445) | 18 | 9.7 |
51521 | 6.4 | Overall | Overall | POINT (-157.85774940299973 21.304850435000446) | 15 | 9.1 |
53444 | 7.8 | Overall | Overall | POINT (-109.42442064499971 47.06652897200047) | 30 | 10.4 |
57794 | 8.7 | Overall | Overall | POINT (-92.56630005299968 38.635790776000476) | 29 | 12 |
58773 | 4.8 | Overall | Overall | POINT (-94.79420050299967 46.35564873600049) | 27 | 7.4 |
59576 | 6.6 | Overall | Overall | POINT (-68.98503133599962 45.254228894000505) | 23 | 9.5 |
60423 | 6.1 | Overall | Overall | POINT (-76.60926011099963 39.29058096400047) | 24 | 8.7 |
65500 | 11.7 | Overall | Overall | POINT (-89.53803082499968 32.745510099000455) | 28 | 15.3 |
65847 | 8.7 | Overall | Overall | POINT (-84.71439026999968 44.6613195430005) | 26 | 11.4 |
65876 | 6.3 | Overall | Overall | POINT (-72.08269067499964 42.27687047000046) | 25 | 8.4 |
66152 | 5.5 | Overall | Overall | POINT (-100.11842104899966 47.47531977900047) | 38 | 7.7 |
66918 | 8.3 | Overall | Overall | POINT (-79.15925046299964 35.466220975000454) | 37 | 10.9 |
67001 | 6.1 | Overall | Overall | POINT (-74.27369128799967 40.13057004800049) | 34 | 8.3 |
70292 | 8.5 | Overall | Overall | POINT (-75.54397042699964 42.82700103200045) | 36 | 10.3 |
70966 | 7.1 | Overall | Overall | POINT (-106.24058098499967 34.52088095200048) | 35 | 11.5 |
75368 | 7.7 | Overall | Overall | POINT (-117.07184056399967 39.493240390000494) | 32 | 10.2 |
77195 | 3.5 | Overall | Overall | POINT (-71.50036091999965 43.65595011300047) | 33 | 5.4 |
77580 | 7.8 | Overall | Overall | POINT (-99.36572062299967 41.6410409880005) | 31 | 10.6 |
82520 | 8.6 | Overall | Overall | POINT (-82.40426005599966 40.06021014100048) | 39 | 10.8 |
83502 | 7.2 | Overall | Overall | POINT (-77.86070029399963 40.79373015200048) | 42 | 9.2 |
84235 | 5.4 | Overall | Overall | POINT (-71.52247031399963 41.70828019300046) | 44 | 8.4 |
85652 | 10.2 | Overall | Overall | POINT (-97.52107021399968 35.47203135600046) | 40 | 13.8 |
87941 | 7.7 | Overall | Overall | POINT (-120.15503132599969 44.56744942400047) | 41 | 10.3 |
88092 | 6.2 | Overall | Overall | POINT (-100.3735306369997 44.353130053000484) | 46 | 8.7 |
89665 | nan | Overall | Overall | POINT (-66.590149 18.220833) | 72 | nan |
91723 | 9.9 | Overall | Overall | POINT (-81.04537120699968 33.998821303000454) | 45 | 12.6 |
92585 | 11.9 | Overall | Overall | POINT (-99.42677020599967 31.827240407000488) | 48 | 13.7 |
93287 | 5.8 | Overall | Overall | POINT (-72.51764079099962 43.62538123900049) | 50 | 7.9 |
94758 | 5.7 | Overall | Overall | POINT (-78.45789046299967 37.54268067400045) | 51 | 7.8 |
98754 | 6.1 | Overall | Overall | POINT (-120.47001078999972 47.52227862900048) | 53 | 7.9 |
99930 | 10.1 | Overall | Overall | nan | 59 | 10.4 |
102851 | 9 | Overall | Overall | POINT (-85.77449091399967 35.68094058000048) | 47 | 11.2 |
104227 | nan | Overall | Overall | POINT (-64.896335 18.335765) | 78 | nan |
104635 | 8.5 | Overall | Overall | POINT (-111.58713063499971 39.360700171000474) | 49 | 11.2 |
105791 | 6.9 | Overall | Overall | POINT (-89.81637074199966 44.39319117400049) | 55 | 9.9 |
109942 | 8.6 | Overall | Overall | POINT (-108.10983035299967 43.23554134300048) | 56 | 11.2 |
117168 | 8.9 | Overall | Overall | POINT (-80.71264013499967 38.66551020200046) | 54 | 14 |
It looks like the stratifications are ways to group the data, such as by age or sex.
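One way to check that reading is to list the levels found under each stratification category. In this sketch the "Sex" and "Age" categories and their levels are hypothetical; the rows shown in this thread only contain "Overall".

```python
import pandas as pd

# Hypothetical stratification values; only "Overall" appears in the tables here.
df = pd.DataFrame({
    "StratificationCategory1": ["Overall", "Sex", "Sex", "Age", "Age"],
    "Stratification1": ["Overall", "Male", "Female", "Age 18-44", "Age 45-64"],
})

# For each category, collect the distinct levels it takes.
levels = df.groupby("StratificationCategory1")["Stratification1"].unique()
print(levels)

# Keeping only the "Overall" rows gives one aggregate value per state and question.
overall = df[df["StratificationCategory1"] == "Overall"]
print(len(overall))
```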
Index | Question | Instances |
---|---|---|
0 | Binge drinking frequency among adults who binge drink | 5720 |
1 | Binge drinking intensity among adults who binge drink | 5680 |
2 | Diabetic ketoacidosis mortality among all people, underlying or contributing cause | 5616 |
3 | Diseases of the heart mortality among all people, underlying cause | 5616 |
4 | Coronary heart disease mortality among all people, underlying cause | 5616 |
5 | Cerebrovascular disease (stroke) mortality among all people, underlying cause | 5616 |
6 | Diabetes mortality among all people, underlying or contributing cause | 5616 |
7 | Chronic liver disease mortality among all people, underlying cause | 5616 |
8 | Asthma mortality among all people, underlying cause | 5616 |
9 | Chronic obstructive pulmonary disease mortality among adults aged 45 years and older, underlying or contributing cause | 5304 |
10 | Chronic obstructive pulmonary disease mortality among adults aged 45 years and older, underlying cause | 5304 |
11 | Diabetes among adults | 5060 |
12 | Chronic obstructive pulmonary disease among adults | 5060 |
13 | Routine checkup within the past year among adults | 5060 |
14 | Depression among adults | 5060 |
15 | Recent activity limitation among adults | 5060 |
16 | Current smoking among adults with chronic obstructive pulmonary disease | 5060 |
17 | Current cigarette smoking among adults | 5060 |
18 | 2 or more chronic conditions among adults | 5060 |
19 | Fair or poor self-rated health status among adults | 5060 |
20 | Frequent mental distress among adults | 5060 |
21 | Frequent physical distress among adults | 5060 |
22 | Binge drinking prevalence among adults | 5060 |
23 | Average recent physically unhealthy days among adults | 5060 |
24 | Average mentally unhealthy days among adults | 5060 |
25 | Obesity among adults | 5060 |
26 | No leisure-time physical activity among adults | 5060 |
27 | Influenza vaccination among adults | 5060 |
28 | Adults with any disability | 5060 |
29 | Quit attempts in the past year among adult current smokers | 5060 |
30 | Current asthma among adults | 4895 |
31 | Influenza vaccination among adults 18-64 who are at increased risk | 4840 |
32 | Lack of health insurance among adults aged 18-64 years | 4840 |
33 | Pneumococcal vaccination among adults aged 18-64 years who are at increased risk | 4840 |
34 | Pneumococcal vaccination among adults aged 65 years and older | 4400 |
35 | Arthritis among adults | 3795 |
36 | Physical inactivity among adults with arthritis | 3795 |
37 | Hospitalization for heart failure as principal diagnosis, Medicare-beneficiaries aged 65 years and older | 3744 |
38 | Hospitalization for chronic obstructive pulmonary disease as principal diagnosis, Medicare-beneficiaries aged 65 years and older | 3744 |
39 | Hospitalization for chronic obstructive pulmonary disease as any diagnosis, Medicare-beneficiaries aged 65 years and older | 3744 |
40 | Invasive cancer (all sites combined), incidence | 2544 |
41 | Consumed vegetables less than one time daily among adults | 2530 |
42 | Visited dentist or dental clinic in the past year among adults | 2530 |
43 | Taking medicine to control high blood pressure among adults with high blood pressure | 2530 |
44 | High cholesterol among adults who have been screened | 2530 |
45 | Have taken an educational class to learn how to manage arthritis symptoms among adults with arthritis | 2530 |
46 | Taking medicine for high cholesterol among adults | 2530 |
47 | Short sleep duration among adults | 2530 |
48 | Consumed fruit less than one time daily among adults | 2530 |
49 | Received health care provider counseling for physical activity among adults with arthritis | 2530 |
50 | Provided care for a friend or family member in the past month among adults | 2530 |
51 | Provided care for someone with dementia or other cognitive impairment in the past month among adults | 2530 |
52 | High blood pressure among adults | 2527 |
53 | Breast cancer mortality among all females, underlying cause | 2496 |
54 | Cervical cancer mortality among all females, underlying cause | 2496 |
55 | Colon and rectum (colorectal) cancer mortality among all people, underlying cause | 2496 |
56 | Lung and bronchial cancer mortality among all people, underlying cause | 2496 |
57 | Prostate cancer mortality among all males, underlying cause | 2496 |
58 | Invasive cancer (all sites combined) mortality among all people, underlying cause | 2496 |
59 | Subjective cognitive decline among adults aged 45 years and older | 2422 |
60 | Discussed symptoms of subjective cognitive decline with a health care professional among adults aged 45 years and older with subjective cognitive decline | 2422 |
61 | No teeth lost among adults aged 18-64 years | 2420 |
62 | Severe joint pain among adults with arthritis | 2420 |
63 | Work limitation due to arthritis among adults aged 18-64 years with arthritis | 2420 |
64 | Activity limitation due to arthritis among adults with arthritis | 2420 |
65 | All teeth lost among adults aged 65 years and older | 2200 |
66 | Colorectal cancer screening among adults aged 45-75 years | 2200 |
67 | Six or more teeth lost among adults aged 65 years and older | 2200 |
68 | Mammography use among women aged 50-74 years | 1758 |
69 | Binge drinking prevalence among high school students | 1540 |
70 | Current electronic vapor product use among high school students | 1540 |
71 | Consumed regular soda at least one time daily among high school students | 1540 |
72 | Current smokeless tobacco use among high school students | 1540 |
73 | Consumed fruit less than one time daily among high school students | 1540 |
74 | Current tobacco use of any tobacco product among high school students | 1540 |
75 | Short sleep duration among high school students | 1540 |
76 | Obesity among high school students | 1540 |
77 | Consumed vegetables less than one time daily among high school students | 1540 |
78 | Met aerobic physical activity guideline among high school students | 1540 |
79 | Alcohol use among high school students | 1540 |
80 | Receipt of evidence-based preventive dental services in the past 12 months among children and adolescents aged 1-17 years | 1430 |
81 | Visited dentist or other oral health care provider in the past 12 months among children and adolescents aged 1-17 years | 1430 |
82 | Unable to pay mortgage, rent, or utility bills in the past 12 months among adults | 1265 |
83 | Met aerobic physical activity guideline for substantial health benefits, adults | 1265 |
84 | Lack of social and emotional support needed among adults | 1265 |
85 | Lack of reliable transportation in the past 12 months among adults | 1265 |
86 | Short sleep duration among children aged 4 months to 14 years | 1248 |
87 | Children and adolescents aged 6-13 years meeting aerobic physical activity guideline | 1248 |
88 | Unemployment rate among people 16 years and older in the labor force | 1040 |
89 | Living below 150% of the poverty threshold among all people | 1040 |
90 | High school completion among adults aged 18-24 | 1040 |
91 | Cigarette smoking during pregnancy among women with a recent live birth | 1026 |
92 | Preventive dental care in the 12 months before pregnancy among women with a recent live birth | 1026 |
93 | Postpartum depressive symptoms among women with a recent live birth | 1026 |
94 | Postpartum checkup among women with a recent live birth | 1026 |
95 | Gestational diabetes among women with a recent live birth | 1026 |
96 | Health insurance coverage after pregnancy among women with a recent live birth | 1026 |
97 | Health insurance coverage in the month before pregnancy among women with a recent live birth | 1026 |
98 | Cervical cancer screening among women aged 21-65 years | 880 |
99 | Current poor mental health among high school students | 770 |
100 | Obesity among WIC children aged 2 to 4 years | 432 |
101 | Life expectancy at birth | 312 |
102 | Proportion of the population protected by a comprehensive smoke-free policy prohibiting smoking in all indoor areas of workplaces and public places, including restaurants and bars | 165 |
103 | Per capita alcohol consumption among people aged 14 years and older | 165 |
104 | Infants who were breastfed at 12 months | 122 |
105 | Infants who were exclusively breastfed through 6 months | 122 |
106 | No broadband internet subscription among households | 104 |
107 | Incidence of treated end-stage kidney disease | 104 |
108 | Food insecure in the past 12 months among households | 55 |
I think there are a lot of candidates for augmenting datasets
I think I will incorporate the data indicated with asterisks because it seems the most related, if not overlapping.
I also added some data from 538 because they have data on the state-level. These include:
Here is a table of all the features from each incorporated dataset. To connect the datasets we will need to identify keys/features that can be related. Below is a list of how each dataset can be connected with the main one. Also, there is a list of how some features will be changed.
Index | main | urbanization_dist | food_prices | nutrition | human_capital | metro_grade | sots_index | sots_words | urbanization_state |
---|---|---|---|---|---|---|---|---|---|
0 | YearStart | stcd | Classification Name | Country Name | Country Name | metro_area | state | phrase | state |
1 | YearEnd | state | Classification Code | Country Code | Country Code | holc_grade | governor | category | urbanindex |
2 | LocationAbbr | cd | Country Name | Series Name | Series Name | white_pop | party | d_speeches | |
3 | LocationDesc | pvi_22 | Country Code | Series Code | Series Code | black_pop | filename | r_speeches | |
4 | DataSource | urbanindex | Series Name | 1960 [YR1960] | 2010 [YR2010] | hisp_pop | url | total | |
5 | Topic | rural | Series Code | 1961 [YR1961] | 2011 [YR2011] | asian_pop | | percent_of_d_speeches | |
6 | Question | exurban | 2017 [YR2017] | 1962 [YR1962] | 2012 [YR2012] | other_pop | | percent_of_r_speeches | |
7 | Response | suburban | 2018 [YR2018] | 1963 [YR1963] | 2013 [YR2013] | total_pop | | chi2 | |
8 | DataValueUnit | urban | 2019 [YR2019] | 1964 [YR1964] | 2014 [YR2014] | pct_white | | pval | |
9 | DataValueType | grouping | 2020 [YR2020] | 1965 [YR1965] | 2015 [YR2015] | pct_black | | | |
10 | DataValue | | 2021 [YR2021] | 1966 [YR1966] | 2016 [YR2016] | pct_hisp | | | |
11 | DataValueAlt | | | 1967 [YR1967] | 2017 [YR2017] | pct_asian | | | |
12 | DataValueFootnoteSymbol | | | 1968 [YR1968] | 2018 [YR2018] | pct_other | | | |
13 | DataValueFootnote | | | 1969 [YR1969] | 2019 [YR2019] | lq_white | | | |
14 | LowConfidenceLimit | | | 1970 [YR1970] | 2020 [YR2020] | lq_black | | | |
15 | HighConfidenceLimit | | | 1971 [YR1971] | | lq_hisp | | | |
16 | StratificationCategory1 | | | 1972 [YR1972] | | lq_asian | | | |
17 | Stratification1 | | | 1973 [YR1973] | | lq_other | | | |
18 | StratificationCategory2 | | | 1974 [YR1974] | | surr_area_white_pop | | | |
19 | Stratification2 | | | 1975 [YR1975] | | surr_area_black_pop | | | |
20 | StratificationCategory3 | | | 1976 [YR1976] | | surr_area_hisp_pop | | | |
21 | Stratification3 | | | 1977 [YR1977] | | surr_area_asian_pop | | | |
22 | Geolocation | | | 1978 [YR1978] | | surr_area_other_pop | | | |
23 | LocationID | | | 1979 [YR1979] | | surr_area_pct_white | | | |
24 | TopicID | | | 1980 [YR1980] | | surr_area_pct_black | | | |
25 | QuestionID | | | 1981 [YR1981] | | surr_area_pct_hisp | | | |
26 | ResponseID | | | 1982 [YR1982] | | surr_area_pct_asian | | | |
27 | DataValueTypeID | | | 1983 [YR1983] | | surr_area_pct_other | | | |
28 | StratificationCategoryID1 | | | 1984 [YR1984] | | | | | |
29 | StratificationID1 | | | 1985 [YR1985] | | | | | |
30 | StratificationCategoryID2 | | | 1986 [YR1986] | | | | | |
31 | StratificationID2 | | | 1987 [YR1987] | | | | | |
32 | StratificationCategoryID3 | | | 1988 [YR1988] | | | | | |
33 | StratificationID3 | | | 1989 [YR1989] | | | | | |
34 | | | | 1990 [YR1990] | | | | | |
35 | | | | 1991 [YR1991] | | | | | |
36 | | | | 1992 [YR1992] | | | | | |
37 | | | | 1993 [YR1993] | | | | | |
38 | | | | 1994 [YR1994] | | | | | |
39 | | | | 1995 [YR1995] | | | | | |
40 | | | | 1996 [YR1996] | | | | | |
41 | | | | 1997 [YR1997] | | | | | |
42 | | | | 1998 [YR1998] | | | | | |
43 | | | | 1999 [YR1999] | | | | | |
44 | | | | 2000 [YR2000] | | | | | |
45 | | | | 2001 [YR2001] | | | | | |
46 | | | | 2002 [YR2002] | | | | | |
47 | | | | 2003 [YR2003] | | | | | |
48 | | | | 2004 [YR2004] | | | | | |
49 | | | | 2005 [YR2005] | | | | | |
50 | | | | 2006 [YR2006] | | | | | |
51 | | | | 2007 [YR2007] | | | | | |
52 | | | | 2008 [YR2008] | | | | | |
53 | | | | 2009 [YR2009] | | | | | |
54 | | | | 2010 [YR2010] | | | | | |
55 | | | | 2011 [YR2011] | | | | | |
56 | | | | 2012 [YR2012] | | | | | |
57 | | | | 2013 [YR2013] | | | | | |
58 | | | | 2014 [YR2014] | | | | | |
59 | | | | 2015 [YR2015] | | | | | |
60 | | | | 2016 [YR2016] | | | | | |
61 | | | | 2017 [YR2017] | | | | | |
62 | | | | 2018 [YR2018] | | | | | |
63 | | | | 2019 [YR2019] | | | | | |
64 | | | | 2020 [YR2020] | | | | | |
65 | | | | 2021 [YR2021] | | | | | |
66 | | | | 2022 [YR2022] | | | | | |
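As a rough illustration of connecting two of these tables on a shared key, here is a sketch that joins the main table to `urbanization_state`. It assumes `urbanization_state.state` holds full state names that match `LocationDesc` in the main table, which has not been verified here, and the `urbanindex` values are placeholders.

```python
import pandas as pd

# Main table rows (DataValue copied from the food insecurity table above).
main = pd.DataFrame({
    "LocationDesc": ["Alabama", "Alaska", "United States"],
    "DataValue": [13.1, 9.5, 10.4],
})

# Placeholder urbanization values; the real numbers come from the 538 file.
urban = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "urbanindex": [9.6, 8.7],
})

# A left join keeps every main row; indicator=True adds a "_merge" column
# that records whether a match was found in the right-hand table.
merged = main.merge(
    urban, how="left", left_on="LocationDesc", right_on="state", indicator=True
)

# Rows with no partner, e.g. the national "United States" aggregate.
unmatched = merged.loc[merged["_merge"] == "left_only", "LocationDesc"].tolist()
print(unmatched)
```

Checking `unmatched` before dropping anything makes it easy to spot keys that differ only in spelling or format.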
The following changes will be made to the features:
I added a file from colab that is looking at the states scoring the "worst" in SDOH categories vs life expectancy (sorry I still don't know how to make files like yours).
Also, what do you mean by features? Is that just the column you are merging on?
For each SDOH category I take the five worst states, count the number of times each state shows up as one of the worst, and then compare that with the 10 lowest states for life expectancy (my thought: worse SDOH = lower life expectancy). That is kind of true, but it's not the most appealing graph I've ever seen.
I think the SDOH avenue has a lot of other potential correlations we can make though, like average income per household, maternal mortality rates, and education performance. There are also a few questions on insurance coverage before/after birth, and we could look at that vs maternal mortality rates. Maybe political leanings if we can find a good survey. Let me know what you think! I added a life expectancy dataset to the data set.
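A minimal sketch of the counting approach described in this comment, not the actual Colab code. It assumes a higher `DataValue` is always worse, which would need to be flipped for questions like high school completion. The food insecurity values for MS, AR, LA and NH come from the table earlier in the thread; the second question's values and the life expectancy numbers are placeholders.

```python
import pandas as pd

def worst_state_counts(sdoh, n_worst=5):
    """Count how often each state is among the n_worst highest values per question."""
    worst = (
        sdoh.sort_values("DataValue", ascending=False)
        .groupby("Question")
        .head(n_worst)
    )
    return worst["LocationAbbr"].value_counts()

sdoh = pd.DataFrame({
    "Question": ["Food insecure"] * 4 + ["No broadband"] * 4,
    "LocationAbbr": ["MS", "AR", "LA", "NH"] * 2,
    "DataValue": [15.3, 15.0, 14.5, 5.4, 20.0, 18.0, 12.0, 8.0],
})

# Placeholder life expectancy values, for illustration only.
life = pd.DataFrame({
    "LocationAbbr": ["MS", "AR", "LA", "NH"],
    "LifeExpectancy": [72.0, 74.0, 73.0, 79.0],
})

# With only four toy states, use 2 instead of 5 worst and 2 instead of 10 lowest.
counts = worst_state_counts(sdoh, n_worst=2)
lowest = set(life.nsmallest(2, "LifeExpectancy")["LocationAbbr"])

# States that are both frequently "worst" and among the lowest life expectancy.
overlap = lowest & set(counts.index)
print(counts)
print(overlap)
```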
@Aerlenbeck
> Also, what do you mean by features? Is that just the column you are merging on?
Yea, I think for the most part a feature is the same as a column.
Yea, I think there are a lot of good analyses that can come from that. Where did you get the life expectancy data from? Did you get a chance to look at any of the other data sets?
@jacob-umich
Life expectancy data is from CDC, not one of her sources for data but I think it should be fine since they are reliable.
I see the datasets and how you plan to link them, but I don't see the correlations you're trying to make. I'm not sure if this is in one of the txt files you uploaded here?
Is urbanization just giving a number value where higher means more urbanization per state? I also think we could combine that with SDOH (more urbanization, better SDOH?) and/or average state income (more urbanization, higher average income?). Maybe we could find some traffic safety data too (more urbanization, more car deaths?).
Food prices and nutrition are hard because they're both per year, but we could ask: are food prices trending up? When food prices go up, does nutrition go down?
I don't understand the data for metro_grade and redlining, what do these columns mean?
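On the per-year food price and nutrition files: their year columns are named like `2019 [YR2019]`, so one option is to reshape them from wide to long before looking at trends. This is only a sketch; the series name and values are placeholders.

```python
import pandas as pd

# Placeholder row in the same wide layout as the food_prices / nutrition files.
wide = pd.DataFrame({
    "Country Name": ["United States"],
    "Series Name": ["Example series"],
    "2019 [YR2019]": [1.0],
    "2020 [YR2020]": [2.0],
})

# Turn each year column into its own row.
long = wide.melt(
    id_vars=["Country Name", "Series Name"],
    var_name="year_label",
    value_name="value",
)

# The first four characters of the label are the year.
long["Year"] = long["year_label"].str.slice(0, 4).astype(int)
print(long[["Series Name", "Year", "value"]])
```

With one row per year, a line plot of `value` against `Year` is straightforward, and two series can be compared year by year.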
@Aerlenbeck I mainly explained the links in the comments above. I guess it would be simpler to just look at life expectancy. Let's just go with that. If we feel like we are running out of correlations to make, we can bring in the other data sets.
> I don't understand the data for metro_grade and redlining, what do these columns mean?
metro_grade measures a degree of segregation from redlining practices https://projects.fivethirtyeight.com/redlining/
@jacob-umich I see your correlations now. I think those would all work well too. We need 15 correlations so I think we'll have enough space for everything.
Has all the data been cleaned now? Just wondering what the next step is and where I can start contributing.
@tsivitse yea, I finished cleaning the main dataset. I think we decided to just use the augmenting data set that @Aerlenbeck found. I think that still needs to be cleaned. You can follow what I did for cleaning. Then we just need to make some plots.
@tsivitse @jacob-umich Yea I added a life expectancy dataset to the data folder on google drive, but I also cleaned that data in the .ipynb file (should be towards the bottom) I added to the EDA section. It's mostly just swapping state codes for their full name so the table can be merged with the CDI dataset.
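The state-code swap described here could look roughly like the sketch below. The column names and the two-entry mapping are assumptions for illustration, not the notebook's actual code, and the life expectancy values are placeholders.

```python
import pandas as pd

# Placeholder life expectancy rows keyed by state code.
life = pd.DataFrame({
    "State": ["AL", "AK"],
    "LifeExpectancy": [73.0, 76.0],
})

# Partial code-to-name mapping; a real one would cover every state.
abbr_to_name = {"AL": "Alabama", "AK": "Alaska"}

# Add the full name so the table can be merged with the CDI LocationDesc column.
life["LocationDesc"] = life["State"].map(abbr_to_name)

# Any code missing from the mapping shows up as NaN and can be listed.
unmapped = life.loc[life["LocationDesc"].isna(), "State"].tolist()
print(life)
print(unmapped)
```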
@tsivitse We need more diverse plot types, if you're looking for something to think about. I am going to make bar charts (12) with a drop-down menu for each SDOH question from the CDI dataset, showing the best 5/worst 5 states for each question, but those will all be one plot type. I'm also planning to do a choropleth plot for the life expectancy.
With the food prices/nutrition data we could probably do line/scatter plots, but I think we need 5 plot types, so at least 1, maybe 2 more, if you see anything that stands out to you.
Confirming if the EDA I'm working on should be in the SDOH file with your cleaned data @Aerlenbeck?
Yea you can add to that file, if I'm understanding your question correctly @tsivitse
I have compiled all the clean data into one database. The server can use this database instead of pulling from two separate ones. This task is done
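For reference, here is a sketch of what combining two cleaned tables into a single database might look like using SQLite. The table names, columns, and life expectancy values are assumptions; the thread does not say which database or schema was actually used.

```python
import sqlite3
import pandas as pd

# Small stand-ins for the two cleaned tables.
cdi = pd.DataFrame({
    "LocationDesc": ["Alabama", "Alaska"],
    "DataValue": [13.1, 9.5],
})
life = pd.DataFrame({
    "LocationDesc": ["Alabama", "Alaska"],
    "LifeExpectancy": [73.0, 76.0],  # placeholder values
})

# An in-memory database keeps the sketch self-contained; a file path works the same way.
conn = sqlite3.connect(":memory:")
cdi.to_sql("cdi", conn, index=False, if_exists="replace")
life.to_sql("life_expectancy", conn, index=False, if_exists="replace")

# The server can then answer a joined query from one place.
joined = pd.read_sql_query(
    """
    SELECT c.LocationDesc, c.DataValue, l.LifeExpectancy
    FROM cdi AS c
    JOIN life_expectancy AS l ON c.LocationDesc = l.LocationDesc
    ORDER BY c.LocationDesc
    """,
    conn,
)
conn.close()
print(joined)
```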
Background
For this project, we need to use the chronic disease dataset and augment it with another dataset. The two datasets must be somewhat related for them to be used together. This task will involve choosing the data set to augment the chronic disease data set and writing scripts to clean both data sets. This work will be completed in a branch called "eda"
Tasks