Discovering the Predictive Attributes of Diabetes Mellitus in the Pima Indian Population
Maurica Nez
Colorado State University Global
MIS581: Business Intelligence and Data Analytics
Instructor: Dr. Morad
Due Date: 10/1/2023
Module 7: Portfolio Milestone 2
Abstract
Diabetes is a disease that afflicts and kills millions of people around the world every year. Medical costs worldwide are projected to reach $627.3 billion by 2035 (Zhang & Liu, 2022). Type 2 diabetes mellitus may be preventable. This research investigates the predictive ability of variables from the Pima Indians Diabetes data set using data mining techniques. Which variables are correlated with diabetes? Which variables can predict diabetes? Using statistical analysis software, this research found correlations with four variables. The predictive analysis confirmed the correlations. A classification tree model selected the same four variables between both partitions of the data. The predictive model can determine the presence of disease based on glucose levels, pregnancy, age, and insulin levels. After cleaning, the data set was limited to a few hundred participants. A larger data set would be ideal for further analysis.
INTRODUCTION
The World Health Organization has declared diabetes to be a serious global issue (WHO, 2016). Diabetes mellitus is one of the top four “priority noncommunicable diseases (NCDs) targeted for action by world leaders” (WHO, 2016, p. 6). Diabetes mellitus “is the leading cause of kidney failure” (Vital Signs, 2018, p. 1). In 2012, diabetes killed 1.5 million people and was associated with an additional “2.2 million deaths, by increasing the risks of cardiovascular and other diseases” (WHO, 2016, p. 6). The number of cases of diabetes in the world population has almost quadrupled “since 1980” (WHO, 2016, p. 6).
The pathology of diabetes mellitus depends on the type. Cells rely on basic sugar for the energy to operate at optimum levels. Insulin facilitates the transport of sugar into the cell. The absence of insulin prevents cells from receiving sugar, and the cells die from lack of energy (Klandorf & Stark, 2022). Sugar has limited places to go, and the kidneys become the body’s primary mechanism for removing it from the bloodstream. The kidneys then become overtaxed and cannot effectively perform their primary function of regulating the blood. Once the kidneys succumb to the disease, treatment options are limited to kidney dialysis or transplant (Vital Signs, 2018).
There are three types of diabetes mellitus. Type 1 diabetes mellitus is not preventable because it is hereditary and is triggered by environmental factors (Klandorf & Stark, 2022). Children have the highest incidence of type 1 diabetes mellitus. The mechanism is autoimmune: the insulin-producing cells in the pancreas are destroyed by the body’s T lymphocytes (Klandorf & Stark, 2022). Insulin is no longer produced, and sugar cannot be transported into the cell. There is no cure for type 1 diabetes mellitus. However, it is treatable with insulin injections (Klandorf & Stark, 2022).
Type 2 diabetes mellitus is the opposite of type 1 and is more common.
Instead of an absence of insulin, too much insulin is produced, causing the cells to reduce the number of receptors (Klandorf & Stark, 2022). The cells become insulin resistant due to the lack of transport receptors. The initial symptoms are less pronounced than those of type 1, and early detection is difficult (Chang et al., 2022). Therefore, type 2 has been labeled “the silent killer” (Klandorf & Stark, 2022, p. 1). Type 2 diabetes does not have a cure, but it is treatable. The advancement of type 2 diabetes mellitus symptoms may be slowed, and kidney failure may be preventable (Chang et al., 2022). However, the difficulty lies in the timing of discovery, because the condition may take a decade to present with symptoms (Abbasi et al., 2016). Treatment for type 2 diabetes mellitus is more effective the earlier the condition is discovered (WHO, 2016).
Gestational diabetes is a temporary condition that appears during pregnancy. Those who have had gestational diabetes are at increased risk of developing type 2 diabetes mellitus later. Moreover, the child from the affected pregnancy also has an increased risk of developing type 2 diabetes during childhood and of suffering from obesity (Klandorf & Stark, 2022).
Studies indicate that the Native American population has the highest incidence of diabetes when compared to other populations (Vital Signs, 2018). Possible explanations include genetic predisposition, a lack of healthcare and education, and the cultural food paradigm. The Native American population is in crisis as it attempts to manage treatment for the individuals affected, because access to healthcare is expensive and limited. Understanding the condition and trusting treatment is another barrier that must be overcome to improve the situation. Within the Native American tribes, food is linked to family and celebration of togetherness.
The food often prepared is what their ancestors were fed, while held in captivity, from commodities provided by the American government. The commodities came to be considered their traditional food. Eating poorly became a generational learned behavior.
Diabetes mellitus is a global problem that has no cure (Anonymous, 2002). However, the earlier the discovery, the more positive the prognosis (WHO, 2016). Treatment is possible for type 2 diabetes mellitus. However, finding a means to prevent the condition would be ideal (Anonymous, 2002). Identifying the risk factors of type 2 diabetes mellitus is a preliminary step in the research process. Any research that supports combating a worldwide problem is worth pursuing. The Pima Indians Diabetes data set is selected to represent the Native American tribes affected by diabetes mellitus. The tribe has been greatly impacted by the effects of the disease and has data readily available for research purposes.
OBJECTIVES
• To identify which risk factors may be used to predict the occurrence of type 2 diabetes mellitus.
• To explore the variables of the data set.
• To discover which variables are correlated with the outcome variable.
• To determine whether the correlated variables can predict the outcome variable.
• To determine which additional variables would be helpful for future analysis.
• To identify which variables a community could address through preventative measures.
OVERVIEW OF STUDY
This research will use data mining techniques to investigate the variables of the Pima Indians Diabetes data set. Descriptive and predictive analysis will be performed using three different analytical tools. The distribution of the data will be analyzed using Tableau and SAS. The R programming language will be used to clean and partition the data and to create a classification tree. The train and test models will be compared to one another using sensitivity and specificity testing.
The correlation matrix and models will identify what variables can predict diabetes by answering each research question and determine support for hypothesis statements.
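The partitioning step mentioned in the overview can be illustrated with a minimal Python sketch (the study itself used R). The 70/30 ratio, the stratification by outcome, the seed, and the function name stratified_split are all assumptions made for illustration; the source does not state how the partitions were drawn. Only the class counts (262 without diabetes, 130 with) come from the Findings.

```python
import random

def stratified_split(labels, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx), keeping the class balance in both partitions.

    The 0.7 fraction and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                       # randomize within each class
        cut = round(len(idxs) * train_fraction)
        train_idx.extend(idxs[:cut])
        test_idx.extend(idxs[cut:])
    return sorted(train_idx), sorted(test_idx)

# Class counts reported in the Findings: 262 without diabetes (0), 130 with (1).
labels = [0] * 262 + [1] * 130
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Stratifying keeps roughly one third of each partition positive, so sensitivity and specificity measured on the two partitions are computed on comparable class mixes.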
Research Questions
This research seeks to determine which variables in the data set have a correlation and predictive relationship with the outcome variable. Both research questions will explore the nature of the relationship between the independent and dependent variables.
Figure 1 Data Dictionary
Methods
The descriptive analysis will be conducted in Tableau and SAS to examine the distribution of the data using histograms. To test the first set of hypothesis statements, a correlation analysis will be conducted using RStudio to determine the correlation coefficient value of each independent variable. For the second set of hypothesis statements, the train and test partitions will be used to create a classification tree using RStudio. A confusion matrix will test the sensitivity and specificity of the models. The variables the classification tree selects will be compared to the variables found in the correlation matrix to determine which variables predict diabetes.
Limitations
The data set contains variables that have been examined using various methods. However, the literature does encourage the continued study of the variables using multiple models. Applying data mining techniques to this data set appears to be relatively new: very few of the articles reviewed referenced data mining or machine learning techniques. More models, combinations of models, more data sets, and different variables need to be explored in future research.
Ethical Considerations
The primary ethical concern of this project is whether the participants provided informed consent that allowed the use of their data (O’Leary, 2017). There is no indication in the data set documentation that the participants were uninformed (R-Project, n.d.). The data set does not provide any participant identification. For the purposes of this study, the participants are anonymous. Therefore, the ethical principle of anonymity is upheld (O’Leary, 2017).
FINDINGS
The summary statistics are displayed in Figure 2. The mean number of pregnancies is 3.3, with a range of 0 to 17. The mean glucose value is 122.6, with a range of 56 to 198. The mean blood pressure is 70.66, with a range of 24 to 110. The mean triceps skinfold is 29.15, with a range of 7 to 63. The mean insulin value is 156.06, with a range of 14 to 846. The mean body mass index is 33.09, with a range of 18.20 to 67.10. The mean pedigree value is 0.52, with a range of 0.08 to 2.42. The mean age is 30.86, with a range of 21 to 81. Of the participants, 262 did not have diabetes and 130 did.
Figure 2 Summary Statistics of Pima Indians Diabetes Data Set
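The source says the data were cleaned but does not give the rule. One plausible reading, offered here only as an assumption, is that rows with a zero in a measurement that cannot physically be zero were dropped; the reported minimums for glucose, blood pressure, triceps skinfold, insulin, and body mass index are all above zero, which is consistent with that reading. A minimal Python sketch (the study used R), with illustrative column names and invented rows:

```python
# Columns where a 0 is ASSUMED to encode a missing measurement.
# This rule is an assumption; the source does not state how the data were cleaned.
ZERO_MEANS_MISSING = ("glucose", "pressure", "triceps", "insulin", "mass")

def drop_incomplete(rows):
    """Keep only rows with a non-zero value in every column listed above."""
    return [r for r in rows if all(r[c] != 0 for c in ZERO_MEANS_MISSING)]

# Invented rows for illustration only; these are not records from the data set.
rows = [
    {"pregnant": 2, "glucose": 100, "pressure": 70, "triceps": 20, "insulin": 80, "mass": 25.0},
    {"pregnant": 0, "glucose": 120, "pressure": 72, "triceps": 30, "insulin": 0, "mass": 31.0},
    {"pregnant": 0, "glucose": 110, "pressure": 68, "triceps": 25, "insulin": 95, "mass": 27.5},
]

clean = drop_incomplete(rows)
print(len(rows), "rows before,", len(clean), "rows after")
```

The pregnancy count is deliberately left out of the rule, since zero pregnancies is a valid value (the reported range is 0 to 17).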
Figure 3 displays the correlation matrix for the data set. Values closest to positive 1 are selected. Glucose and insulin have a correlation value of 0.6. Age and pregnancy have a correlation value of 0.7. Therefore, the first null hypothesis is not supported.
Figures 4 through 7 display the distributions of the Age, Glucose, Insulin, and Pregnancies variables, respectively. All four variables are skewed to the right.
Figure 3 Correlation Matrix
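For readers who want to see how a correlation matrix like the one in Figure 3 is built, here is a minimal Python sketch (the study used RStudio). It assumes the Pearson coefficient, which the source does not name, and the function names and sample values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Map every pair of column names to its Pearson coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Invented values for illustration; not taken from the data set.
columns = {
    "age": [21, 25, 30, 40, 50],
    "pregnant": [0, 1, 2, 5, 6],
    "glucose": [95, 130, 88, 150, 120],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Each coefficient lies between -1 and 1, the diagonal is 1, and the matrix is symmetric, so only the pairs above the diagonal need to be inspected.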
Figure 4 Distribution for Age
Figure 5 Distribution of Glucose
Figure 6 Distribution of the Insulin Variable
Figure 7 Distribution of Pregnancies
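The right skew read off the histograms can also be checked numerically. The sketch below, in Python rather than the Tableau and SAS tools the study used, computes the moment-based sample skewness; a positive value indicates a longer right tail. The function name and the sample ages are hypothetical.

```python
import statistics

def skewness(values):
    """Moment-based sample skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Invented ages for illustration: most values are low, a few are much higher.
ages = [21, 22, 22, 24, 25, 29, 41, 60]

print("skewness:", round(skewness(ages), 2))
print("mean:", statistics.mean(ages), "median:", statistics.median(ages))
```

In a right-skewed sample the mean is typically pulled above the median, which gives a quick second check alongside the skewness value.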
There are two versions of the classification tree model. Figure 8 displays the variables the model selected using the train partition of the data set. The model selected glucose, insulin, age, and number of pregnancies as the variables that determine the predicted diagnosis of diabetes mellitus. Figure 9 displays the model constructed with the test partition of the data set. Glucose, age, and number of pregnancies were the final variables selected to predict diabetes status. The sensitivity value for the train model was 0.73 (Figure 10). The specificity value for the train model was 0.91 (Figure 10). For the test model, the sensitivity improved to 0.75 (Figure 11). The specificity was reduced to 0.89 (Figure 11). The variables selected by the classification tree model align with the variables found to have high correlation values. Therefore, the second null hypothesis is not supported.
Figure 8 Classification Tree from Train Data
Figure 9 Classification Tree from Test Data
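A classification tree picks its variables by searching for the split that best separates the classes. The Python sketch below shows that search for a single numeric variable using Gini impurity, a common criterion; the source does not state which criterion or R package produced Figures 8 and 9, so this is an illustration of the general technique, with hypothetical function names and invented values.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2p(1-p) for positive share p."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (None, parent_gini) if no split improves on the unsplit node.
    """
    best = (None, gini(labels))
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (t, weighted)
    return best

# Invented glucose readings and outcomes for illustration only.
glucose = [90, 100, 110, 140, 150, 160]
diabetes = [0, 0, 0, 1, 1, 1]
print(best_threshold(glucose, diabetes))
```

A full tree repeats this search over every variable at every node and keeps the variable whose best split lowers impurity the most, which is why only some variables appear in the final tree.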
Figure 10 Confusion Matrix, Sensitivity, and Specificity Testing for the Train Partition Tree
Figure 11 Confusion Matrix, Sensitivity, and Specificity Testing for the Test Partition Tree
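Sensitivity and specificity follow directly from the four cells of a confusion matrix: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The Python sketch below shows the calculation on invented labels; the counts are hypothetical and are not the ones behind Figures 10 and 11, which are not reproduced in the text.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives, with 1 meaning 'has diabetes'."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    return tp, fn, tn, fp

def sensitivity(tp, fn):
    """Share of actual positives the model identifies."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual negatives the model identifies."""
    return tn / (tn + fp)

# Invented labels for illustration only.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

tp, fn, tn, fp = confusion_counts(actual, predicted)
print("sensitivity:", sensitivity(tp, fn))
print("specificity:", round(specificity(tn, fp), 2))
```

For a screening use case, sensitivity is often the more important of the two, because a false negative leaves a patient with diabetes undetected.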
CONCLUSION
Medical professionals may find the outcome of this study useful because data mining can be used to predict diabetes. The symptoms of diabetes mellitus are known, but they appear slowly, after the disease has been developing for years. Treating the patient prior to the onset of symptoms may help improve the patient’s prognosis. The variables in the data set may be used to determine which patients have diabetes mellitus. The data analysis may provide a foundation or reference for future studies. Diabetes is a pandemic with many complex contributing factors that require analysis. Minority populations are hit hard and do not have the resources available to effectively combat the disease. A fresh perspective and an alternative form of analysis may provide new insights or lend credibility to previous analytical studies of the Pima Indians Diabetes data set. The classification tree model was useful in determining which variables in the data set predict diabetes. Insulin levels, age, pregnancy, and glucose levels may be used to determine the risk of developing type 2 diabetes mellitus.
RECOMMENDATIONS
Further research using data mining techniques may be valuable in providing more evidence that diabetes is predictable. Comparison models, different partition parameters, and accuracy tests may be used to explore the Pima Indians Diabetes data set. In the future, an alternative approach and a different data set may be used to add to the overall research. For example, a data set containing metabolite biomarkers could explore the topic at a more intricate level using data analysis (Long et al., 2020). There are many variables associated with diabetes that merit continued investigation using machine learning, artificial intelligence, and other forms of analysis (Luo et al., 2023).
REFERENCES
Abbasi, A., Sahlqvist, A.-S., Lotta, L., Brosnan, J. M., Vollenweider, P., Giabbanelli, P., Nunez, D. J., Waterworth, D., Scott, R. A., Langenberg, C., & Wareham, N. J. (2016). A Systematic Review of Biomarkers and Risk of Incident Type 2 Diabetes: An Overview of Epidemiological, Prediction and Aetiological Research Literature. PLoS ONE, 11(10), 1–20. https://doi.org/10.1371/journal.pone.0163721
Anonymous. (2002). Reduction in the incidence of type 2 diabetes with lifestyle intervention or metformin. The New England Journal of Medicine, 346(6), 393–403. https://csuglobal.idm.oclc.org/login?qurl=https%3A%2F%2Fwww.proquest.com%2Fsc arlyjournals%2Freduction-incidence-type- diabeteswith%2Fdocview%2F223937844%2Fse2%3Faccountid%3D38569
Chang, V., Bailey, J., Xu, Q. A., & Sun, Z. (2022). Pima Indians diabetes mellitus classification based on machine learning (ML) algorithms. Neural Computing & Applications, 1–17. Advance online publication. https://doi.org/10.1007/s00521-022-07049-z
Deepa, K., & Ranjeeth Kumar, C. (2023). Early diagnosis of diabetes mellitus using data mining and machine learning techniques. Journal of Intelligent & Fuzzy Systems, 44(3), 3999–4011. https://doi.org/10.3233/JIFS-222574
Kaggle. (2016). Pima Indians Diabetes Data Set. Retrieved from https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database?resource=download
Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., & Chouvarda, I. (2017). Machine Learning and Data Mining Methods in Diabetes Research. Computational and Structural Biotechnology Journal, 15, 104–116. https://doi.org/10.1016/j.csbj.2016.12.005
Klandorf, H., & Stark, S. W. (2022). Diabetes mellitus. Magill’s Medical Guide (Online Edition). https://eds/s-ebscohost.com.csuglobal.idm.oclc.org/eds/detail/detail?vid=1&sid= 3d6f0d06-8b3d-402aa0a7980652984094%40redis&bdata=JnNpdGU9ZWRzLWxpdmU%3d#AN= 86194054&db=ers
Long, J., Yang, Z., Wang, L., Han, Y., Peng, C., Yan, C., & Yan, D. (2020). Metabolite biomarkers of type 2 diabetes mellitus and pre-diabetes: a systematic review and meta-analysis. BMC Endocrine Disorders, 20(1), 174. https://doi.org/10.1186/s12902-020-00653-x
Luo, X., Sun, J., Pan, H., Zhou, D., Huang, P., Tang, J., Shi, R., Ye, H., Zhao, Y., & Zhang, A. (2023). Establishment and health management application of a prediction model for high-risk complication combination of type 2 diabetes mellitus based on data mining. PLoS ONE, 17(8), 1–18. https://doi.org/10.1371/journal.pone.0289749
R-Project. (n.d.). Pima Indians Diabetes Database. R Documentation. Retrieved from https://search.r-project.org/CRAN/refmans/mlbench/html/PimaIndiansDiabetes.html
Vital Signs. (2018). Native Americans with Diabetes: Better diabetes care can decrease kidney disease. Centers for Disease Control and Prevention. Retrieved from https://www.cdc.gov/vitalsigns/aian-diabetes/index.html
World Health Organization. (2016). Global Report on Diabetes. https://www.who.int/publications/i/item/9789241565257
Zhang, L., & Liu, M. (2022). Analysis of Diabetes Disease Risk Prediction and Diabetes Medication Pattern Based on Data Mining. Computational & Mathematical Methods in Medicine, 1–9. https://doi.org/10.1155/2022/2665339
The Pima Indians Diabetes data set was explored using R, SAS, and Tableau. A classification tree model was created using RStudio.