sapphirachan / FrancoSapphiraAdvDA


3.3 Data Pre-processing #14

Open sapphirachan opened 5 years ago

flfguerrero commented 5 years ago

As we will be using classification rules to analyse our dataset going forward, we have chosen the binary variable 'class' as the target. This variable indicates whether the toddler displays ASD behavioural traits, based on the answers chosen.

Using this variable forces us to forfeit the 'score' variable (a numeric score for the test taken). This score has a threshold which, if exceeded, classifies a toddler as having ASD symptoms. It therefore repeats the information already in the class variable, and needs to be removed to prevent overfitting our model.
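A minimal sketch of this step in pandas, assuming the columns are literally named `class` and `score` (the real file may use different names). The small DataFrame and the `A1`/`A2` answer columns are made-up stand-ins for the toddler dataset, not real records.

```python
import pandas as pd

# Toy stand-in for the dataset; column names and values are assumptions.
df = pd.DataFrame({
    "A1": [1, 0, 1, 0],
    "A2": [1, 0, 1, 1],
    "score": [7, 2, 8, 3],
    "class": ["Yes", "No", "Yes", "No"],
})

# Keep the target separately, then drop both the target and the
# score column so the score cannot give the class away.
y = df["class"]
X = df.drop(columns=["class", "score"])

print(X.columns.tolist())
```

Dropping the score before any model is fitted keeps the rules from simply rediscovering the threshold.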

sapphirachan commented 5 years ago

We need to ensure that the values for each attribute are defined with the correct data type.
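One way this could look in pandas, as a rough sketch: numeric attributes are parsed as numbers and the remaining ones are stored as categories. The column names (`A1`, `age_months`, `sex`, `class`) and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy data where every value was read in as text; names are assumptions.
df = pd.DataFrame({
    "A1": ["1", "0", "1"],
    "age_months": ["24", "30", "18"],
    "sex": ["m", "f", "m"],
    "class": ["Yes", "No", "Yes"],
})

numeric_cols = ["A1", "age_months"]
categorical_cols = ["sex", "class"]

typed = df.copy()
for col in numeric_cols:
    # Parse text into numbers; raises if a value is not numeric.
    typed[col] = pd.to_numeric(typed[col])
for col in categorical_cols:
    # Store a fixed set of labels as a categorical type.
    typed[col] = typed[col].astype("category")

print(typed.dtypes)
```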

sapphirachan commented 5 years ago

Data pre-processing is required to ensure that outliers, missing values and other inconsistencies in the raw data do not reduce the quality of the output. It also involves data transformation, data integration and data reduction.
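A quick check for missing values and implausible entries could be sketched as follows. The column names, the toy values and the 12 to 36 month age range are all assumptions made for this example, not facts taken from the dataset.

```python
import pandas as pd

# Toy data with one missing answer and one implausible age; names are assumptions.
df = pd.DataFrame({
    "A1": [1, 0, None, 1],
    "age_months": [24, 30, 18, 400],
    "class": ["Yes", "No", "Yes", "No"],
})

# Count missing values in each column.
missing_per_column = df.isna().sum()

# Flag rows whose age falls outside an assumed plausible range (inclusive).
out_of_range = df[~df["age_months"].between(12, 36)]

print(missing_per_column)
print(out_of_range)
```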

Data Transformation

Data transformation is the process of converting data to another format or structure for the purpose of analysis. We received the data in CSV format, which is common among data analysis tools, so we do not foresee the need to convert it into any other format.

Data Integration

Data integration is the process of combining datasets gathered from different sources and formats into one common dataset to facilitate data analysis. In this case, the dataset consists of a single file, and hence no data integration is necessary.

Data Reduction

Data reduction is the process of reducing the size of the dataset. As the full dataset contains only 1054 instances, it will not be necessary to reduce the number of instances, as this is well within the computational capabilities of the data analysis tools we will be using. However, to improve the output of the data analysis, we will perform dimensionality reduction, using feature selection methods to trim down the number of attributes.
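This comment does not say which feature selection method will be used. As one illustrative possibility, a simple filter approach ranks each attribute by the absolute correlation between its values and the encoded class, then keeps the top `k`. The attribute names, the toy values and `k = 2` are assumptions for this sketch.

```python
import pandas as pd

# Toy binary attributes and target; names and values are assumptions.
X = pd.DataFrame({
    "A1": [1, 0, 1, 0, 1, 0, 1, 0],
    "A2": [1, 0, 1, 0, 1, 0, 0, 1],
    "A3": [1, 1, 0, 0, 1, 1, 0, 0],
})
y = pd.Series(["Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No"], name="class")

# Encode the class as 1/0 so it can be correlated with each attribute.
y_num = (y == "Yes").astype(int)

# Rank attributes by absolute Pearson correlation with the class.
scores = X.corrwith(y_num).abs().sort_values(ascending=False)

k = 2  # number of attributes to keep (arbitrary for this sketch)
selected = scores.head(k).index.tolist()
X_reduced = X[selected]

print(scores)
print(selected)
```

Other filter, wrapper or embedded methods would slot into the same place; the correlation ranking is only the simplest one to show.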