sapphirachan / FrancoSapphiraAdvDA


3.4 Feature Analysis #15

Open sapphirachan opened 6 years ago

sapphirachan commented 6 years ago

Write a description of Logistic Regression (Sapphira) and mRMR (Franco)

sapphirachan commented 6 years ago

As mentioned earlier, we have decided to use Logistic Regression and mRMR for the feature selection stage. This will be followed by classification of the two outputs from the feature selection methods with Support Vector Machine and Naïve Bayes.

Feature Selection: Why did we select Logistic Regression & mRMR?

Feature selection refers to the process of selecting the independent attributes that have the most significant impact on the dependent attribute; these are therefore also the most statistically influential attributes.

The purpose of feature selection is to make the model efficient, to reduce redundancies that could affect the decision-making process, and, most importantly, to increase the accuracy of the model.


Figure: Attributes & Class of Dataset’s Input & Output

Logistic Regression Logistic regression is one of the most frequently used methods for binary classification in machine learning, i.e. it is used when the outcome is discrete, taking only one of two values. It is a simple method that can be applied relatively easily.

How does it work? Logistic regression works by measuring the relationship between the independent variables and the dependent variable of an instance. This measure is expressed in terms of probability. In this dataset, the dependent variable is the output that we are trying to predict, which is the “Yes” or “No” value to the questions on whether or not the child has ASD.
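The probability measure described above can be sketched in a few lines. This is a pure-Python illustration; the weights, bias, and screening answers below are made up for the example, not fitted from our dataset:

```python
import math

def sigmoid(z):
    # Squash a linear combination of the inputs into a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    # Measure the relationship between the independent variables and the
    # dependent variable as a probability
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for three binary screening answers
p = predict_proba([1.2, -0.4, 0.8], -0.5, [1, 0, 1])
# A probability above 0.5 would be read as a "Yes" prediction for ASD
```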

Preparation for Logistic Regression

To prepare the dataset for logistic regression, there is no need to scale; however, the removal of non-discrete attributes is required. As the full dataset contains attributes that are not discrete in nature, we will remove these attributes from the dataset. This is “Age of toddler”, which is continuous, containing values between 12 and 36 months.

Prior preparation includes the removal of the “Score” attribute, as identified earlier; removal is necessary to avoid repetition of the data, which can result in overfitting.
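The two preparation steps above amount to dropping a continuous column and a redundant one. A minimal pandas sketch, using a toy stand-in for the dataset (the column names here are illustrative, not the exact headers of our file):

```python
import pandas as pd

# Toy stand-in for the toddler screening dataset
df = pd.DataFrame({
    "A1": [1, 0, 1],
    "A2": [0, 0, 1],
    "Age_of_toddler": [18, 24, 30],  # continuous, 12-36 months
    "Score": [1, 0, 2],              # repeats the answer columns
    "Class": ["Yes", "No", "Yes"],
})

# Drop the continuous attribute and the redundant Score column
prepared = df.drop(columns=["Age_of_toddler", "Score"])
```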

Limitations of Logistic Regression

A limitation of the logistic regression method is its inability to solve non-linear problems; as such, it can only be used at the feature selection stage. Thereafter, we will need other methods, such as decision trees, to tackle the non-linear aspects of the dataset.

flfguerrero commented 6 years ago

Keywords: greedy search, objective function, relevancy, redundancy

mRMR (Minimum Redundancy and Maximum Relevance) feature selection is an approach proposed by Peng et al. mRMR is a machine learning algorithm that outputs a set of selected features with a high correlation with the class variable (maximum relevance) and a low correlation between themselves (minimum redundancy). This is calculated through the F-statistic (for continuous variables) or mutual information (for discrete variables) on the provided dataset.

Given the variables present within our dataset, we would need to utilise mRMR to identify the mutual information between our variables, to measure the level of similarity between them.

These variables are selected through a greedy search algorithm to maximise the objective functions. These functions are the MIQ (Mutual Information Quotient) and the MID (Mutual Information Difference) and are used to represent the quotient and difference of relevance and redundancy, respectively.
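The greedy search on the MID objective can be sketched directly: at each step, pick the remaining feature whose relevance (mutual information with the class) minus mean redundancy (mutual information with the already-selected features) is highest. This is a simplified illustration of the MID criterion, not Peng et al.'s reference implementation, and the feature values are made up:

```python
import math
from collections import Counter

def mutual_info(x, y):
    # Empirical mutual information between two discrete sequences (in nats)
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def mrmr_mid(features, labels, k):
    # Greedy search maximising the MID objective: relevance - mean redundancy
    selected = []
    while len(selected) < k:
        best, best_score = None, -float("inf")
        for name, values in features.items():
            if name in selected:
                continue
            relevance = mutual_info(values, labels)
            redundancy = (sum(mutual_info(values, features[s]) for s in selected)
                          / len(selected)) if selected else 0.0
            if relevance - redundancy > best_score:
                best, best_score = name, relevance - redundancy
        selected.append(best)
    return selected

# Made-up binary features: A1 is informative, A2 duplicates A1 (redundant),
# A3 is weakly informative but not redundant with A1
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "A1": [1, 1, 1, 0, 0, 0, 0, 0],
    "A2": [1, 1, 1, 0, 0, 0, 0, 0],
    "A3": [1, 1, 0, 1, 0, 1, 0, 0],
}
picked = mrmr_mid(features, labels, k=2)
```

Note that the redundancy penalty makes the search prefer the weakly informative A3 over A2, even though A2 has higher relevance on its own.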


http://ranger.uta.edu/~chqding/papers/gene_select.pdf

https://ac.els-cdn.com/S2212017313004866/1-s2.0-S2212017313004866-main.pdf?_tid=364bb185-0b62-483b-918e-c7df08cc6739&acdnat=1533649127_43171eafed7469357ae8eae46f11dd8f

sapphirachan commented 6 years ago

Compare & Contrast Logistic Regression & mRMR

sapphirachan commented 6 years ago

Classification: Why did we select Support Vector Machine and Naïve Bayes?

Support Vector Machine Works by finding the decision boundary that maximises the margin, i.e. the minimum distance between the boundary and the nearest points of the two classes, based on the training dataset.

Naïve Bayes It is based on probability, where each attribute’s value contributes independently to the probability of the output, disregarding any possible correlation between the values of these attributes. It is highly competitive with SVM and is commonly used in medical diagnosis, which aligns with the content of our dataset. Additionally, it does not require the long processing times of iterative methods.

Comparison of Support Vector Machine & Naïve Bayes methods SVM is able to produce non-linear classification (via kernel functions), whereas Naïve Bayes is only able to classify linearly, i.e. with straight-line decision boundaries.
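The contrast above can be demonstrated on XOR, the textbook example of a problem with no linear decision boundary (a scikit-learn sketch on toy data, not our dataset):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# XOR: no straight line separates the two classes
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# An RBF-kernel SVM can bend its boundary around the XOR pattern
svm_acc = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y).score(X, y)

# Naive Bayes sees identical per-class means and variances here,
# so it cannot tell the two classes apart
nb_acc = GaussianNB().fit(X, y).score(X, y)
```

Here the kernel SVM fits the training points perfectly while Naïve Bayes does no better than chance, which is the non-linearity gap described above.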