How EDA can inform the data cleaning process

oldoc63 commented 1 year ago

One of the most challenging parts of data cleaning is diagnosing data issues and figuring out HOW to address them most effectively. Exploratory data analysis (EDA) can be an extremely useful tool for this. We'll walk through an example dataset to demonstrate how EDA can inform the initial data inspection, cleaning, and validation process.

Every dataset is different, and therefore will require different exploration. EDA is all about following the data, verifying your assumptions, and investigating anything that is unexpected.

oldoc63 commented 1 year ago

Initial Data Inspection

Before analysis or cleaning, it is useful to print a few rows of data. This helps ensure that the data is properly loaded. It also allows us to compare the observed data to the data dictionary and determine whether the coding appears to match our expectations. For example, let's load and inspect the first few rows of a dataset of heart disease patients (downloaded from the UCI Machine Learning Repository).
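
As a minimal sketch of this step, the snippet below loads a handful of rows and prints them. The rows are invented for illustration (they are not records from the UCI file), and only a few of the dataset's columns are included. With the real data, the load would be something like `pd.read_csv("heart.csv")`, where the file name is an assumption.

```python
import io
import pandas as pd

# Invented rows for illustration only; NOT actual records from the UCI dataset.
SAMPLE_CSV = """age,sex,cp,ca,thal,num
52,1,1,0.0,3.0,0
61,0,4,2.0,7.0,2
45,1,3,?,3.0,0
58,0,2,1.0,?,1
39,1,4,0.0,6.0,3
"""

# Read the sample as if it were a CSV file on disk.
heart = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Print the first few rows to confirm the data loaded as expected.
print(heart.head())
```

Printing the rows makes it easy to compare each column against the data dictionary before doing anything else.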

oldoc63 commented 1 year ago

There are a few things we might want to inspect. For example, the data dictionary gives the following information about the cp column: cp = chest pain type

Value 1: typical angina
Value 2: atypical angina
Value 3: non-anginal pain
Value 4: asymptomatic

Based on this information, it's not necessarily clear whether the data is going to be coded as numerical values (e.g., 1, 2, 3, or 4) or as strings (e.g., 'typical angina'). Data inspection allows us to clarify that this column contains numerical values.

Similarly, there is some conflicting information in the data dictionary about the target column (coded as num). The list of features contains the following information about this column: num = diagnosis of heart disease (angiographic disease status)

Value 0: < 50% diameter narrowing
Value 1: > 50% diameter narrowing

However, the initial data description suggests that the target field is integer valued from 0-4, where 0 indicates no heart disease.

By inspecting the first few rows of data, we see at least one instance of the value 2 in the num column. This suggests that the values probably range from 0-4 instead of just 0-1. We could verify this with further exploration (e.g., by using heart.num.value_counts() to get a table of values in this column).
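
A quick sketch of that check, using invented target values rather than the real distribution in the dataset:

```python
import pandas as pd

# Invented target values for illustration; not the real distribution in the dataset.
heart = pd.DataFrame({"num": [0, 2, 0, 1, 3, 0, 1, 4]})

# Count how often each coded value appears, ordered by value rather than by frequency.
counts = heart["num"].value_counts().sort_index()
print(counts)

# Check that every observed code falls inside the documented 0-4 range.
all_in_range = heart["num"].between(0, 4).all()
print(all_in_range)
```

If the table shows values beyond 0 and 1, the 0-4 description is the one that matches the data.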

oldoc63 commented 1 year ago

Data Information

Once we've taken a first look at some data, a common next step is to address questions such as:

How many (non-null) observations do we have?
How many unique columns/features do we have?
Which columns (if any) contain missing data?
What is the data type of each column?

Using pandas, we can easily address these questions using the .info() method.
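
The call itself is a one-liner. The sketch below runs it on a tiny invented sample, so the row and column counts it reports will differ from the full dataset discussed in this thread:

```python
import io
import pandas as pd

# Invented rows for illustration only; NOT actual records from the UCI dataset.
SAMPLE_CSV = """age,sex,cp,ca,thal,num
52,1,1,0.0,3.0,0
61,0,4,2.0,7.0,2
45,1,3,?,3.0,0
58,0,2,1.0,?,1
39,1,4,0.0,6.0,3
"""
heart = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Summary of row count, column names, non-null counts, and data types.
heart.info()

# The same facts are also available piece by piece.
print(heart.shape)           # (rows, columns)
print(heart.dtypes)          # data type of each column
print(heart.isnull().sum())  # number of missing values per column
```

Note that `.info()` prints its summary directly and returns `None`, so there is no need to wrap it in `print()`.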

oldoc63 commented 1 year ago

There are a few interesting pieces of information that we can glean from this output:

There are 303 rows and 14 columns of data
At first glance, there are no null (i.e., missing) values in any column.
The ca and thal columns have a data type of object (which suggests that they are strings), even though we saw in our initial inspection that these columns appear to contain numerical values.

To investigate the unexpected output here, we might want to take a look at the unique values in the ca column:
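
One way to do this, sketched on invented values that mimic a numeric column read in as strings:

```python
import pandas as pd

# Invented values for illustration; a numeric column that was read in as strings.
heart = pd.DataFrame({"ca": ["0.0", "3.0", "2.0", "?", "1.0", "0.0"]})

# List the distinct values that appear in the column.
unique_values = heart["ca"].unique()
print(unique_values)

# A frequency table also shows how often each value occurs.
print(heart["ca"].value_counts())
```

Any entry that does not look like a number stands out immediately in a list this short.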

oldoc63 commented 1 year ago

We note that at least one row contains a '?' in this column. We can probably assume that this indicates mis-coded missing data. The '?' also probably forced the column to be coded as a string because there is no obvious way to cast a '?' to a numerical value.

Given this information, we can replace any instance of '?' with np.nan, change the data type of this column back to a float or integer, and then re-run heart.info() to determine how many missing values we have. Then, we probably want to do a similar inspection of the thal column.
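
A minimal sketch of that cleanup, applied to both columns at once on invented values. It casts to float rather than integer because a plain integer column cannot hold NaN:

```python
import numpy as np
import pandas as pd

# Invented values for illustration; '?' stands in for mis-coded missing data.
heart = pd.DataFrame({
    "ca": ["0.0", "3.0", "?", "1.0"],
    "thal": ["3.0", "?", "7.0", "6.0"],
})

for col in ["ca", "thal"]:
    # Swap the '?' placeholder for a real missing value, then cast to float.
    heart[col] = heart[col].replace("?", np.nan).astype(float)

# Re-check data types and non-null counts after the conversion.
heart.info()
print(heart.isnull().sum())
```

An alternative is `pd.to_numeric(heart[col], errors="coerce")`, which turns anything unparseable into NaN. That is more convenient, but it also silently hides unexpected values other than '?'.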

oldoc63 commented 1 year ago

After identifying that there is some missing data and converting it to a format that Python can recognize, it's often a good idea to take a closer look at those rows. Sometimes, we can find clues as to WHY the data is missing, which can help us make decisions about whether to get rid of the rows altogether or impute the missing values somehow.
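
For instance, a boolean mask can pull out every row that is missing either value. The data below is invented for illustration:

```python
import numpy as np
import pandas as pd

# Invented rows for illustration; ca and thal each have one missing value.
heart = pd.DataFrame({
    "age": [52, 61, 45, 58, 39],
    "cp": [1, 4, 3, 2, 4],
    "ca": [0.0, 2.0, np.nan, 1.0, 0.0],
    "thal": [3.0, 7.0, 3.0, np.nan, 6.0],
    "num": [0, 2, 0, 1, 3],
})

# Keep only the rows where ca or thal is missing.
missing_rows = heart[heart["ca"].isnull() | heart["thal"].isnull()]
print(missing_rows)
```

Seeing the full rows side by side makes it easier to spot patterns, such as missing values clustering in a particular age group or diagnosis.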

oldoc63 commented 1 year ago

Looking at this output, we note that there is no overlap between the rows with missing ca data and the rows with missing thal data. This suggests that these patients are missing ca and thal information for different reasons. We don't see any immediate clues as to why the data is missing in the first place, but we can inspect this further once we start digging into individual features.
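
The overlap can also be checked directly rather than by eye. This is a sketch on invented data, in which no row is missing both values:

```python
import numpy as np
import pandas as pd

# Invented rows for illustration; no row is missing both ca and thal.
heart = pd.DataFrame({
    "ca": [0.0, 2.0, np.nan, 1.0, 0.0],
    "thal": [3.0, 7.0, 3.0, np.nan, 6.0],
})

ca_missing = heart["ca"].isnull()
thal_missing = heart["thal"].isnull()

# Number of rows where both columns are missing at the same time.
both_missing = int((ca_missing & thal_missing).sum())
print(both_missing)

# Cross-tabulate the two missingness indicators for a fuller picture.
print(pd.crosstab(ca_missing, thal_missing))
```

A count of zero for rows missing both values matches the "no overlap" observation above.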

oldoc63 / learningDS

How EDA can inform the data cleaning process #399
