oldoc63 / learningDS

Learning DS with Codecademy and Books

A brief overview of the data wrangling process and tidy data #558

oldoc63 opened 1 year ago

oldoc63 commented 1 year ago

Frequently when we work with data, we encounter unstructured and/or messy data. Messy data can take a variety of forms: for example, duplicate rows, inconsistent data types, missing values, or a structure that is not tidy.

While the data may be messy, it is still extremely informative. We need to clean, transform, and sometimes restructure the data to gain any insights. This process is often called data wrangling or data munging. At the end of the data wrangling process, we will have a dataset that we can easily use for modeling or visualization: a tidy dataset, where each column is a variable and each row is an observation.

oldoc63 commented 1 year ago

Data Wrangling

Let’s look at a subset of restaurant inspections from the New York City Department of Health and Mental Hygiene (NYC DOHMH) to work through some data wrangling processes. The data includes seven different columns with information about a restaurant’s location and health inspection. Here is a description of the dataset’s variables.

oldoc63 commented 1 year ago

Preliminary data cleaning

There are duplicate rows for the restaurant labeled Seamore's. To remove any duplicate rows, we can use the drop_duplicates() method.
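A minimal sketch of this step on a toy DataFrame (the column names and second restaurant here are illustrative, not the actual NYC dataset):

```python
import pandas as pd

# Toy data with a duplicated row for Seamore's
restaurants = pd.DataFrame({
    "name": ["Seamore's", "Seamore's", "Another Restaurant"],
    "boro": ["Manhattan", "Manhattan", "Manhattan"],
})

# drop_duplicates() keeps the first occurrence of each fully
# identical row and removes the repeats
deduped = restaurants.drop_duplicates()
print(len(deduped))  # 2
```

By default drop_duplicates() compares all columns; a subset= argument can restrict the comparison to specific columns if only some should define a duplicate.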

oldoc63 commented 1 year ago

Data Types

Let’s take a look at each column’s data type by accessing the .dtypes attribute of our pandas DataFrame.
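For example, on a toy DataFrame (columns here are illustrative stand-ins for the inspection data):

```python
import pandas as pd

restaurants = pd.DataFrame({
    "name": ["Seamore's", "Another Restaurant"],  # text -> object
    "score": [12.0, 9.0],                         # continuous -> float64
})

# .dtypes is an attribute (no parentheses) listing each column's type
print(restaurants.dtypes)
```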

oldoc63 commented 1 year ago

Since we have both continuous (float64) and categorical (object) variables in our data, it might be informative to look at the number of unique values in each column using the nunique() method.
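A quick sketch on toy data (column names assumed for illustration):

```python
import pandas as pd

restaurants = pd.DataFrame({
    "boro": ["Manhattan", "Brooklyn", "Manhattan"],
    "grade": ["A", "A", "B"],
})

# nunique() reports the count of distinct values per column
print(restaurants.nunique())
```

A categorical column with few unique values relative to the number of rows (like boro here) is a good candidate for grouping or cross-tabulation later on.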

oldoc63 commented 1 year ago

Missing Data

To count missing values, we can use isna() to identify whether each value is missing. This returns a boolean for every observation: True if the value in that column is missing and False if it is not. Chaining sum() then counts the missing values per column, since it adds up the entries where isna() returned True.
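A minimal sketch, using a hypothetical url column like the one discussed below:

```python
import pandas as pd
import numpy as np

restaurants = pd.DataFrame({
    "url": ["https://www.example.com", np.nan, np.nan],
    "score": [12.0, 9.0, np.nan],
})

# isna() gives a boolean DataFrame; sum() counts the True values
# (missing entries) column by column
missing_counts = restaurants.isna().sum()
print(missing_counts)  # url: 2, score: 1
```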

oldoc63 commented 1 year ago

Characterizing missingness with crosstab

Let’s try to understand the missingness in the url column by counting the missing values across each borough. We will use the crosstab() function in pandas to do this.

The crosstab() function computes the frequency of two or more variables. To look at the missingness in the url column, we can call isna() on it to flag whether each value is NaN. This returns a boolean: True if there is a NaN and False if there is not. In our crosstab, we will look at all the boroughs present in our data and whether or not they have missing url links.
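A sketch of that crosstab on toy data (the boro and url column names are assumptions about the dataset's schema):

```python
import pandas as pd
import numpy as np

restaurants = pd.DataFrame({
    "boro": ["Manhattan", "Manhattan", "Brooklyn", "Queens"],
    "url": ["https://a.example.com", np.nan, np.nan, "https://b.example.com"],
})

# Rows: boroughs; columns: whether url is missing (True) or present (False)
table = pd.crosstab(restaurants["boro"], restaurants["url"].isna())
print(table)
```

Each cell is a count, so a borough with a large True column relative to False has disproportionately many missing links, which is exactly the pattern we are probing for.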

oldoc63 commented 1 year ago

Removing prefixes
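This comment has no body here, but given the url column above, one common approach is stripping a leading scheme such as "https://" with a vectorized, anchored string replacement (the column name and prefix are assumptions for illustration):

```python
import pandas as pd

restaurants = pd.DataFrame({
    "url": ["https://www.example.com", "https://another.example.com"],
})

# Anchored regex replaces the prefix only at the start of each string
restaurants["url"] = restaurants["url"].str.replace("^https://", "", regex=True)
print(restaurants["url"].tolist())
```

The ^ anchor matters: without it, the pattern would also match an occurrence of "https://" appearing later in the string.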

oldoc63 commented 1 year ago

Tidy Data

Let’s take a look at a dataset that has information about the average annual wage for restaurant workers across the New York City boroughs, and New York City as a whole, for the years 2000 and 2007. The data is from the New York State Department of Labor, Quarterly Census of Employment and Wages, and only contains four total rows.

oldoc63 commented 1 year ago

There are three variables in this dataset: borough, year, and average annual income. However, we have values (2000 and 2007) in the column headers rather than variable names (year and average annual income). This is not ideal to work with, so let’s fix it! We will use the melt() function in pandas to turn the current values (2000 and 2007) in the column headers into row values and add year and avg_annual_wage as our column labels.
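The reshaping step above can be sketched like this (the wage figures are illustrative placeholders, not the Department of Labor values):

```python
import pandas as pd

# Untidy version: the year values sit in the column headers
wages = pd.DataFrame({
    "borough": ["Bronx", "Brooklyn"],
    "2000": [20000, 21000],
    "2007": [30000, 31000],
})

# melt() moves the year headers into a column of row values and
# gathers the wages into a single value column
tidy = pd.melt(
    wages,
    id_vars="borough",
    value_vars=["2000", "2007"],
    var_name="year",
    value_name="avg_annual_wage",
)
print(tidy)
```

After the melt, each row is one borough-year observation and each column is a single variable, which is the tidy shape described below.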

oldoc63 commented 1 year ago

Now we have a tidy dataset where each column is a variable (borough, year, or average annual wage), and each row is an observation! This dataset will be much easier to work with moving forward!

You now have the tools to continue to gather more information about New York City restaurants to answer questions you are interested in exploring. For example, we could explore what variables are most predictive of receiving a passing health score. We would need to gather some more data and go through the data wrangling process to end up with tidy data that is ready for analysis.