oldoc63 / learningDS

Learning DS with Codecademy and Books

Summarizing a Single Feature #392

Open oldoc63 opened 1 year ago

oldoc63 commented 1 year ago

Introduction

Before diving into formal analysis with a dataset, it is often helpful to perform some initial investigations of the data through exploratory data analysis (EDA) to get a better sense of what you will be working with. Basic summary statistics and visualizations are important components of EDA as they allow us to condense a large amount of information into a small set of numbers or graphics that can be easily interpreted.

This lesson focuses on univariate summaries, where we explore each variable separately. This is useful for answering questions about each individual feature. Variables can typically be classified as quantitative (i.e., numeric) or categorical (i.e., discrete). Depending on a variable's type, we may want to choose different summary metrics and visuals.

Let's say we have a dataset of New York City rental listings (a subset of the StreetEasy dataset) imported into a pandas DataFrame.

oldoc63 commented 1 year ago

This dataset has two quantitative variables (rent and size_sqft) and one categorical variable (borough). The pandas library offers a handy method, .describe(), for displaying some of the most common summary statistics for the columns in a DataFrame. By default, the result only includes numeric columns, but we can pass include='all' to the method to display categorical ones as well:
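A minimal sketch of the two calls described above. The DataFrame name `rentals` and every value in it are invented for illustration and are not taken from the StreetEasy data; only the column names (rent, size_sqft, borough) come from the text.

```python
import pandas as pd

# Hypothetical stand-in for the rental listings; all values are made up.
rentals = pd.DataFrame({
    "rent": [2550, 11500, 4500, 4795, 17500],
    "size_sqft": [480, 2000, 916, 975, 4800],
    "borough": ["Manhattan", "Manhattan", "Queens", "Manhattan", "Brooklyn"],
})

# Default behavior: only the numeric columns are summarized.
numeric_summary = rentals.describe()
print(numeric_summary)

# include='all' adds the categorical column, with unique/top/freq rows.
full_summary = rentals.describe(include="all")
print(full_summary)
```

In the combined table, statistics that do not apply to a column's type (for example, the mean of borough) show up as NaN.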

oldoc63 commented 1 year ago

This is a great way to get an overview of all the variables in a dataset. Notice how different statistics are displayed depending on the variable type.

oldoc63 commented 1 year ago

In script.py, we've imported a dataset containing information on the budget and earnings of movies from various genres into a DataFrame called movies.

Start by inspecting the first 5 rows of movies using the .head() method and print the result.

How many quantitative and categorical variables do you see?
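One possible way to approach this step, sketched on a small stand-in DataFrame. The column names (`genre`, `production_budget`, `worldwide_gross`) and all values are assumptions, since the actual contents of the lesson's movies dataset are not shown here.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's movies data; names and values are made up.
movies = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Action", "Drama", "Horror", "Drama"],
    "production_budget": [10, 25, 150, 8, 5, 30],
    "worldwide_gross": [40, 60, 500, 12, 45, 20],
})

# .head() returns the first 5 rows by default.
first_rows = movies.head()
print(first_rows)

# Use column dtypes as a rough guide to variable type.
quantitative = movies.select_dtypes(include="number").columns.tolist()
categorical = movies.select_dtypes(exclude="number").columns.tolist()
print("Quantitative:", quantitative)
print("Categorical:", categorical)
```

Dtypes are only a heuristic: a column stored as numbers (such as a ZIP code or an ID) can still be categorical in meaning, so it is worth checking the values by eye as well.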

oldoc63 commented 1 year ago

Use the .describe() method to display the summary statistics for movies and print the result. Make sure to show statistics for all columns in the DataFrame.

What kind of metrics are displayed for quantitative columns versus categorical columns?
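As a rough illustration of this comparison, the sketch below runs .describe(include='all') on a hypothetical movies DataFrame and lists which statistics are actually filled in for one column of each type. The column names and values are assumptions, not the lesson's real data.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's movies data; names and values are made up.
movies = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Action", "Drama", "Horror", "Drama"],
    "production_budget": [10, 25, 150, 8, 5, 30],
    "worldwide_gross": [40, 60, 500, 12, 45, 20],
})

# Summary statistics for every column, numeric and non-numeric.
summary = movies.describe(include="all")
print(summary)

# Statistics that are not NaN for a categorical vs. a quantitative column.
categorical_stats = summary["genre"].dropna().index.tolist()
quantitative_stats = summary["production_budget"].dropna().index.tolist()
print("Categorical:", categorical_stats)
print("Quantitative:", quantitative_stats)
```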