oldoc63 / learningDS

Learning DS with Codecademy and Books

Create a single number that describes a group of numbers #383

Open oldoc63 opened 1 year ago

oldoc63 commented 1 year ago

In this lesson, you will learn about aggregates in Pandas. An aggregate statistic is a way of creating a single number that describes a group of numbers. Common aggregate statistics include mean, median, and standard deviation.

You will also learn how to rearrange a DataFrame into a pivot table, which is a great way to compare data across two dimensions.

oldoc63 commented 1 year ago

Calculating column statistics

Aggregate functions summarize many data points (e.g., a column of a DataFrame) into a smaller set of values, often a single number.
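As a minimal sketch of column statistics, the example below applies a few aggregate methods to single columns. The orders DataFrame and its values are hypothetical, invented only for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
orders = pd.DataFrame({
    "shoe_type": ["sandals", "boots", "sandals", "boots"],
    "price": [20.0, 80.0, 30.0, 100.0],
})

# Each call collapses a whole column into one number.
mean_price = orders.price.mean()       # average price
median_price = orders.price.median()   # middle price
max_price = orders.price.max()         # most expensive order
n_types = orders.shoe_type.nunique()   # number of distinct shoe types

print(mean_price, median_price, max_price, n_types)
```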

oldoc63 commented 1 year ago

In general, we use the following syntax to calculate aggregates:

df.groupby('column1').column2.measurement()
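
A short sketch of this pattern, assuming a hypothetical orders DataFrame with shoe_type and price columns (the data is made up):

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
orders = pd.DataFrame({
    "shoe_type": ["sandals", "boots", "sandals", "boots"],
    "price": [20.0, 80.0, 30.0, 100.0],
})

# column1 = shoe_type, column2 = price, measurement = max
max_price_by_type = orders.groupby("shoe_type").price.max()

# The result is a Series indexed by shoe_type and named "price".
print(max_price_by_type)
```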

oldoc63 commented 1 year ago

After using groupby, we often need to clean our resulting data.

As we saw in the previous exercise, selecting a single column after groupby and applying an aggregate creates a new Series, not a DataFrame. For our example, the indices of the Series were the different values of shoe_type, and the name property was price.

Usually, we'd prefer that those indices were actually a column. In order to get that, we can use reset_index(). This will transform our Series into a DataFrame and move the indices into their own column.

You'll usually see a groupby statement followed by reset_index:

df.groupby('column1').column2.measurement().reset_index()
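
To make the effect of reset_index concrete, here is a small sketch on a hypothetical orders DataFrame (the data is made up for illustration):

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
orders = pd.DataFrame({
    "shoe_type": ["sandals", "boots", "sandals", "boots"],
    "price": [20.0, 80.0, 30.0, 100.0],
})

# Without reset_index: a Series whose index holds the shoe_type values.
as_series = orders.groupby("shoe_type").price.max()

# With reset_index: a DataFrame with shoe_type as an ordinary column.
as_frame = orders.groupby("shoe_type").price.max().reset_index()

print(type(as_series).__name__)   # Series
print(list(as_frame.columns))     # ['shoe_type', 'price']
```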

When we use groupby, we often want to rename the column we get as a result. For example, suppose we have a DataFrame teas containing data on types of tea:

oldoc63 commented 1 year ago

We want to find the number of each category of tea we sell:

oldoc63 commented 1 year ago

The new column contains the counts of each category of tea. However, this column is called id because we used the id column of teas to calculate the counts. We actually want to call this column counts:
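
The tea example can be sketched end to end as follows. The contents of teas here are hypothetical stand-ins (the original table is not reproduced in this thread); only the id column and the target name counts come from the text above, and the category column name is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the teas DataFrame.
teas = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "tea": ["earl grey", "english breakfast", "sencha", "chamomile", "mint"],
    "category": ["black", "black", "green", "herbal", "herbal"],
})

# Count the rows in each category; the result column is still called "id".
teas_counts = teas.groupby("category").id.count().reset_index()

# Rename the "id" column to "counts" so its name matches its meaning.
teas_counts = teas_counts.rename(columns={"id": "counts"})

print(teas_counts)
```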

oldoc63 commented 1 year ago

Modify your code from the previous exercise so that it ends with reset_index, which will change pricey_shoes into a DataFrame:

oldoc63 commented 1 year ago

Sometimes, the operation that you want to perform is more complicated than mean or count. In those cases, you can use the apply method and lambda functions, just like we did for individual column operations. Note that the input to our lambda function will be all of the group's values for that column (a pandas Series, which can be treated much like a list).

A great example of this is calculating percentiles. Let's return to the data from shoefly.csv. Our marketing team says that it's important to have some affordably priced shoes available for every color of shoe that we sell. Calculate the 25th percentile for shoe price for each shoe_color to help Marketing decide if we have enough cheap shoes on sale. Save the data to the variable cheap_shoes. Be sure to use reset_index() at the end of your query so that cheap_shoes is a DataFrame. Then display cheap_shoes using print.
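
One possible way to approach this is sketched below with np.percentile inside apply. Because shoefly.csv is not available in this thread, the orders data is a hypothetical stand-in; only the shoe_color column, the price column, and the cheap_shoes variable name come from the exercise text.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data in shoefly.csv.
orders = pd.DataFrame({
    "shoe_color": ["black"] * 5 + ["red"] * 3,
    "price": [10.0, 20.0, 30.0, 40.0, 50.0, 100.0, 200.0, 300.0],
})

# The lambda receives every price in one color group at a time
# and returns that group's 25th percentile.
cheap_shoes = (
    orders.groupby("shoe_color")
    .price.apply(lambda x: np.percentile(x, 25))
    .reset_index()
)

print(cheap_shoes)
```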

oldoc63 commented 1 year ago

Sometimes, we want to group by more than one column. We can easily do this by passing a list of column names into the groupby method.

Imagine that we run a chain of stores and have data about the number of sales at different locations on different days:

oldoc63 commented 1 year ago

We suspect that sales are different at different locations on different days of the week. In order to test this hypothesis, we could calculate the average sales for each store on each day of the week across multiple months:
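
A minimal sketch of grouping by two columns follows. The sales DataFrame, its column names (location, day_of_week, total_sales), and its values are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sales data; column names and values are assumptions.
sales = pd.DataFrame({
    "location": ["Store A", "Store A", "Store A", "Store B", "Store B"],
    "day_of_week": ["Mon", "Mon", "Tue", "Mon", "Mon"],
    "total_sales": [400, 600, 300, 100, 200],
})

# Passing a list of column names groups by every combination of them.
avg_sales = (
    sales.groupby(["location", "day_of_week"])
    .total_sales.mean()
    .reset_index()
)

# One row per (location, day_of_week) pair that appears in the data.
print(avg_sales)
```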

oldoc63 commented 1 year ago

At ShoeFly.com, our Purchasing team thinks that certain shoe_type / shoe_color combinations are particularly popular this year.

Create a DataFrame with the total number of shoes of each shoe_type / shoe_color combination purchased. Save it to the variable shoe_counts.

You should be able to do this using groupby and count().

When we're using count(), it doesn't really matter which column we perform the calculation on, as long as that column has no missing values (count() skips NaN entries).
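
A sketch of this exercise on a hypothetical orders DataFrame (the real ShoeFly data is not shown in this thread, so the rows and the id column are assumptions). It also illustrates one caveat: count() ignores missing values, so the choice of column only stops mattering when the column is complete.

```python
import numpy as np
import pandas as pd

# Hypothetical order data; one price is deliberately missing.
orders = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "shoe_type": ["boots", "boots", "sandals", "boots"],
    "shoe_color": ["black", "black", "red", "brown"],
    "price": [80.0, np.nan, 30.0, 90.0],
})

# Counting a fully populated column gives the number of rows per group.
shoe_counts = orders.groupby(["shoe_type", "shoe_color"]).id.count().reset_index()

# Counting a column with a missing value undercounts that group.
counts_via_price = (
    orders.groupby(["shoe_type", "shoe_color"]).price.count().reset_index()
)

print(shoe_counts)
print(counts_via_price)
```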