oldoc63 / learningDS

Learning DS with Codecademy and Books
0 stars 0 forks source link

Inner Merge #387

Open oldoc63 opened 1 year ago

oldoc63 commented 1 year ago

If we just look at the orders table, we can really tell what happened in each order. However, if we refer to the other tables, we can get a more complete picture. The process of matching info from to different tables is called merging. Maybe is easy to do this kind of matching for one row, but hard to do it for multiple rows. Pandas can efficiently do this for the entire table. We use the .merge() method.

The .merge() method looks for columns that are common between two DataFrames and then looks for rows where those column's values are the same. It then combines the matching rows into a single row in a new table.

We can call the pd.merge() method with two tables like this:

new_df = pd.merge(orders, customers)

This will match up all of the customer information to the orders that each customer made.

oldoc63 commented 1 year ago

You are an analyst at Cool T-Shirts Inc. You are going to help them analyze some of their sales data. There are two DataFrames defined:

Create a new DataFrame sales_vs_targets which contains the merge of sales and targets.

Select the rows from sales_vs_targets where revenue is greater than target. Save these rows to the variable crushing_it.

oldoc63 commented 1 year ago

In addition to using pd.merge(), each DataFrame has it own .merge() method. For instance, to merge orders with customers:

oldoc63 commented 1 year ago

We generally use this when we are joining more than two DataFrames together because we can chain the commands. The following command would merge orders to customers, and then the resulting DataFrame to products:

oldoc63 commented 1 year ago

We have some more data from Cool T-Shirts Inc. The number of men's and women's t-shirts sold per month is in a file called men_women_sales.csv. Load this data into a DataFrame called men_women. Merge all three DataFrames (sales, targets, and men_women) into one big DataFrame called all_data.

oldoc63 commented 1 year ago

Cool T-Shirts Inc. thinks that they have more revenue in months where they sell more women's t-shirts.

Select the rows of all_data where:

Save your answer in the variable results.

oldoc63 commented 1 year ago

Merge on Specific Columns

In the previous example, the .merge() function "knew" how to combine tables based on the columns that were the same between two tables. For instance, products and orders both had a column called product_id. This won't always be true when we want to perform a merge.

Generally, the products and customers DataFrames would not have the columns product_id or customer_id. Instead, they would both be called id and it would be implied that the id was the product_id for the products table and customer_id for the customers table.

How would this affect our merges? Because the id columns would mean something different in each table, our default merges would be wrong.

One way that we could address this problem is to use .rename() to rename the column for our merges. In the example below, we will rename the column id to customer_id, so that orders and customers have a common column for the merge.

oldoc63 commented 1 year ago

In the previous exercise, we learned how to use .rename() to merge two DataFrames whose columns don't macht.

If we don't want to do that, we have another option. We could use the keywords left_on and right_on to specify which columns we want to perform the merge on. In the example below, the "left" table is the one that comes first (orders), and the "right" table is the one that comes second (customers).

oldoc63 commented 1 year ago

If we use this syntax, we'll end up with two columns called id, one from the first table and one from the second. Pandas won't let you to have two columns with the same id, so it will change them. We can provide a list of suffixes to use instead.

oldoc63 commented 1 year ago

Mismatched Merges

In our examples, there were always matching values when we were performing our merges. What happens when that isn't true? Let's imagine that our products table is out of date and is missing the newest product: Product 5. What happens when someone orders it?

oldoc63 commented 1 year ago

Merge orders and products and save it to the variable merge_df. Inspect merge_df. What happened to order_id 3?