Open jackjindtnt opened 6 years ago
Excel “Filter and Edit” - Demonstrated in Pandas import pandas as pd sales = [('account', ['Jones LLC', 'Alpha Co', 'Blue Inc', 'Mega Corp']), ('Total Sales', [150, 200, 75, 300]), ('Country', ['US', 'UK', 'US', 'US'])] df = pd.DataFrame.from_items(sales) indices = [True, False, True, True] df[indices] df.Country == 'US' df[df["Country"] == 'US'] Selecting the Columns Now that we have figured out how to select rows of data, how can we control which columns to display? In the example above, there’s no obvious way to do that. Pandas can support this use case using three types of location based indexing: .loc , iloc , and .ix . These functions also allow us to select columns in addition to the row selection we have seen so far.
There is a lot of confusion about when to use .loc , iloc , or .ix . The quick summary of the difference is that:
.loc is used for label indexing .iloc is used for position based integers .ix is a shortcut that will try to use labels (like .loc ) but will fall back to position based integers (like .iloc ) So, the question is, which one should I use? I will profess that I get tripped up some times on this one too. I have found that I use .loc most frequently. Mainly because my data does not lend itself to meaningful position based indexing (in other words, I rarely find myself needing .iloc ) so I stick with .loc .
To be fair each of these methods do have their place and are useful in many situations. One area in particular is when dealing with MultiIndex DataFrames. I will not cover that topic in this article - maybe in a future post.
Now that we have covered this topic, let’s show how to filter a DataFrame on values in a row and select specific columns to display.
df.loc[[True, True, False, True], "account"] df.loc[[True, True, False, True], ["account", "Country"]] df.loc[df["Total Sales"] > 200, ["account", "Country"]] Editing Columns df["rate"] = 0.02 df.loc[df["Total Sales"] > 100, ["rate"]] = .05
======================================================== First, read in the Excel file and add a column with the 2% default rate: import pandas as pd df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-sales-reps.xlsx?raw=true") df["commission"] = .02 df.head() df.loc[df["category"] == "Shirt", ["commission"]] = .025 df.loc[(df["category"] == "Belt") & (df["quantity"] >= 10), ["commission"]] = .04 df.head() df["bonus"] = 0 df.loc[(df["category"] == "Shoes") & (df["ext price"] >= 1000 ), ["bonus", "commission"]] = 250, 0.045
df.ix[3:7]
df["comp"] = df["commission"] * df["ext price"] + df["bonus"]
df.groupby(["sales rep"])["comp"].sum().round(2)
Creating Advanced Excel Workbooks with Python
import pandas as pd
sales_df = pd.read_excel('https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=true') sales_summary = sales_df.groupby(['name'])['ext price'].agg(['sum', 'mean'])
sales_summary.reset_index(inplace=True) writer = pd.ExcelWriter('sales_summary.xlsx', engine='xlsxwriter') sales_summary.to_excel(writer, 'summary', index=False) writer.save()
def format_excel(writer): """ Add Excel specific formatting to the workbook """
workbook = writer.book
worksheet = writer.sheets['summary']
# Add currency formatting and apply it
money_fmt = workbook.add_format({'num_format': 42, 'align': 'center'})
worksheet.set_column('A:A', 20)
worksheet.set_column('B:C', 15, money_fmt)
worksheet.add_table('A1:C22', {'columns': [{'header': 'account',
'total_string': 'Total'},
{'header': 'Total Sales',
'total_function': 'sum'},
{'header': 'Average Sales',
'total_function': 'average'}],
'autofilter': False,
'total_row': True,
'style': 'Table Style Medium 20'})
Applying the function is straightforward:
sales_df = pd.read_excel('https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=true') sales_summary = sales_df.groupby(['name'])['ext price'].agg(['sum', 'mean'])
sales_summary.reset_index(inplace=True) writer = pd.ExcelWriter('sales_summary.xlsx', engine='xlsxwriter') sales_summary.to_excel(writer, 'summary', index=False) format_excel(writer) writer.save()
Learn More About Pandas By Building and Using a Weighted Average Function
import pandas as pd import numpy as np
sales = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sales-estimate.xlsx?raw=True", sheetname="projections") sales.head() If we want to determine a simple mean, we can use the built in functions to easily calculate it: sales["Current_Price"].mean() sales["New_Product_Price"].mean() In order to calculate a weighted average using the long approach:
(sales["Current_Price"] sales["Quantity"]).sum() / sales["Quantity"].sum() (sales["New_Product_Price"] sales["Quantity"]).sum() / sales["Quantity"].sum()
Grouping Data with the Weighted Average sales.groupby("Manager")["Current_Price"].mean()
def wavg(group, avg_name, weight_name): """ http://stackoverflow.com/questions/10951341/pandas-dataframe-aggregate-function-using-multiple-columns In rare instance, we may not have weights, so just return the mean. Customize this if your business case should return otherwise. """ d = group[avg_name] w = group[weight_name] try: return (d * w).sum() / w.sum() except ZeroDivisionError: return d.mean()
import pandas as pd import numpy as np from xlsxwriter.utility import xl_rowcol_to_cell df = pd.read_excel("excel-comp-datav2.xlsx") df.head() number_rows = len(df.index)
Add some summary data using the new assign functionality in pandas 0.16
df = df.assign(total=(df['Jan'] + df['Feb'] + df['Mar'])) df.head() We can also use assign to show how close accounts are towards their quota. df = df.assign(quota_pct=(1+(df['total'] - df['quota'])/df['quota'])) df.head() df["total"] = df["Jan"] + df["Feb"] + df["Mar"]
Save this data using the simple to_excel process.
writer_orig = pd.ExcelWriter('simple.xlsx', engine='xlsxwriter') df.to_excel(writer_orig, index=False, sheet_name='report') writer_orig.save()
We create another writer and use the to_excel to create our workbook. writer = pd.ExcelWriter('fancy.xlsx', engine='xlsxwriter') df.to_excel(writer, index=False, sheet_name='report') workbook = writer.book worksheet = writer.sheets['report'] worksheet.set_zoom(90)
Total formatting
total_fmt = workbook.add_format({'align': 'right', 'num_format': '$#,##0', 'bold': True, 'bottom':6})
Total percent format
total_percent_fmt = workbook.add_format({'align': 'right', 'num_format': '0.0%', 'bold': True, 'bottom':6})
Account info columns
worksheet.set_column('B:D', 20)
State column
worksheet.set_column('E:E', 5)
Post code
worksheet.set_column('F:F', 10)
Monthly columns
worksheet.set_column('G:K', 12, money_fmt)
Quota percent columns
worksheet.set_column('L:L', 12, percent_fmt)
Add total rows
for column in range(6, 11):
Determine where we will place the formula
Add a total label
worksheet.write_string(number_rows+1, 5, "Total",total_fmt) percent_formula = "=1+(K{0}-G{0})/G{0}".format(number_rows+2) worksheet.write_formula(number_rows+1, 11, percent_formula, total_percent_fmt)
Define our range for the color formatting
color_range = "L2:L{}".format(number_rows+1)
Define our range for the color formatting
color_range = "L2:L{}".format(number_rows+1)
Highlight the top 5 values in Green
worksheet.conditional_format(color_range, {'type': 'top', 'value': '5', 'format': format2})
Highlight the bottom 5 values in Red
worksheet.conditional_format(color_range, {'type': 'bottom', 'value': '5', 'format': format1}) writer.save()