jackjindtnt / KI

Ki
0 stars 0 forks source link

pandas #25

Open jackjindtnt opened 6 years ago

jackjindtnt commented 6 years ago

import pandas as pd import numpy as np from xlsxwriter.utility import xl_rowcol_to_cell df = pd.read_excel("excel-comp-datav2.xlsx") df.head() number_rows = len(df.index)

Add some summary data using the new assign functionality in pandas 0.16

df = df.assign(total=(df['Jan'] + df['Feb'] + df['Mar'])) df.head() We can also use assign to show how close accounts are towards their quota. df = df.assign(quota_pct=(1+(df['total'] - df['quota'])/df['quota'])) df.head() df["total"] = df["Jan"] + df["Feb"] + df["Mar"]

Save this data using the simple to_excel process.

writer_orig = pd.ExcelWriter('simple.xlsx', engine='xlsxwriter') df.to_excel(writer_orig, index=False, sheet_name='report') writer_orig.save()

We create another writer and use the to_excel to create our workbook. writer = pd.ExcelWriter('fancy.xlsx', engine='xlsxwriter') df.to_excel(writer, index=False, sheet_name='report') workbook = writer.book worksheet = writer.sheets['report'] worksheet.set_zoom(90)

Total formatting

total_fmt = workbook.add_format({'align': 'right', 'num_format': '$#,##0', 'bold': True, 'bottom':6})

Total percent format

total_percent_fmt = workbook.add_format({'align': 'right', 'num_format': '0.0%', 'bold': True, 'bottom':6})

Account info columns

worksheet.set_column('B:D', 20)

State column

worksheet.set_column('E:E', 5)

Post code

worksheet.set_column('F:F', 10)

Monthly columns

worksheet.set_column('G:K', 12, money_fmt)

Quota percent columns

worksheet.set_column('L:L', 12, percent_fmt)

Add total rows

for column in range(6, 11):

Determine where we will place the formula

cell_location = xl_rowcol_to_cell(number_rows+1, column)
# Get the range to use for the sum formula
start_range = xl_rowcol_to_cell(1, column)
end_range = xl_rowcol_to_cell(number_rows, column)
# Construct and write the formula
formula = "=SUM({:s}:{:s})".format(start_range, end_range)
worksheet.write_formula(cell_location, formula, total_fmt)

Add a total label

worksheet.write_string(number_rows+1, 5, "Total",total_fmt) percent_formula = "=1+(K{0}-G{0})/G{0}".format(number_rows+2) worksheet.write_formula(number_rows+1, 11, percent_formula, total_percent_fmt)

Define our range for the color formatting

color_range = "L2:L{}".format(number_rows+1)

Define our range for the color formatting

color_range = "L2:L{}".format(number_rows+1)

Highlight the top 5 values in Green

worksheet.conditional_format(color_range, {'type': 'top', 'value': '5', 'format': format2})

Highlight the bottom 5 values in Red

worksheet.conditional_format(color_range, {'type': 'bottom', 'value': '5', 'format': format1}) writer.save()

jackjindtnt commented 6 years ago

Excel “Filter and Edit” - Demonstrated in Pandas import pandas as pd sales = [('account', ['Jones LLC', 'Alpha Co', 'Blue Inc', 'Mega Corp']), ('Total Sales', [150, 200, 75, 300]), ('Country', ['US', 'UK', 'US', 'US'])] df = pd.DataFrame.from_items(sales) indices = [True, False, True, True] df[indices] df.Country == 'US' df[df["Country"] == 'US'] Selecting the Columns Now that we have figured out how to select rows of data, how can we control which columns to display? In the example above, there’s no obvious way to do that. Pandas can support this use case using three types of location based indexing: .loc , iloc , and .ix . These functions also allow us to select columns in addition to the row selection we have seen so far.

There is a lot of confusion about when to use .loc , iloc , or .ix . The quick summary of the difference is that:

.loc is used for label indexing .iloc is used for position based integers .ix is a shortcut that will try to use labels (like .loc ) but will fall back to position based integers (like .iloc ) So, the question is, which one should I use? I will profess that I get tripped up some times on this one too. I have found that I use .loc most frequently. Mainly because my data does not lend itself to meaningful position based indexing (in other words, I rarely find myself needing .iloc ) so I stick with .loc .

To be fair each of these methods do have their place and are useful in many situations. One area in particular is when dealing with MultiIndex DataFrames. I will not cover that topic in this article - maybe in a future post.

Now that we have covered this topic, let’s show how to filter a DataFrame on values in a row and select specific columns to display.

df.loc[[True, True, False, True], "account"] df.loc[[True, True, False, True], ["account", "Country"]] df.loc[df["Total Sales"] > 200, ["account", "Country"]] Editing Columns df["rate"] = 0.02 df.loc[df["Total Sales"] > 100, ["rate"]] = .05

======================================================== First, read in the Excel file and add a column with the 2% default rate: import pandas as pd df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-sales-reps.xlsx?raw=true") df["commission"] = .02 df.head() df.loc[df["category"] == "Shirt", ["commission"]] = .025 df.loc[(df["category"] == "Belt") & (df["quantity"] >= 10), ["commission"]] = .04 df.head() df["bonus"] = 0 df.loc[(df["category"] == "Shoes") & (df["ext price"] >= 1000 ), ["bonus", "commission"]] = 250, 0.045

Display a sample of rows that show this bonus

df.ix[3:7]

Calculate the compensation for each row

df["comp"] = df["commission"] * df["ext price"] + df["bonus"]

Summarize and round the results by sales rep

df.groupby(["sales rep"])["comp"].sum().round(2)

jackjindtnt commented 6 years ago

=============================== f = open("filename.txt", 'w') f.write(ab+"\n") f.close()

Creating Advanced Excel Workbooks with Python

import pandas as pd

sales_df = pd.read_excel('https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=true') sales_summary = sales_df.groupby(['name'])['ext price'].agg(['sum', 'mean'])

Reset the index for consistency when saving in Excel

sales_summary.reset_index(inplace=True) writer = pd.ExcelWriter('sales_summary.xlsx', engine='xlsxwriter') sales_summary.to_excel(writer, 'summary', index=False) writer.save()

def format_excel(writer): """ Add Excel specific formatting to the workbook """

Get the workbook and the summary sheet so we can add the formatting

workbook = writer.book
worksheet = writer.sheets['summary']
# Add currency formatting and apply it
money_fmt = workbook.add_format({'num_format': 42, 'align': 'center'})
worksheet.set_column('A:A', 20)
worksheet.set_column('B:C', 15, money_fmt)
worksheet.add_table('A1:C22', {'columns': [{'header': 'account',
                                            'total_string': 'Total'},
                                           {'header': 'Total Sales',
                                            'total_function': 'sum'},
                                           {'header': 'Average Sales',
                                            'total_function': 'average'}],
                               'autofilter': False,
                               'total_row': True,
                               'style': 'Table Style Medium 20'})

Applying the function is straightforward:

sales_df = pd.read_excel('https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=true') sales_summary = sales_df.groupby(['name'])['ext price'].agg(['sum', 'mean'])

Reset the index for consistency when saving in Excel

sales_summary.reset_index(inplace=True) writer = pd.ExcelWriter('sales_summary.xlsx', engine='xlsxwriter') sales_summary.to_excel(writer, 'summary', index=False) format_excel(writer) writer.save()

jackjindtnt commented 6 years ago

Learn More About Pandas By Building and Using a Weighted Average Function

import pandas as pd import numpy as np

sales = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sales-estimate.xlsx?raw=True", sheetname="projections") sales.head() If we want to determine a simple mean, we can use the built in functions to easily calculate it: sales["Current_Price"].mean() sales["New_Product_Price"].mean() In order to calculate a weighted average using the long approach:

(sales["Current_Price"] sales["Quantity"]).sum() / sales["Quantity"].sum() (sales["New_Product_Price"] sales["Quantity"]).sum() / sales["Quantity"].sum()

Grouping Data with the Weighted Average sales.groupby("Manager")["Current_Price"].mean()

def wavg(group, avg_name, weight_name): """ http://stackoverflow.com/questions/10951341/pandas-dataframe-aggregate-function-using-multiple-columns In rare instance, we may not have weights, so just return the mean. Customize this if your business case should return otherwise. """ d = group[avg_name] w = group[weight_name] try: return (d * w).sum() / w.sum() except ZeroDivisionError: return d.mean()