jmbejara / comp-econ-sp18

Main Course Repository for Computational Methods in Economics (Econ 21410, Spring 2018)
16 stars, 23 forks

Chipotle #22

Closed afgong closed 6 years ago

afgong commented 6 years ago

When we import the dataset, there are a few separators such as ',' and '\s+' that return errors in the code. However, although '|' works, the printed DataFrame is not very clean (see below). How does one determine the correct separator by looking at the raw data?

url = 'https://raw.githubusercontent.com/TheUpshot/chipotle/master/orders.tsv'
chipo = pd.read_csv(url, sep='|')
chipo

[Screenshot of the printed DataFrame: screen shot 2018-04-18 at 09 29 38]

jmbejara commented 6 years ago

If you put the URL https://raw.githubusercontent.com/TheUpshot/chipotle/master/orders.tsv in the browser, it will either open the file or prompt a download. If you download it, you can open it in a text editor. Also, in the image you provided, you can see what the correct separator should be. Still, I would recommend opening the file in a text editor or in the browser to see what it looks like.

Hope this helps!
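
One way to check the separator programmatically is sketched below. It is only an illustration: the `raw` string is a made-up sample laid out like a tab-separated file, not the real data, and the names `candidates`, `best_sep`, and `chipo_sample` are hypothetical.

```python
import io

import pandas as pd

# Made-up sample lines laid out like a tab-separated file (not the real data).
raw = (
    "order_id\tquantity\titem_name\titem_price\n"
    "1\t1\tChips and Fresh Tomato Salsa\t$2.39\n"
    "1\t1\tIzze\t$3.39\n"
)

first_line = raw.splitlines()[0]
print(repr(first_line))  # repr() shows invisible characters such as \t explicitly

# Count how often each candidate separator occurs in the header line.
candidates = ["\t", ",", "|"]
counts = {sep: first_line.count(sep) for sep in candidates}
best_sep = max(counts, key=counts.get)

# Parse the sample with the separator that occurred most often.
chipo_sample = pd.read_csv(io.StringIO(raw), sep=best_sep)
print(chipo_sample)
```

Counting candidates in one line is a rough heuristic. Looking at the raw text yourself, as suggested above, is still the most reliable check.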

afgong commented 6 years ago

Hi, another question! For #8, do you mean qualitatively "describe how the dataset is indexed"? For example, every index indicates the type and quantity of the menu item in a single order?

jmbejara commented 6 years ago

Yeah. I'm just looking for a plain-English description of the index.

jmbejara commented 6 years ago

Also, here is a reference for escape characters or "string literals" in Python: https://docs.python.org/2.0/ref/strings.html
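
As a small illustration of why escape characters matter for separators, the sketch below compares an escaped tab with a raw string. The variable names are hypothetical.

```python
tab = "\t"       # escape sequence: a single tab character
raw_tab = r"\t"  # raw string: a backslash followed by the letter t

print(len(tab), len(raw_tab))

# Splitting on the escaped tab separates tab-delimited fields.
fields = "order_id\tquantity".split(tab)
print(fields)
```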

afgong commented 6 years ago

To convert the item_price to a float, could one way be chopping off the $ for each price when it's a string, and then converting the string into a float?

jmbejara commented 6 years ago

Yep! That's right. There are different ways to manipulate the string, too. You might slice the strings, or you might use the string method called split.
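
A minimal sketch of both approaches, assuming the prices are strings of the form "$2.39". The `prices` Series holds made-up values, and `by_slice` and `by_split` are hypothetical names.

```python
import pandas as pd

# Made-up prices in the same "$x.xx" string form.
prices = pd.Series(["$2.39", "$3.39", "$16.98"])

# Option 1: slice off the first character of every string.
by_slice = prices.str[1:].astype(float)

# Option 2: split on "$" and keep the piece after it.
by_split = prices.str.split("$").str[1].astype(float)

print(by_slice.tolist())
print(by_split.tolist())
```

If the real strings carry extra whitespace, adding `.str.strip()` before the conversion may be a sensible precaution.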

ethanmetzger commented 6 years ago

For question 22 on the Chipotle data, do you want us to find the quantity associated with the most expensive line on each receipt, or just the quantity associated with the most expensive line out of all the receipts?

afgong commented 6 years ago

In addition, are we permanently converting the item_price column to a float?

c = chipo.item_price.str.split("$").apply(''.join)
d = pd.to_numeric(c)
d

The code will print out the item_price column as a float, but I don't think it modifies the original DataFrame.

isabelalmazan commented 6 years ago

Hi, adding to the Chipotle question chain. Can you clarify the difference between what questions 5, 9, and 13 are asking for? (Right now they all seem to be asking for the number of orders/rows.) Thanks!

jmbejara commented 6 years ago

@ethanmetzger On question 22, I'm looking for the most expensive line item out of all line items. No grouping by receipts for this question. Just a single number out of the whole data set.
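
One possible reading of "a single number out of the whole data set" is sketched below. It assumes item_price has already been converted to a float. The sample rows are made up, and `top_row` and `top_quantity` are hypothetical names.

```python
import pandas as pd

# Made-up line items; item_price is assumed to be numeric already.
chipo_sample = pd.DataFrame({
    "order_id": [1, 1, 2],
    "quantity": [1, 2, 15],
    "item_price": [2.39, 16.98, 44.25],
})

# Index label of the single most expensive line, with no grouping by order.
top_row = chipo_sample["item_price"].idxmax()

# Quantity recorded on that line.
top_quantity = chipo_sample.loc[top_row, "quantity"]
print(top_quantity)
```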

jmbejara commented 6 years ago

@afgong I replaced the column with my new float column, because the data is not useful to me as a string. You can create a new column if you like--it might be useful to create the new column so that you can compare and make sure it converted correctly. I just went ahead and replaced the column, though.
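
A sketch of the difference between computing the float values and storing them, using made-up rows. The column name `item_price_float` and the variable `as_float` are hypothetical.

```python
import pandas as pd

chipo_sample = pd.DataFrame({"quantity": [1, 2], "item_price": ["$2.39", "$16.98"]})

# This builds a new Series; chipo_sample itself is not modified yet.
as_float = pd.to_numeric(chipo_sample["item_price"].str.replace("$", "", regex=False))

# Keep both versions side by side to check the conversion...
chipo_sample["item_price_float"] = as_float

# ...or overwrite the original column once it looks right.
chipo_sample["item_price"] = as_float
print(chipo_sample.dtypes)
```

Only the assignments change the DataFrame. Evaluating the expression on its own just displays the result.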

jmbejara commented 6 years ago

@isabelalmazan Each is at a different level of grouping. The data is from GrubHub. Every time somebody places an order, they get a unique order ID and a receipt for that order. Imagine what those receipts look like. This will help you tell the difference.
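
To make the levels of grouping concrete, here is a sketch on made-up rows showing three different counts. Which count each question wants is not stated here, so the mapping is left open. The variable names are hypothetical.

```python
import pandas as pd

# Made-up receipts: two orders, five printed lines in total.
chipo_sample = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 2],
    "quantity": [1, 1, 2, 1, 1],
})

n_line_items = len(chipo_sample)               # rows, i.e. lines across all receipts
n_orders = chipo_sample["order_id"].nunique()  # distinct receipts
n_items = chipo_sample["quantity"].sum()       # individual items across all lines

print(n_line_items, n_orders, n_items)
```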

afgong commented 6 years ago

On question 14, my code to filter out '6 Pack Soft Drink' is below. But, I am getting TypeError: Indexing a Series with DataFrame is not supported, use the appropriate DataFrame column. Is this because chipo['item_name'][x] isn't the proper syntax to return a Boolean value? When I write chipo['item_name'][1], for example, I get 'Izze'.

def filter_func(x): return chipo['item_name'][x] == '6 Pack Soft Drink'

chipo.groupby('item_name').filter(filter_func)

jmbejara commented 6 years ago

filter would not be the right method to use here. As the docs say, "this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index." https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.filter.html

Instead, you might use regular slicing combined with a boolean mask. Also, you could try out the query method: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html You can read about it here: https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html
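
A minimal sketch of both suggestions on made-up rows. The names `mask`, `by_mask`, and `by_query` are hypothetical.

```python
import pandas as pd

chipo_sample = pd.DataFrame({
    "item_name": ["Izze", "6 Pack Soft Drink", "Chicken Bowl", "6 Pack Soft Drink"],
    "quantity": [1, 1, 2, 1],
})

# Boolean mask: True for rows whose item_name matches, False otherwise.
mask = chipo_sample["item_name"] == "6 Pack Soft Drink"
by_mask = chipo_sample[mask]

# The same selection written as a query string.
by_query = chipo_sample.query("item_name == '6 Pack Soft Drink'")

print(by_mask)
```

Both select the same rows. Which one reads better is largely a matter of taste.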

BTW, check out the following formatting. You can press the "edit" button to see the associated markdown.

def filter_func(x):
    return chipo['item_name'][x] == '6 Pack Soft Drink'

chipo.groupby('item_name').filter(filter_func)

afgong commented 6 years ago

In the question 20 description, do you mean _ave_unitprice? Are we trying to sort the DataFrame from lowest average unit price to highest average unit price? I wanted to clarify because question 21 then asks us to display the average item prices!

jmbejara commented 6 years ago

I'm sorry. This is my mistake. I wrote item price in some places where I meant unit prices. I have fixed this in this commit: df92714730092d345a533655d5a2d451e6f5154c
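
For readers unsure of the distinction, here is a sketch under one assumption: that unit price means item_price divided by quantity. The sample rows are made up, and the names `unit_price` and `ave_unit_price` are hypothetical.

```python
import pandas as pd

# Made-up line items; item_price is assumed to be numeric already.
chipo_sample = pd.DataFrame({
    "item_name": ["Izze", "Izze", "Chicken Bowl", "Chicken Bowl"],
    "quantity": [1, 2, 1, 2],
    "item_price": [3.39, 6.78, 8.75, 17.50],
})

# Assumption: item_price is the total for the line, so divide by quantity.
chipo_sample["unit_price"] = chipo_sample["item_price"] / chipo_sample["quantity"]

# Average unit price per item, sorted from lowest to highest.
ave_unit_price = chipo_sample.groupby("item_name")["unit_price"].mean().sort_values()
print(ave_unit_price)
```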