rebeccajohnson88 / PPOL564_slides_activities

Repo for Georgetown McCourt's School of Public Policy's Data Science I (PPOL 564)
Creative Commons Zero v1.0 Universal

1.4 #36

Closed sonali-sr closed 1 year ago

sonali-sr commented 1 year ago

1.4 Filter out duplicates from original debar data (6 points)

A. Using mult_debar_wide, add a column is_dup that takes value of True for cases where start_date_viol1 == start_date_viol2 marking the row as a duplicate
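
As a rough illustration of part A, the column can be built with a single element-wise comparison. This is only a sketch on made-up data: the employer names and dates are hypothetical, and the real mult_debar_wide may store its dates as strings or as datetimes.

```python
import pandas as pd

# Toy stand-in for mult_debar_wide (employer names and dates are made up)
mult_debar_wide = pd.DataFrame({
    "Name": ["Employer A", "Employer B", "Employer C"],
    "start_date_viol1": ["2014-01-01", "2015-06-01", "2016-03-15"],
    "start_date_viol2": ["2014-01-01", "2016-06-01", "2016-03-15"],
})

# Comparing two columns returns a boolean Series:
# True where the two start dates match, False otherwise
mult_debar_wide["is_dup"] = (
    mult_debar_wide["start_date_viol1"] == mult_debar_wide["start_date_viol2"]
)

print(mult_debar_wide[["Name", "is_dup"]])
```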


B. Going back to the original long-format data you loaded at the beginning (debar):

  • For employers where is_dup == True as indicated by your wide-format dataframe, only keep violnum == viol1
  • For all other employers (so is_dup == False and ones we didn't need to check duplicates for), keep all violnum
  • Remove the is_repeated column from the debar data

Hint: you can complete part B without a for loop; pd.concat with axis = 0 (row binding) is one way

Call the resulting dataframe debar_clean and print the shape and # of unique employer names

sanhatahir commented 1 year ago

I've dropped the debar_is_repeated column, and the displayed dataframe shows it as dropped, but the shape is still being reported as 97x6. Is it the drop function that's not working as I intend it to?

Code:
debar_clean.drop(columns = "debar_is_repeated")
debar_clean.shape


sonali-sr commented 1 year ago

Hi @sanhatahir - I think you just have to save the result, since drop returns a new dataframe rather than changing debar_clean in place. I would recommend trying:
debar_clean = debar_clean.drop(columns = "debar_is_repeated")
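
To make the difference concrete, here is a small sketch on a made-up two-row frame (the data is hypothetical; only the column name debar_is_repeated comes from the comment above). Calling drop without saving the result leaves the shape unchanged, which matches the symptom described.

```python
import pandas as pd

# Toy frame; values are made up for illustration
debar_clean = pd.DataFrame({
    "Name": ["Employer A", "Employer B"],
    "debar_is_repeated": [True, False],
})

# drop() returns a new DataFrame; the original is untouched unless reassigned.
# In a notebook, this returned copy is what gets displayed, so the column
# looks dropped even though debar_clean itself has not changed.
displayed_copy = debar_clean.drop(columns="debar_is_repeated")
shape_without_assignment = debar_clean.shape

# Assigning the result back makes the change stick
debar_clean = debar_clean.drop(columns="debar_is_repeated")
shape_with_assignment = debar_clean.shape

print(shape_without_assignment, shape_with_assignment)
```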

sanhatahir commented 1 year ago

Oh, you're completely right. I should get some sleep. Thanks so much!

Mag-Sul commented 1 year ago

I'm stuck on this question (part B). I don't really understand how/why to use pd.concat in this case, so I've been trying to use a for loop but keep getting an error. Any advice?

rebeccajohnson88 commented 1 year ago


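The pd.concat hint corresponds to an approach where we:

  • go back to the original debar data
  • use row filtering to separate that data into two dataframes: (1) debar_keepallviols and (2) debar_keepviol1. The first one is defined by having a name not in the list of duplicated names; the second one by having a name in the list of duplicated names, keeping only viol1. You can get dup_names just from dup.Name and don't necessarily need the list and values commands in your code. So you can do something like the following:

debar_keepviol1 = debar[(debar.Name.isin(dup_names)) & (debar.viol_num == "viol1")].copy()
debar_keepallviols = debar[~debar.Name.isin(dup_names)].copy()

Then row-bind the two dataframes with pd.concat (axis = 0), as in the hint.

The for loop is the correct intuition, but I'm not sure which part of the for loop is creating the error (the if part or the drop part). The row filtering above does what the for loop is aiming at more succinctly, so I'd try the row filtering approach and post follow-up questions as needed.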
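
Putting the filtering and the row binding together, a minimal sketch on made-up data might look like the following. The employer names, dates, and the contents of dup_names are hypothetical, and the column names (Name, viol_num, is_repeated) are assumptions taken from the thread rather than confirmed against the actual data.

```python
import pandas as pd

# Toy long-format data; all values are made up for illustration
debar = pd.DataFrame({
    "Name": ["Employer A", "Employer A", "Employer B", "Employer B", "Employer C"],
    "viol_num": ["viol1", "viol2", "viol1", "viol2", "viol1"],
    "start_date": ["2014-01-01", "2014-01-01", "2015-06-01", "2016-06-01", "2017-02-01"],
    "is_repeated": [True, True, True, True, False],
})

# Names flagged is_dup == True in the wide-format data (hypothetical here)
dup_names = pd.Series(["Employer A"])

# Duplicated employers: keep only their viol1 row
debar_keepviol1 = debar[debar.Name.isin(dup_names) & (debar.viol_num == "viol1")].copy()

# Everyone else: keep all of their rows
debar_keepallviols = debar[~debar.Name.isin(dup_names)].copy()

# Row-bind the two pieces and drop the helper column
debar_clean = pd.concat([debar_keepviol1, debar_keepallviols], axis=0)
debar_clean = debar_clean.drop(columns="is_repeated")

print(debar_clean.shape)
print(debar_clean.Name.nunique())
```

In this toy example only Employer A loses its viol2 row, so the result has 4 rows for 3 unique employers.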

sirro9 commented 1 year ago

For question 1.4, why would the result be 94 instead of 97? I just deleted all the viol2 rows.

Mag-Sul commented 1 year ago


The pd.concat hint corresponds to an approach where we:

  • go back to the original debar data
  • use row filtering to separate that data into two dataframes: (1) debar_keepallviols and (2) debar_keepviol1. The first one is defined by having a name not in the list of duplicated names; the second one by having a name in the list of duplicated names. You can get dup_names just from dup.Name and don't necessarily need the list and values commands in your code. So you can do something like the following:

debar_keepviol1 = debar[(debar.Name.isin(dup_names)) & (debar.viol_num == "viol1")].copy()
debar_keepallviols = debar[~debar.Name.isin(dup_names)].copy()

The for loop is the correct intuition, but I'm not sure which part of the for loop is creating the error (the if part or the drop part). The row filtering above does what the for loop is aiming at more succinctly, so I'd try the row filtering approach and post follow-up questions as needed.

Ah this makes sense and worked - thanks so much! When I tried concat initially, I was doing it as the first step and couldn't figure it out from there. Doing the subsetting of the dfs first makes much more sense.

rebeccajohnson88 commented 1 year ago

Question from student:

for question 1.3. is there a reason why i got a 94 instead of 97?


response:

There are 94 unique employers but 97 rows. For employers whose rows are not duplicated (so employers who do NOT have the same start and end date for the violation), you should retain both viol1 and viol2. Your code screenshot seems to remove viol2 for all employers, which is not correct.
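
The gap between the two counts can be reproduced on a small made-up example. The data and dup_names below are hypothetical, and the column names Name and viol_num are assumed from the thread. Removing viol2 for everyone leaves one row per employer, so the row count collapses to the number of unique employers; removing viol2 only for flagged duplicates keeps the extra rows. This sketch uses a single boolean mask, which is an alternative to the pd.concat route and not necessarily the intended solution.

```python
import pandas as pd

# Toy long-format data; all values are made up for illustration
debar = pd.DataFrame({
    "Name": ["Employer A", "Employer A", "Employer B", "Employer B", "Employer C"],
    "viol_num": ["viol1", "viol2", "viol1", "viol2", "viol1"],
})
dup_names = ["Employer A"]  # hypothetical list of duplicated employers

# Incorrect: removes viol2 for every employer
only_viol1 = debar[debar.viol_num == "viol1"]

# Intended: remove viol2 only for employers flagged as duplicates
is_dup_viol2 = debar.Name.isin(dup_names) & (debar.viol_num == "viol2")
debar_clean = debar[~is_dup_viol2]

print(len(only_viol1), len(debar_clean), debar_clean.Name.nunique())
```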